Why CEA has stopped using Net Promoter Score

By Ben_West🔸 @ 2022-02-03T15:43 (+85)

Net Promoter Score (NPS) is a widely used measure of consumer satisfaction. It asks “How likely is it that you would recommend [brand] to a friend or colleague?”, and the response is (usually) a number between 0 and 10. However, the aggregate score is not an average but a complex nonlinear function of the results. CEA has moved away from this function in favor of simply taking the arithmetic mean. Briefly, this is because the NPS research doesn’t replicate, NPS is not empirically supported, it requires larger sample sizes, and it violates survey best practices.
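For concreteness, here is a minimal sketch of the two aggregation methods. It assumes the standard NPS definition (percentage of 9-10 "promoters" minus percentage of 0-6 "detractors"), which the post itself does not spell out. The function names and the sample responses are made up for illustration.

```python
def net_promoter_score(responses):
    """Standard NPS (assumed definition): % promoters (9-10) minus % detractors (0-6)."""
    promoters = sum(1 for r in responses if r >= 9)
    detractors = sum(1 for r in responses if r <= 6)
    return 100 * (promoters - detractors) / len(responses)


def likelihood_to_recommend(responses):
    """Plain arithmetic mean of the same 0-10 responses."""
    return sum(responses) / len(responses)


# Hypothetical responses: both groups have two detractors and two passives,
# so NPS cannot tell them apart even though the raw answers differ a lot.
group_a = [6, 6, 8, 8]
group_b = [0, 0, 7, 7]

for name, group in (("A", group_a), ("B", group_b)):
    print(name, net_promoter_score(group), likelihood_to_recommend(group))
```

The two groups get the same NPS but clearly different means, which is the kind of information the thresholding throws away.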

Summary

NPS is widely used, but the research has failed to replicate, even when the replication was using the originally published data set (!).
Measures of satisfaction are more predictive than NPS of outcomes such as firm growth and whether the respondent actually recommends the product to others.
The American Customer Satisfaction Index is an alternative which has stronger empirical grounding, as well as a huge number of publicly available benchmarks. It uses 3 questions, on a 10 point scale, whose scores are averaged and normalized to a 0-100 scale:^[1]
1. What is your overall satisfaction with X?
2. To what extent has X met your expectations?
3. How well did X compare with the ideal (type of offering)?
CEA mostly still asks the NPS question, but switched to taking the arithmetic mean of the results. We call this the “likelihood to recommend” (LTR).^[2]
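As a rough sketch of the ACSI-style scoring described in the summary above: the code below assumes equal weights for the three questions (the official weighting is proprietary, see footnote 1) and assumes that "normalized" means linearly rescaling the 1-10 average so that 1 maps to 0 and 10 maps to 100. The function and parameter names are hypothetical.

```python
def acsi_style_score(satisfaction, expectations, ideal):
    """Equal-weight ACSI-style score on a 0-100 scale.

    Each argument is a list of 1-10 responses to one of the three questions.
    Equal weighting and min-max rescaling are assumptions, not the official method.
    """
    item_means = [sum(item) / len(item) for item in (satisfaction, expectations, ideal)]
    overall_mean = sum(item_means) / len(item_means)
    # Rescale from the 1-10 response range to 0-100.
    return (overall_mean - 1) / 9 * 100


# Hypothetical responses from four people.
print(acsi_style_score([8, 9, 7, 10], [7, 8, 8, 9], [6, 8, 7, 9]))
```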

More information

NPS was introduced in 2003 with the claim that it was the best predictor of growth across a data set of companies. This data set was small and subject to p-hacking. The raw data has not been published (including, ironically, the pieces the author says should always be published when reporting NPS scores). The original research methodology was:

“We then obtained a purchase history for each person surveyed and asked those people to name specific instances in which they had referred someone else to the company in question… The data allowed us to determine which survey questions had the strongest statistical correlation with repeat purchases or referrals….
One question was best for most industries. “How likely is it that you would recommend [company X] to a friend or colleague?” ranked first or second in 11 of the 14 cases studies”^[3]

Replication attempts (including ones which reverse engineered the original data set from published scatterplots) have failed to find significant predictive value from NPS. A wide variety of alternative statistical methods exist, some of which have stronger empirical grounding.
Notably, NPS is worse than satisfaction measures at predicting whether the respondent will actually recommend the product.
Replication attempts find alternate definitions of the NPS scale to be more predictive than the commonly used one, even if the question is kept the same (e.g. using a 7 point scale).
The weird way NPS is calculated means that it requires substantially larger sample sizes.
The NPS question disagrees with commonly accepted best practices in survey design (e.g. using an 11-point scale instead of a 5-point one).
There doesn’t seem to be any particular reason to think that NPS is good, apart from it being widely used.
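To illustrate the sample-size point above, here is a rough sketch comparing how strongly the mean and NPS separate two populations. Both response distributions are invented, and the result depends entirely on them. The comparison uses a standardized effect size (difference in expected score divided by the pooled per-respondent standard deviation), on the usual approximation that the required sample size scales with one over the effect size squared.

```python
from math import sqrt

SCORES = range(11)  # the 0-10 response scale


def identity(x):
    """Per-respondent value used by the plain mean."""
    return x


def nps_value(x):
    """Per-respondent value used by NPS: +1 promoter, 0 passive, -1 detractor."""
    return 1 if x >= 9 else (-1 if x <= 6 else 0)


def moments(weights, value):
    """Mean and variance of value(score) under a distribution given as weights."""
    total = sum(weights)
    mean = sum(w * value(x) for x, w in zip(SCORES, weights)) / total
    var = sum(w * (value(x) - mean) ** 2 for x, w in zip(SCORES, weights)) / total
    return mean, var


def effect_size(weights_a, weights_b, value):
    """Standardized difference between two populations for a given scoring rule."""
    mean_a, var_a = moments(weights_a, value)
    mean_b, var_b = moments(weights_b, value)
    return (mean_b - mean_a) / sqrt((var_a + var_b) / 2)


# Hypothetical response distributions (weights for scores 0..10); B is slightly happier.
POP_A = [2, 2, 2, 3, 4, 6, 8, 15, 20, 20, 18]
POP_B = [1, 1, 2, 2, 3, 5, 8, 15, 21, 22, 20]

d_mean = effect_size(POP_A, POP_B, identity)
d_nps = effect_size(POP_A, POP_B, nps_value)

# Required n scales roughly with 1 / d**2, so this ratio approximates how many
# times more respondents NPS needs to detect the same underlying shift.
sample_size_ratio = (d_mean / d_nps) ** 2
print(d_mean, d_nps, sample_size_ratio)
```

For these made-up distributions the ratio comes out above 1, meaning NPS would need more respondents than the mean. Other distributions could give a different answer, so this is an illustration rather than evidence.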
So if it’s so terrible, why does everyone use it? This Wall Street Journal article implies that it is used precisely because it’s so easy to manipulate: “Out of all the mentions the Journal tracked on earnings calls, no executive has ever said the score declined.”^[4]

^[1] (Note: different sources seem to use slightly different wording and I’m not sure what the “official” wording is because it’s proprietary. Also, the official version uses a proprietary weighting of these three questions but people online seem to think the weights are approximately equal.)
^[2] We usually do this because we don’t want to take people’s time up by asking three questions. I haven’t done a very rigorous analysis of the trade-offs here though, and it could be that we are making a mistake and should use ACSI instead.
^[3] “ranked first or second in 11 of the 14 cases studies” should already be setting off alarm bells.
^[4] Of course, this doesn’t explain why investors allow executives to tie their compensation to easily hackable metrics.

David_Moss @ 2022-02-03T19:30 (+11)

Cool! Glad to see this, I've been harping on about the NPS for some time (1, 2, 3, 4).

> We usually do this because we don’t want to take people’s time up by asking three questions. I haven’t done a very rigorous analysis of the trade-offs here though, and it could be that we are making a mistake and should use ACSI instead.

As you may have considered, you could ask just one of the ACSI items, rather than asking the one NPS item. This would have lower reliability than asking all three ACSI items, but I suspect that one ACSI item would have higher validity than the one NPS item. (This is particularly the case when trying to elicit general satisfaction with the EA community, but maybe less so if you literally want to know whether people are likely to recommend an event to their friends).

The added value of using three items to generate a composite measure is potentially pretty straightforward to estimate, esp if you have prior data with the items. Happy to talk more about this.

Ben_West @ 2022-02-03T19:58 (+3)

Thanks David! If you have references or could say more about the virtues of asking one ACSI question versus the NPS question, I would love to read/hear them.

David_Moss @ 2022-02-09T17:58 (+13)

Hi Ben.

There are two broad reasons why I would prefer the ACSI items (considered individually) over the NPS (style) item:

The ACSI items are (mostly) more face valid
The ACSI items generally performed better than the NPS when we ran both of these in the EAS 2020

Face validity

This depends on what you are trying to measure, so I’ll start with the context in the EAS, where (as I understand it) we are trying to measure general satisfaction with or evaluation of the EA community.

Here, I think the ACSI items we used (“How well does the EA community compare to your ideal? [(1) Not very close to the ideal - (10) Very close to the ideal]” and “What is your overall satisfaction with the EA community? [(1) Very dissatisfied - (10) Very satisfied]”) more closely and cleanly reflect the construct of interest.

In contrast, I think the NPS style item (“If you had a friend who you thought would agree with the core principles of EA, how excited would you be to introduce them to the EA community?”) does not very clearly or cleanly reflect general satisfaction. Rather, we should expect it to be confounded with:

Attitudes about introducing people to the EA community (different people have different views about how positive growing the EA community more broadly is)
Perceived/projected personal “excitement” (related to one’s (perceived) emotionality, excitability etc.)
Sociability/extraversion/interest in introducing friends to things in general, as well as one’s own level of social engagement with EA (if one is socially embedded in EA, introducing friends might make more sense than if you are very pro EA, but your interaction with it is entirely non-social)

I think some of these issues are due to the general inferiority of the NPS as a measure of what it’s supposed to be measuring.

And some of them are due to the peculiarities of the context where we’re using NPS (generally used to measure satisfaction with a consumer product) to measure attitudes towards a social movement one is a part of (hence the need to add the caveat about “a friend who you thought would agree with the core principles of EA”).

Some of the other contexts where you’re using NPS might differ. Likelihood to recommend may make more sense when you’re trying to measure evaluations of an event someone attended. But note that the ‘NPS’ question may simply be measuring qualitatively different things when used in these different contexts, despite the same instrument being presented. That is, asking about recommending the EA community as a whole elicits judgments about whether it’s good to recommend EA to people (does spreading EA seem impactful or harmful etc?), whereas asking about recommending an event someone attended mostly just reflects positive evaluation of the event. Still, I slightly prefer a simple ACSI satisfaction measure over NPS style items, since I think it will be clearer, as well as more consistent across contexts.

Performance of measures

Since we included both the NPS item and two ACSI items in EAS 2020 we can say a little about how they performed, although with only 1-2 items and not much to compare them to, there’s not a huge amount we can do to evaluate them.

Still, the general impression I got from the performance of the items last year confirms my view that the two ACSI measures cohere as a clean measure of satisfaction, while NPS and the other items are more of a mess. As noted, we see that the two ACSI measures are closely correlated with each other (presumably measuring satisfaction), while the NPS measure is moderately correlated with the ‘bespoke’ measures (e.g. “I feel that I am part of the EA community”) which seem to be (noisily) measuring engagement more than satisfaction or positive evaluation. I think it’s ultimately unclear what any of those three items are measuring, since they’re all just imperfectly correlated with each other, with engagement, and with satisfaction. So I think they are measuring a mix of things, some of which are unknown. Theoretically, one could simply run a larger suite of items, designed to measure satisfaction, engagement, and other things which we think might be related (such as what the bespoke measures are intended to measure) and tease out what the measures are tracking. But there’s not a huge amount we can do with just 5-6 items and 2-3 apparent factors they are measuring.

Benefits of multiple measures

As an aside, we put together some illustrations of the possible concrete benefits of using a composite measure of multiple items, rather than a single measure.

The plot below shows the error (differences between the measured value and the true value: higher values, in absolute terms, are worse) with a single item vs an average made from two or three items. Naturally, this depends on assumptions about how noisy each item is and how correlated each of the items are, but it is generally the case that using multiple items helps to reduce error and ensure that estimates come closer to the true value.
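A minimal sketch of why averaging items reduces error, under simple assumptions that are not taken from the plot itself: each item equals the true value plus noise, every item has the same noise standard deviation, and the noise in any two items has the same correlation. The function name and the example numbers are hypothetical.

```python
from math import sqrt


def composite_error_sd(n_items, item_error_sd, error_correlation=0.0):
    """SD of the error in the average of n_items equally noisy items.

    Var(mean error) = sd^2 * (1 + (n_items - 1) * rho) / n_items,
    where rho is the correlation between the errors of any two items.
    """
    variance = item_error_sd ** 2 * (1 + (n_items - 1) * error_correlation) / n_items
    return sqrt(variance)


# Hypothetical noise level and error correlation.
for k in (1, 2, 3):
    print(k, composite_error_sd(k, item_error_sd=1.0, error_correlation=0.2))
```

With uncorrelated errors the error shrinks like one over the square root of the number of items. With perfectly correlated errors, averaging does not help at all.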

This next image shows the power to detect a correlation of around r = 0.3 using 1, 2 or 3 items. The composite of more items should have lower measurement error. When only a single item is used, the higher measurement error means that a true relationship between the measured variable and another variable of interest can be harder to detect. With the average of 2 or 3 items, the measure is less noisy, and so the same underlying effect can be detected more easily (i.e., with fewer participants). (The three different images just show different standards for significance)
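The same point can be sketched numerically. This is not the analysis behind the images, and it rests on several assumptions: the Spearman-Brown formula for the reliability of a composite, the classical attenuation formula (observed r = true r × √reliability, with the other variable measured without error), and the Fisher z approximation for the sample size needed to detect a correlation. The single-item reliability of 0.5 is a made-up value, and r = 0.3 is treated here as the true correlation.

```python
from math import atanh, ceil, sqrt
from statistics import NormalDist


def composite_reliability(single_item_reliability, n_items):
    """Spearman-Brown: reliability of an average of n_items parallel items."""
    r = single_item_reliability
    return n_items * r / (1 + (n_items - 1) * r)


def n_needed(true_r, reliability, alpha=0.05, power=0.8):
    """Approximate n to detect a correlation (two-sided test, Fisher z approximation)."""
    observed_r = true_r * sqrt(reliability)  # attenuation due to measurement error
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(((z_alpha + z_beta) / atanh(observed_r)) ** 2 + 3)


# Hypothetical single-item reliability of 0.5, true correlation of 0.3.
for k in (1, 2, 3):
    reliability = composite_reliability(0.5, k)
    print(k, reliability, n_needed(0.3, reliability))
```

Under these assumptions the required number of participants falls as items are added, which is the pattern the power plots describe.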

vaidehi_agarwalla @ 2022-02-19T18:15 (+7)

I just wanted to say that I always appreciate your in-depth responses David! They are always really easy to follow and informative :)

AlasdairGives @ 2022-02-04T12:35 (+3)

I'd also be interested in this!

Jamie_Harris @ 2022-10-16T10:43 (+5)

Hello, since I saw this post, I switched a couple of things to using ACSI. I always thought NPS seemed pretty bad, and mostly only included it for comparison with groups like CEA who were using it.

Do you have any data you're able to share publicly yet?

Additionally:

> The American Customer Satisfaction Index is an alternative which has stronger empirical grounding, as well as a huge number of publicly available benchmarks. It uses 3 questions, on a 10 point scale, whose scores are averaged and normalized to a 0-100 scale:[1]

How exactly are you calculating it? The Wikipedia formula seems wrong to me, unless I'm misunderstanding it.

(I have 9 answers for each of the three questions. The average responses are 9.4, 9.6, and 9.3. So I think what I'm supposed to do is =((9.4*1+9.6*1+9.3*1)-1)/9*100 . This gives me "303.7037037" which clearly seems wrong.)

My interpretation of what it should be:

=(((9.4+9.6+9.3)-3)/27)*100

Which equals 93.8. The simpler but slightly less accurate =((9.4+9.6+9.3)/3)*10 comes out similarly, at 94.4.
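Written out, the interpretation above amounts to the following. It assumes equal weights for the three questions and a linear rescaling of the 1-10 range to 0-100; the symbols $x_1, x_2, x_3$ (the three question means) and $\bar{x}$ (their average) are introduced here for illustration.

```latex
\[
\text{score} = \frac{\bar{x} - 1}{9} \times 100,
\qquad
\bar{x} = \frac{x_1 + x_2 + x_3}{3}
\quad\Longrightarrow\quad
\text{score} = \frac{(x_1 + x_2 + x_3) - 3}{27} \times 100 .
\]
\[
10\,\bar{x} \;-\; \frac{\bar{x} - 1}{9} \times 100 \;=\; \frac{10\,(10 - \bar{x})}{9}
\]
```

The second line is the gap between the simpler $10\,\bar{x}$ version and the rescaled score: zero when every answer is 10, and 10 points when every answer is 1, because $10\,\bar{x}$ maps the bottom of the scale to 10 rather than 0. Plugging in the rounded means quoted above (sum 28.3) gives about 93.7 and 94.3, so the 93.8 and 94.4 in the comment presumably come from unrounded means.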

Which seems very good. E.g. "Full-Service Restaurants", "Financial Advisors", and "Online News and Opinion" all seem to hover around 70-80, while government services range a bit more widely from 60 to 90.

(Caveat that I didn't realise that you were supposed to include labels on 1 and 10 for each of the questions until I checked the Wikipedia entry just now to calculate it, and I'm not sure how this would affect the results. The labels seem pretty weird to me, so I suspect it does affect it somehow.)

Thanks!

Manuel_Allgaier @ 2022-04-11T12:37 (+1)

Appreciate this update!

> NPS [...] violates survey best practices.

Agree. For our EA retreats in Germany, we've also always just used the mean. I'm surprised that NPS is so widely used in industry.
