The Comparability of Subjective Scales

By MichaelPlant @ 2020-11-30T16:47 (+79)

This post summarises my new working paper, A Happy Possibility About Happiness (And Other Subjective) Scales: An Investigation and Tentative Defence of the Cardinality Thesis. It is a cross-post from the HLI website.

TL;DR Can we compare individuals' self-reported data, or is (say) one person's 5/10 the same as another person's 8/10? While this is 'the elephant in the room' for happiness research, I was surprised to learn very little has been written about it; so little, in fact, that I couldn't even find a clear statement of what the problem is supposed to be. I explain what the problem is, propose a new theoretical justification (based on 'Schelling points') for why we should expect self-reported data to be comparable, suggest the (thin-ish) evidence base supports this theory, then set out directions for future work.

Below is a 1,000-word summary of the 12,000-word paper.

Summary

It is very common for people to give numerical ratings of their subjective experiences. For instance, people often score their happiness, life satisfaction, job satisfaction, health, pain, the movies they watch, and so on, on a 0-10 scale.

A long-standing worry about these self-reports is whether the numbers represent the same thing to different people at different times. For example, if two people say they are 5/10 happy, can we assume they are as happy as each other? More technically, the basic (but not only) question is whether subjective scales are cardinally comparable: does a one-point change, on a given scale, represent the same size change?

If the scales are merely ordinal - that is, the numbers represent a ranking but contain no information on the relative magnitudes of differences - we are in trouble: it would not be possible to use subjective scales to say what would increase overall happiness.
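To see why mere ordinality would be a problem, here is a minimal sketch in Python. The two groups, their answers, and the two candidate mappings from reported score to 'true' happiness are all invented for illustration; none of it comes from the paper. Both mappings preserve the ranking of the answers, which is all an ordinal scale guarantees, yet they disagree about which group is happier on average.

```python
from statistics import mean

# Hypothetical reported scores (0-10 scale) for two groups; invented for illustration.
group_a = [5, 5, 5, 5]
group_b = [2, 2, 8, 8]

# Two order-preserving guesses at how a reported score maps to "true" happiness.
def convex(score):
    return score ** 2      # steps near the top are worth more

def concave(score):
    return score ** 0.5    # steps near the top are worth less

def happier_group(mapping):
    """Return which group has the higher average under the given mapping."""
    a = mean([mapping(s) for s in group_a])
    b = mean([mapping(s) for s in group_b])
    return "A" if a > b else "B"

for name, mapping in [("convex", convex), ("concave", concave)]:
    print(name, "mapping: group", happier_group(mapping), "is happier on average")
```

If only the ordering of the numbers is meaningful, nothing in the data tells us which mapping to prefer, so the comparison of averages is undetermined.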

Opinions about the Cardinality Thesis - the idea that subjective scales are cardinal - seem to divide on disciplinary lines: economists are sceptical of it, psychologists less so (Ferrer‐i‐Carbonell and Frijters, 2004). Despite this division, and the fact that this is a foundational methodological issue, it seems to have attracted little attention in the literature. Researchers will tend to assume the scales are ordinal or cardinal, and so use different statistical tests, without defending their decision (Kristoffersen, 2011). This lack of scrutiny is likely explained by a combination of two things. First, the topic is too applied for philosophy and too theoretical for social science, so it falls into an interdisciplinary 'no man's land'. Second, researchers assume any differences in scale use will 'wash out' as noise in large samples anyway, and so this problem can be ignored.

The result is that, at present, there is insufficient conceptual clarity around the Cardinality Thesis to know whether or not there is a problem or what could be done about it. Quoting Stone and Krueger (2018, p189):

In order to have more concrete ideas about the extent to which this may be a problem, we should have a better idea of why such differences [in scale interpretation] might exist in the first place, and have some theoretical justification for a concern with systematic differences in how subjective well-being questions are interpreted and answered.

Against this background, this paper makes four main contributions.

First, it proposes a novel theoretical explanation for how people interpret scales that draws on philosophy of language and game theory. In brief: conversation is a cooperative endeavour governed by various maxims (Grice, 1989). As subjective scales are vague and individuals want to be understood, scale interpretation should be understood as a search for a ‘focal point’ (or ‘Schelling point’), a default solution chosen in the absence of communication (Schelling, 1960). A specific focal point is proposed: when given an undefined scale with a finite number of options, individuals (unconsciously) use the end-points to refer to the realistic limits of that quantity, e.g. 10/10 is the maximum happiness anyone feels. Further, they interpret the scale as linear, so each point represents the same change in quantity. If this hypothesis is correct, self-reports will be cardinally comparable (given one further assumption, phenomenal cardinality, defined below).

Second, the paper then states four conditions that are individually necessary and jointly sufficient for cardinal comparability to hold on average. Stated roughly, these are:

C1: phenomenal cardinality (the underlying subjective state, e.g. happiness, is felt in units)
C2: linearity (each reported unit change represents the same change in magnitude)
C3: intertemporality (each individual uses the scale the same way over time and the scale end-points represent the real limits)
C4: interpersonality (different individuals use the scale the same way and the scale end-points represent the real limits).

What should we conclude about the cardinal comparability of subjective data if the conditions fail? It depends on which condition(s) fail. The first condition is binary and fundamental: each subjective phenomenon is either felt in units or it is not. Of course, a non-cardinal phenomenon cannot be measured on a cardinal scale. However, all the other conditions can fail by degree, and what’s important is by how much they deviate. By analogy, C1 is about whether we can have a measuring stick at all; C2 concerns whether the measuring sticks are bent; C3 concerns whether the length of each stick changes over time; and C4 is whether different people have the same length sticks. It makes a difference if our measuring sticks are slightly bent or very crooked. The cardinality thesis could fail to be exactly true, but nevertheless be approximately true, such that it is unproblematic to treat it as true.
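The 'slightly bent or very crooked' point about C2 can be made concrete with a small Python sketch. The power-law reporting curve and its `bend` parameter are assumptions chosen purely for illustration; the paper does not propose this functional form. The sketch asks how much bigger, in true terms, a one-point step near the top of the scale is than a one-point step near the bottom.

```python
def true_level(report, bend):
    """Hypothetical reporting curve on a 0-10 scale.

    bend == 1.0 is a perfectly straight 'measuring stick' (linearity holds);
    larger values bend it so that steps near the top cover more true change.
    """
    return 10 * (report / 10) ** bend

def step_ratio(bend):
    """True size of the 8->9 step relative to the true size of the 1->2 step."""
    high = true_level(9, bend) - true_level(8, bend)
    low = true_level(2, bend) - true_level(1, bend)
    return high / low

for bend in (1.0, 1.1, 2.0):
    print(f"bend={bend}: an 8->9 step is {step_ratio(bend):.2f}x a 1->2 step")
```

With a straight stick the ratio is 1; a mild bend moves it only a little, while a strong bend makes treating each point as equal badly misleading. How bent real scales are is an empirical question.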

Third, the paper notes we can use evidence and reasoning to assess whether, and to what extent, the conditions hold, even though they concern subjective states. While it is always true that more than one hypothesis will fit the facts, that does not mean all hypotheses are equally likely. Here, as elsewhere, we rely on inference to the best explanation. It then goes on to examine each condition in turn, primarily drawing on the subjective well-being literature. In each case, there is evidence indicating the condition does hold and no strong evidence suggesting it does not. As such, the tentative conclusion of the paper is that subjective scales are best understood as cardinally comparable, unless and until other evidence suggests otherwise.

Fourth, it sets out some testable predictions of the theory and explains how such tests could be used to ‘correct’ the data if people do not intuitively interpret subjective scales in the way hypothesised.

The conclusion of this paper is therefore optimistic. Not only does there not seem to be a problem where we feared there might be one, but we may well be able to fix the problem if we later discover it does exist.

This research was produced by the Happier Lives Institute.

If you like our work, please consider subscribing to our newsletter.

You can also follow us on Facebook, Twitter, and LinkedIn.

MichaelPlant @ 2020-11-30T16:31 (+14)

Just to flag, this topic has been the subject of three forum posts in the last 6 months. This paper addresses the concerns raised there.

Milan Griffes asks whether SWB scales might shift over time (intertemporal cardinality) and Fin Moorhouse shared his dissertation on the same topic.

Aidan Goth, in a post which commented on a forum post by the Happier Lives Institute ("Using Subjective Well-Being to Estimate the Moral Weights of Averting Deaths and Reducing Poverty"), wonders whether subjective scales are comparable across people (interpersonal cardinality).

In this paper, I argue the scales are likely to be cardinally comparable both over time and across people. This is something of a bold claim to make and, if true, is pretty important, because it means we can basically interpret subjective data at face value, rather than worrying about having to make fancy adjustments based on e.g. the nationality of the respondents.

MichaelStJules @ 2020-11-30T21:57 (+11)

It seems to me that a full defense of cardinality and comparability across humans should mention neuroscience, too. For example, we know that brain sizes differ (in total and in regions involved in hedonic experiences) in certain systematic ways, e.g. across ages and between genders. However, these differences are mostly small (brain size is pretty stable after adolescence, although there are still major changes up until 25-30 years old). We might also assume that differences in intensity of experience scale at most roughly 1:1 with size/connectivity and number of neurons firing (in the relevant regions). I think this is more likely to be true than not, but I'm still not confident in such an assumption.

However, we might also expect some brains to just be more sensitive than others, without expressed behaviour and reports matching this. For example, if two people say they're having 7/10 pains (to an experience with the same physical intensity, e.g. touching something very cold, at the same temperature), but one person's brain regions involved in the (negative) affective experience of pain are far more active, then I would guess that person is having a more negative experience. It would be worth checking research on this. I guess this might be relevant, although it apparently doesn't track the affective component of pain.

MichaelPlant @ 2020-12-01T10:46 (+5)

Ah, that's a nice point. I discuss this in section 5.5 of the paper. Quote:

The final condition is whether different individuals use the same endpoints at a time. There are two types of concern here.
The first is whether there are what Nozick (1974, 41) called ‘utility monsters’, individuals who can and do experience much greater magnitudes of happiness (or any other sort of subjective state), than others.
I won’t dwell on this as it seems unlikely there would be substantial differences in humans’ capacities for subjective experiences. Presumably there are evolutionary pressures for each species to have a range of sensitivity that is optimal for survival. To return to an example noted earlier, being immune to pain is an extremely problematic condition that would put someone at an evolutionary disadvantage. Further, even if there are differences, we would expect these to be randomly distributed, in which case they would wash out in large samples.

So to generate a serious worry that there's a problem at the level of group averages (which is the level that matters for most decision-making) you'd have to argue for and explain the existence of non-trivial differences between groups. It's tricky to think of real-life cases other than people who have genetic conditions. But this wouldn't motivate us thinking, say, members of two nations have different capacities.
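A toy simulation of the 'wash out' point, as a sketch rather than evidence. Everything in it is made up for illustration: a true happiness level of 6 in both groups, normally distributed personal quirks in scale use, and a one-point shift shared by one group. For simplicity, reports are not clipped to the 0-10 range.

```python
import random
from statistics import mean

def mean_report(true_level, n, group_shift, rng):
    """Average report when each person adds a random personal quirk
    plus a shift shared by their whole group (both hypothetical)."""
    return mean([true_level + rng.gauss(0, 1.5) + group_shift for _ in range(n)])

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 20_000

# Both groups have the same true happiness; only scale use differs.
random_only = mean_report(6.0, n, 0.0, rng) - mean_report(6.0, n, 0.0, rng)
systematic = mean_report(6.0, n, 1.0, rng) - mean_report(6.0, n, 0.0, rng)

print(f"gap between groups, random quirks only: {random_only:+.2f}")
print(f"gap between groups, shared group shift: {systematic:+.2f}")
```

Random individual differences nearly cancel in the group averages, whereas a shift shared by a whole group survives averaging intact. That is why the worry needs an argument for systematic between-group differences, not just individual variation.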

RobertDaoust @ 2020-11-27T20:22 (+8)

Thanks, Michael, that deserves an entry in https://docs.google.com/document/d/1OTCQlWE-GkY_V4V-OfJAr7Q-vJyIR8ZATpeMrLkmlAo/edit#

david_reinstein @ 2022-06-24T22:40 (+2)

Perhaps the sceptic thinks, for some reason, we should give up on happiness data altogether - they don’t buy the construct validation story. But, why stop here? We ask people for subjective ratings all the time: Uber drivers, restaurants, job satisfaction, mental health diagnoses, pain scores, etc. Are these all nonsense too? Surely not.

But Uber is not claiming that a movement from a 4 to a 5 is equally valuable as a movement from a 3 to a 4, are they?

david_reinstein @ 2021-11-05T22:51 (+1)

I'm concerned that there seems to be some elision of cardinality and comparability here.

Couldn't it be possible that

everyone's stated happiness levels in fact are comparable in an ordinal sense, so we know that someone who states "5/10" is happier than someone who states "4/10"

but

these scales are not cardinal so we cannot know if

"a movement from 1-2 is the same increase in happiness as an increase in 4-5 (or 9-10)" ...
(nor any other known correspondence ... such as 'log self reported happiness changes are equivalent')

This is important because if we want to compare policies that tend to increment happiness from different starting points, we need to know "is moving from 1-3 as valuable as moving from 4-6" etc.

Matt g @ 2021-02-14T18:16 (+1)

Thanks for writing this Michael, it seems to be a really important topic to have explored. I particularly respect this conclusion: "Not only does there not seem to be a problem where we feared there might be one, but we may well be able to fix the problem if we later discover it does exist."

My impression is that it's very rare for academics to do research into a topic and then write a conclusion that says "actually this isn't that important, and it might not be a problem". In fact, I think academics might often overstate the importance of the problem they are working on. I think a few reasons for this could be:

-Academics might feel more satisfied in their own work if they feel they're solving an important problem, and less satisfied working on problems that turn out to be unimportant.

-Academics gain social and professional credit for working on 'important' problems, and lose it for working on problems that turn out to be not important.

-Academics stand to gain further research funding if they can convince people the problem they're working on is important and pressing.

-Academics often work intensely on one field of research and don't see the 'bigger picture', which leads them to genuinely believe whatever they're working on is more important than it is.

Jamie_Harris @ 2020-12-05T07:45 (+1)

Interesting! I would have thought you could test this empirically? For example, though it wouldn't tell you the answer to this question, it would be informative if someone:

-Created vignettes about people in various conditions (e.g. with certain diseases) and asked people to rate those people's "happiness, life satisfaction" etc. They could ask people in different countries and check for systematic differences by country, and ask the same people at different time points to see how much the answers vary from time to time.
-(Less useful?) Showed people vignettes about people in various conditions or about different events and stories, didn't ask any questions about those vignettes, but then asked participants (in different countries or at different time points) to rate their own "happiness, life satisfaction" etc. The test here is whether the effect of external stimuli on these ratings is similar in different cultures and across time, or whether the effects vary systematically.
-Asked people questions about their own "happiness, life satisfaction" etc., then asked them to qualitatively describe what 1 and 10 would look like, and did some sort of content analysis of the answers.

(I've only read your summary and I'm not familiar with the literature, so apologies if people already talk about this sort of thing or have run these sorts of studies.)

MichaelPlant @ 2020-12-05T12:04 (+3)

Hello Jamie. Thanks for your astute comment! The paper is quite long and I do cover all of this apart from your third bullet point.

We can't objectively measure subjective states and this seems to have led some people to think that you can't use any empirical evidence at all. But you're right that if you make some assumptions e.g. about vignettes, then if the data go one way that raises your confidence in there being/not being cardinality. This approach is just the basic "inference to the best explanation" used across the sciences (one might even say it's the fundamental method of science).

I discuss vignettes specifically in section 5.5. What you suggest has been done. Angelini et al. (2014) asked people about their own life satisfaction, then showed them this (and another) vignette:

John is 63 years old. His wife died 2 years ago and he still spends a lot of time thinking about her. He has four children and ten grandchildren who visit him regularly. John can make ends meet but has no money for extras such as expensive gifts for his grandchildren. He has had to stop working recently due to heart problems. He gets tired easily. Otherwise, he has no serious health conditions. How satisfied with his life do you think John is?

And then asked people to rate how satisfied John is. The idea is that we can assume 'vignette equivalence' - everyone will agree how satisfied John is - and use that to make inferences about differential scale use and therefore adjust each individual's scores. The issue, as I say (p24), is that:

However, respondents do not seem to agree [how satisfied John is]. For instance, Angelini et al. (2014) find about 30% of Germans rate ‘John’ from the above vignette as satisfied or very satisfied, but 30% rate him dissatisfied or very dissatisfied. To assume that the respondents agree about John’s life satisfaction requires us to conclude that respondents must mean the same thing by “satisfied” as “dissatisfied”, which strains credulity seeing as one is positive and the other negative. Faced with a choice of vignette equivalence or semantic equivalence (that respondents attach the same meaning to words) the latter seems more plausible

The general point is that we need to think carefully about which assumptions we take as 'ground truths' when testing cardinality. Vignette equivalence is, I think, not rock solid.
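To make the logic of a vignette adjustment concrete, here is a deliberately simplified sketch in Python. It is not the statistical model used in the cited study; it is a plain additive shift, and the three respondents and their ratings are hypothetical. It assumes vignette equivalence, i.e. that any disagreement about John reflects scale use rather than a genuine difference of opinion.

```python
# Hypothetical respondents: (own life-satisfaction rating, rating given to the vignette).
respondents = {
    "r1": (7, 6),
    "r2": (5, 4),
    "r3": (7, 4),
}

def adjusted_scores(data):
    """Shift each own-rating by how far that person's vignette rating
    sits from the average vignette rating (assumes vignette equivalence)."""
    vignette_mean = sum(v for _, v in data.values()) / len(data)
    return {name: own - (v - vignette_mean) for name, (own, v) in data.items()}

for name, score in adjusted_scores(respondents).items():
    print(name, round(score, 2))
```

Here r1 and r2 end up with the same adjusted score, because r1 rated both themselves and John two points higher. If respondents in fact disagree about how satisfied John is, this 'correction' just imports that disagreement into the adjusted scores, which is the worry raised in the quote above.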

Re your third bullet point, I think it would be really hard to do it that way around - I can't see any way to get a numerical interpretation from the answers, which is what's needed.