Thoughts on the Reducetarian Labs MTurk Study
By Peter Wildeford @ 2016-12-02T17:12 (+15)
When it comes to helping nonhuman animals in factory farms, there are a lot of things people can do. But when figuring out what is best, we currently have to rely entirely on our intuitions. The hope is that, in the future, empirical research will at least be able to strongly guide our intuitions, improve the effectiveness with which we work, and deepen our understanding of how to achieve victories for animals.
Prior work has been limited in one or more of the following ways (this is all the veg research I know of; if you can name any more, I’d be happy to add them to this literature overview):
- An insufficiently large sample size to find significant effects (e.g., ACE 2013 Humane Education Study, MFA 2016 Online Ads Study).
- An insufficient control group size (e.g., ACE 2013 Leafleting Study; 2014 FARM Pay-per-view Study; VO 2015 Leafleting Study; Ferenbach, 2015; Hennessy, 2016).
- No control group (e.g., THL 2011 Online Ads Study; THL 2012 Leafleting Study; THL 2014 Leafleting Study; THL 2015 Leafleting Study; VO 2015 Pay-per-view Study; James, 2015; Veganuary 2016 Pledge Study).
- Only measuring intent to change diet rather than actual self-reported diet change (e.g., MFA 2016 MTurk Study).
- No attempt to measure differences against a baseline prior to the intervention (e.g., VO 2014 Leafleting Study; CAA 2015 VegFest Study; VO 2016 Leafleting Study; Ardnt, 2016).
Now, however, the Animal Welfare Action Lab and Reducetarian Labs have teamed up, with help from me and Kieran Greig, to produce the 2016 AWAL Newspaper Study (see announcements from the Reducetarian Foundation and from AWAL), and have found the first statistically significant effect on meat reduction using an actual, and sizable, control group. This is an amazing contribution, and I applaud the whole team for doing it well!
In this document, I aim to re-analyze the study for myself. In doing so, I use their data but my own statistical analysis to replicate the effects I care most about. I also recreate all the analysis in this writeup, rephrasing it in my own words (so that I and others can understand it from slightly different perspectives).
What is the basic methodology?
Participants were recruited through Amazon Mechanical Turk, which provides a broadly representative sample of the US population that can be recruited for online surveys at low cost. 3076 participants were recruited to take a baseline food frequency questionnaire with a variety of self-reported behavior and attitude questions. However, for this re-analysis, I’m only going to dig into the questions about how many servings of meat they self-report eating.
One week after that, participants were recontacted to view a newspaper article. One third of participants were randomly assigned to an article advocating vegetarianism, another third to an article advocating reducetarianism (eating less meat), and a final third to a control article advocating exercise. Immediately after viewing the article, participants were asked to report their diet.
Five weeks after that, participants were recontacted and asked to report their diet again.
Exact copies of the newspaper articles, copies of the survey, and full methodology are available in the study writeup.
What are the headline results?
- Participants in the treatment groups, on average, changed their diet to eat 0.8 fewer servings of turkey, pork, chicken, fish, and beef per week than those in the control group.
- There was a statistically significant difference between the three groups (ANOVA, p = 0.03) and between treatment (pooled) and control (chi-square test, p = 0.001).
- There was no statistically significant difference between the reducetarian message and the vegetarian message on diet change (chi-square test, p = 0.09).
- The presence of a control group is important. Without looking at the control group, there appear to be significant decreases in vegetarian rates, but when the control group is taken into account these decreases disappear.
There were other effects on intentions too, but I prefer to focus on the diet change since it’s a lot more important and exciting to me. For more, feel free to check out the original study.
How should we cautiously interpret these results?
The study reports an average of eating one less serving of meat for the treatment groups, but I think it is clearer to split this into three groups -- of the 1422 people in the treatment groups, 258 (18.2%) had no change, 538 (37.8%) increased by an average of 6.94 servings of meat per week, and 626 (44.0%) decreased by an average of 7.66 servings per week.
For the 702 people in the control group, 138 (19.7%) had no change, 286 (40.7%) increased by an average of 6.90 servings per week, and 278 (39.6%) decreased by an average of 6.34 servings per week.
Thus the treatment and control groups differ both in the number of people who ultimately decide to reduce and in the magnitude of the average reduction among those who do reduce. Breaking it up like this, the number of people who reduce (holding the magnitude of reduction constant) is insignificant between groups after controlling for multiple hypothesis testing using the Benjamini-Hochberg procedure (t-test, p = 0.0585), and the magnitude of reduction among those who do reduce (holding the number of people reducing constant) is significant (t-test, p = 0.007).
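For concreteness, here is a minimal sketch of those two comparisons in R, assuming a data frame `d` with hypothetical columns `treated` (TRUE/FALSE) and `change` (endline minus baseline weekly meat servings); the column names are mine, not the study’s:

```r
# Did each person reduce at all? (any reduction vs. no change or increase)
reduced <- d$change < 0

# 1. Share of people who reduce, treatment vs. control
p_any <- t.test(as.numeric(reduced) ~ d$treated)$p.value

# 2. Magnitude of reduction among reducers only
p_mag <- t.test(d$change[reduced] ~ d$treated[reduced])$p.value

# Benjamini-Hochberg adjustment across the family of tests
p.adjust(c(p_any, p_mag), method = "BH")
```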
We can then keep in mind that the actual magnitude of change is, for many people, a lot more than one serving per week, even if it is one serving per week on average for the entire group.
All that being said, it’s still unclear to what degree we can take these numbers literally, since people can’t correctly recall the precise number of servings they have eaten over the past month and since there is a good deal of fluctuation in diets. However, the reduction effect does survive a few different ways of looking at it, as well as my independent data re-analysis: looking at both the magnitude of reduction and a binary value of whether there was any reduction at all; looking across ANOVA, chi-square, and t-tests; and either pooling or not pooling the treatment groups.
I also did an extra sanity check -- did the treatment cause people to also reduce their consumption of fruits, nuts, vegetables, beans, and grains? Or maybe to increase on vegetables due to social desirability bias? The answer is no on the first (chi-square, p = 0.7933) and no on the second (chi-square, p = 0.1778), both of which are good for the results of this study.
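A sketch of that check, reusing the hypothetical data frame from above with made-up logical columns `reduced_plants` (any reduction across the non-meat items) and `increased_veg` (any increase in vegetables):

```r
chisq.test(table(d$treated, d$reduced_plants))  # placebo reduction check
chisq.test(table(d$treated, d$increased_veg))   # social desirability check
```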
When reducing, are people just shifting away from beef and toward chicken?
In a word, no.
Among those who reduced beef, there was also a reduction of 0.7 servings of chicken per week in the control group and a reduction of 2.2 servings in the treatment group.
Similarly, among those who reduced chicken, there was also a reduction of 1 serving of beef per week in the control group and a reduction of 2.3 servings in the treatment group.
I’m not making any claims of statistical significance between treatment and control groups for this, but it is pretty clear that people aren’t shifting from beef to chicken, but rather just reducing across the board.
What did this study not find?
Notably to me, while the overall reduction in meat consumption is significant, this study found no effect on people cutting out meat entirely (even though the vegetarian appeal suggested doing that). Inferring vegetarianism from people who reported consuming no servings of beef, turkey, chicken, fish, and pork, nine people in the treatment group started vegetarianism and twelve stopped, while in the control group four started and three stopped. This difference is not significant among the groups (ANOVA, p = 0.26) or between treatment and control (chi-square test, p = 0.55).
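A sketch of how that inference could be computed, assuming hypothetical per-food serving columns at baseline and endline (the names are mine):

```r
meats <- c("beef", "turkey", "chicken", "fish", "pork")
veg_base <- rowSums(d[paste0(meats, "_baseline")]) == 0
veg_end  <- rowSums(d[paste0(meats, "_endline")]) == 0
started  <- !veg_base & veg_end   # became vegetarian
stopped  <- veg_base & !veg_end   # lapsed from vegetarianism
chisq.test(table(d$treated, started))  # treatment vs. control on starting
```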
What implication does this have for our existing strategies?
I’m going to speculate pretty wildly here and afterwards I’ll mention appropriate disclaimers that walk back my speculation.
According to ACE’s research, which relies on a re-analysis of a 2012 THL study, we would expect that 3.3% of people shown a leaflet or online ad will stop eating red meat, 1.6% will stop eating chicken, 1.0% will stop eating fish, 0.4% will stop eating eggs, and 0.6% will stop eating dairy.
According to this study, 2.7% of people shown a newspaper story about vegetarianism or reducetarianism stop eating red meat. However, notably, 2% of people shown the story about exercise also stop eating red meat. This means that we would estimate the true effect of the newspaper article to be a net change of 0.7 percentage points. (Presumably, the 2% change from the control story comes not from people changing their diet once convinced about the power of exercise, but from a general trend toward vegetarianism over time unrelated to any news story, and/or from social desirability bias, and/or a mix of other factors.)
Going down the list, there is a -0.4 percentage point change for eliminating chicken (once the treatment group is compared to the control group), -0.1 percentage points for fish, +2.0 percentage points for eggs (meaning people in the treatment group ate more eggs than people in the control group -- this could be a substitution effect or it could be just random noise), and -0.6 percentage points for dairy. These stark departures from the other study remind us of the importance of including a control group. Also, don’t take these numbers literally, because none of them were statistically significantly different from +0 percentage points (no change).
-
I’m not going to plug those numbers into the calculator and call it a day, though, because the reality is that there is no statistically significant difference between groups on individual food items, likely because of the smaller per-item sample sizes and high within-item variability. The pattern only emerges at the larger scale of reduction across all foods.
It’s also worth noting that a newspaper article on MTurk is very different from a leaflet handed out in person along with an in-person survey. MTurk can be better in some respects: you’re less likely to have nonresponse bias in who answers your survey, since it’s easier to follow up with everyone, and you have much better knowledge of who was actually in your treatment and control groups. On the other hand, a newspaper article is less persuasive than a leaflet or video, and the fact that people are paid to read it and may expect comprehension questions creates an unrealistic effect that won’t be present in real life.
-
Taking this a different way, we may want to focus on reduction instead of elimination, since that’s what this study was about. However, ACE’s numbers are about how many animals are spared, whereas we only know the number of servings that were not eaten (and even that may be hard to take literally).
To simplify things for myself, I’m going to look just at chicken consumption, since chickens are the vast majority of factory farmed animals. From this study, we found that participants reported eating an average of 4.9 servings of chicken per week at baseline. Given that a serving of chicken is approximately 3 ounces, 4.9 servings of chicken per week is 0.41kg of chicken per week, or 21.32kg per year. As a sanity check, this matches up well with the USDA reporting (p15) that Americans ate 24kg of chicken per year in 2000.
A chicken weighs 1.83kg, so taking this survey data literally would mean that survey respondents are consuming 11.6 chickens per year and respondents in the treatment group reduce their consumption by 0.26 servings per week, which assuming treatment effects continue to hold and don’t decline (a strong assumption) and projecting those effects out annually, would be a reduction of roughly 1.1 chickens per year per respondent.
Assuming the newspaper ad over MTurk is the same as a leaflet or an online ad (another strong assumption), we could project to 1.1 chickens saved per $0.35, or 3.1 chickens spared per dollar, which is quite close to ACE’s estimate of 3.6 chickens spared per dollar.
Going off the speculative deep end, we can note that a factory farmed broiler chicken lives for 42 days on average, so 3.1 chickens spared per dollar is 130 days of factory farmed suffering averted per dollar, which is $2.81 per chicken DALY.
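Here is that chain of arithmetic laid out in R. All inputs are the post’s own figures, and the 1.1 chickens per year reduction is taken as given rather than re-derived:

```r
kg_per_week    <- 0.41   # 4.9 servings/week at ~3 oz (~0.085 kg) per serving
kg_per_year    <- kg_per_week * 52                   # ~21.3 kg of chicken/year
kg_per_chicken <- 1.83
chickens_per_year <- kg_per_year / kg_per_chicken    # ~11.6 chickens/year

reduction_per_year <- 1.1   # chickens spared per respondent per year (see above)
cost_per_contact   <- 0.35  # assumed cost, as with a leaflet or online ad
chickens_per_dollar <- reduction_per_year / cost_per_contact        # ~3.1

days_per_broiler_life <- 42
days_averted_per_dollar <- chickens_per_dollar * days_per_broiler_life  # ~130
dollars_per_chicken_daly <- 365 / days_averted_per_dollar               # ~$2.8
```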
However, I want to strongly caution against taking these numbers very literally, since there still is a lot we don’t know. The serving sizes reported by our sample cannot be taken as literal numbers for a variety of reasons, and they contain a lot of noise and fluctuation. Also, there are many differences between MTurk and the real world, and MTurk might not accurately reflect how our materials perform with the real public.
-
Lastly, it’s also interesting to note that the reducetarian message and the vegetarian message resulted in roughly the same amount of meat reduction (no statistically significant difference between them). However, this claim should not be taken too literally, as it could easily be due to the sample size not being large enough to pick up a small difference between the two messages. This was similar to a finding in Ferenbach (2015), where both videos tested produced roughly the same amount of behavior change, though the sample sizes in that study were even smaller.
Why did this study work when others haven’t?
Earlier I mentioned that all previous studies have been held back by inadequate sample sizes, missing or undersized control groups, or other design issues. Through the magic of a control group and an adequate sample size, this study prevailed. Pretty simple!
One way statistically significant effects could be found in a smaller sample was through a randomized block design. AWAL used this to reduce the variance in the outcome measure, which increases statistical power.
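As a rough illustration (not the study’s actual procedure, which I haven’t reproduced here), block randomization might look like this: sort participants into blocks by a baseline covariate that predicts the outcome, then randomize to the three arms within each block. Column names are hypothetical:

```r
# Form 10 blocks by baseline meat consumption, then randomize within blocks;
# blocking on a predictive covariate cuts residual variance, raising power
d$block <- cut(rank(d$meat_baseline), breaks = 10, labels = FALSE)
d$arm <- NA_character_
for (b in unique(d$block)) {
  idx <- which(d$block == b)
  d$arm[idx] <- sample(rep_len(c("vegetarian", "reducetarian", "control"),
                               length(idx)))
}
```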
This study had a control group of 742 people and a treatment group of 1495 people. Combined, that’s about 20% larger than the previous largest study, the MFA 2016 Online Ads Study, with a treatment group of 934 people and a control group of 864 people.
What unanswered questions remain?
I’d like to do a more careful power analysis to see exactly how this study managed to survey enough people to detect an effect, especially since it’s not that much larger than other studies.
I’d be curious to also replicate the study analysis without the randomized block design and see if the effects still hold. It could be that statistical power was only achieved through the block design.
It’s also possible the study may have just gotten lucky.
-
I’d like to look more into the attrition data. Since some people who took the first wave didn’t show up for the second wave, and some who took the second wave didn’t show up for the third (despite much higher compensation), there’s a worry about nonresponse bias: people predisposed to not like vegetarianism may drop out instead of filling out a survey showing their lack of change, which would lead us to overstate the amount of vegetarianism.
-
I’d also like to look more deeply at the way the food frequency questionnaire was used. As my friend and ACE intern Kieran Greig pointed out to me, the FFQ asks the respondent about their meat intake in discrete, ordinal buckets (zero times per week, 0-1 times per week, 1-6 times per week, 1-3 times per day, 4 or more times per day) and these buckets are then transformed into continuous, numerical data (e.g., “zero times per week” became 0 times per week, “1-6 times per week” became 3.5 times per week, “4 or more times per day” became 28 times per week).
However, there are multiple other ways these buckets could have been transformed into continuous data (for example, assuming “4 or more” roughly becomes 4 seems to really underestimate the “or more” part), and it would be quite problematic if the effect failed to replicate under certain methods and not others. I have not yet tested this, but the fact that the study effects do hold under a binary variable (any meat reduction at all versus no or negative meat reduction) is encouraging.
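A sketch of this transformation and one alternative, for sensitivity testing. The 0, 3.5, and 28 values follow the post; the 0.5 and 14 midpoints and the heavier top bucket are my own illustrative assumptions:

```r
to_weekly <- c("zero times per week"     = 0,
               "0-1 times per week"      = 0.5,   # assumed midpoint
               "1-6 times per week"      = 3.5,
               "1-3 times per day"       = 14,    # assumed: 2/day * 7
               "4 or more times per day" = 28)
# Alternative: weight the open-ended top bucket more heavily
to_weekly_alt <- replace(to_weekly, "4 or more times per day", 35)
d$chicken_weekly <- to_weekly[as.character(d$chicken_bucket)]  # hypothetical column
```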
There are also other methods for analyzing differences in ordinal values between the treatment and control groups and between the baseline and endline waves, such as ordinal logistic regression, which can find statistically significant effects without the need to pick a particular method of transforming the ordinal data into continuous, numeric data. While the output from such a model would be harder to interpret in terms of the amount of meat reduced, we would definitely expect a statistically significant effect from this kind of model if the effects of the treatment are real and not just an artifact of the transformation method used.
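A minimal sketch of that ordinal approach with MASS::polr, assuming hypothetical columns `meat_bucket_endline` (an ordered factor) and `meat_bucket_baseline`:

```r
library(MASS)
# Proportional-odds logistic regression: a significant `treated` term
# would support a real effect without committing to any numeric mapping
fit <- polr(meat_bucket_endline ~ treated + meat_bucket_baseline,
            data = d, Hess = TRUE)
summary(fit)
```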
While the transformation does introduce complications, I’d generally note that using an ordinal FFQ sounds like a good idea. The ordinal buckets may be less precise, but I’d expect respondents to find them much easier to fill out than trying to recall the precise number of servings they ate. Since I wouldn’t really trust the difference between someone self-reporting 4 servings instead of 3, it makes sense to create sizable buckets where we would expect differences to be meaningful.
-
I’m somewhat curious how well a newspaper article approximates a leaflet, and I’d like to see the study replicated with actual leaflets (including a control leaflet), though others I’ve talked to have been less interested in more MTurk studies.
-
I and many others would also very much like to see a study on a platform other than MTurk, such as a replication of the MFA 2016 Online Ads Study but with an even larger sample size. We’d really like to see if effects on MTurk hold up in other areas.
-
The bottom line is that while these results are encouraging, we know that most studies that people try to replicate end up failing to replicate. There are still many unanswered questions here that we’ll only really know as the field of empirical animal rights work continues to evolve.
-
Disclaimer: I funded 75% of the costs of the study, provided consulting on the study methodology, and continue to be involved in Reducetarian Labs’s empirical work.
Thanks to Krystal Caldwell, Kieran Greig, Brian Kateman, Bobbie Macdonald, Justis Mills, Joey Savoie, and Allison Smith for reviewing an advanced copy of this work.
undefined @ 2016-12-02T22:09 (+3)
Thanks for writing this up.
The estimated differences due to treatment are almost certainly overestimates due to the statistical significance filter (http://andrewgelman.com/2011/09/10/the-statistical-significance-filter/) and social desirability bias.
For this reason and the other caveats you gave, it seems like it would be better to frame these as loose upper bounds on the expected effect, rather than point estimates. I get the feeling people often forget the caveats and circulate conclusions like "This study shows that $1 donations to newspaper ads save 3.1 chickens on average".
I continue to question whether these studies are worthwhile. Even if this one had not found significant differences between the treatments and control, it’s not as if we were going to stop spreading pro-animal messages. And it was not powered to detect the treatment differences in which you are interested. So it seems it was unlikely to be action-guiding from the start. And of course there’s no way to know how much of the effect is explained by social desirability bias.
undefined @ 2016-12-06T22:20 (+2)
A chicken weighs 1.83kg, so taking this survey data literally would mean that survey respondents are consuming 11.6 chickens per year and respondents in the treatment group reduce their consumption by 0.26 servings per week, which assuming treatment effects continue to hold and don’t decline (a strong assumption) and projecting those effects out annually, would be a reduction of roughly 1.1 chickens per year per respondent.
Does your calculation account for the fact that only part of the chicken actually gets converted into meat that is eaten? There are approximately 9 billion chickens slaughtered each year in the United States (a country of roughly 300 million people), so the mean consumption should be around 30 chickens a year.
undefined @ 2016-12-02T23:29 (+2)
Good write-up.
I find the very long list of badly-designed studies you note in the introduction a cause for consternation, and I'm glad this has been done much better.
However, I couldn't see a power calculation in the study, nor in the pre-registration, so I worry the planned recruitment of 3000 was either plucked from the air or decided on due to budget constraint. Yet performing this calculation given an effect size you'd be interested in is generally preferable to spending money on an underpowered study (which I'm pretty sure this is).
Given the large temporal fluctuations (e.g., the large reduction in the control group) and the pretty modest effects, I remain sceptical - to say nothing of the obvious biases like social desirability etc. Another reanalysis which might reassure would be Monte Carlo permutation of food groups: if very few random groups show a reduction in consumption of similar magnitude to meat, great (and, of course, vice versa).
undefined @ 2016-12-06T01:14 (+5)
Here is the main plot from our power calculations that informed our sample size selection (alongside budget constraints): http://i.imgur.com/aeEYagA.png
undefined @ 2016-12-04T19:44 (+2)
I worry the planned recruitment of 3000 was either plucked from the air or decided on due to budget constraint
I want to be careful not to speak for the authors here, but I'm personally pretty sure it was picked by budget constraint, though with an eye to power calculations (that I saw, not sure why they weren't published) suggesting it would be sufficient.
-
Given the large temporal fluctuations (e.g., the large reduction in the control group) and the pretty modest effects, I remain sceptical - to say nothing of the obvious biases like social desirability etc.
Agreed.
-
Another reanalysis which might reassure would be Monte Carlo permutation of food groups: if very few random groups show a reduction in consumption of similar magnitude to meat, great (and, of course, vice versa).
In my re-analysis, I did make a "bogustarian" label looking at reduction in beans, fruits, nuts, vegetables, and grains and found no statistically significant results (see https://github.com/bnjmacdonald/reducetarian-messaging-study/blob/master/peter-reanalysis/analysis.R#L124-L132). So maybe that's reassuring, but one could extend this to be a true monte carlo method.
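Something like this rough sketch, perhaps: draw random bundles of food items, compute the treatment-vs-control difference in consumption change for each bundle, and see where the actual meat bundle falls in that null distribution. Column names are hypothetical, and with only ten items one could enumerate all bundles rather than sample:

```r
items <- c("beef", "chicken", "pork", "fish", "turkey",
           "beans", "fruits", "nuts", "vegetables", "grains")
diff_in_change <- function(cols) {
  change <- rowSums(d[paste0(cols, "_endline")]) -
            rowSums(d[paste0(cols, "_baseline")])
  mean(change[d$treated]) - mean(change[!d$treated])
}
null_diffs <- replicate(1000, diff_in_change(sample(items, 5)))
obs <- diff_in_change(c("beef", "chicken", "pork", "fish", "turkey"))
mean(null_diffs <= obs)  # share of random bundles at least as extreme
```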
undefined @ 2016-12-07T00:06 (+1)
Major props to the authors on the study and for this follow-up write-up as well.
For maximizing power per dollar, another technique to consider is the one outlined here, in which researchers first found a subsample of people more responsive to follow-up surveys (of course, there are concerns about external validity):
http://science.sciencemag.org/content/sci/suppl/2016/04/07/352.6282.220.DC1/Broockman-SM.pdf