An Initial Response to MFA's Online Ads Study

By kierangreig🔸 @ 2016-02-19T16:43 (+5)

I think that Mercy For Animals is really leading by example by performing evaluations like this and being so transparent with the results. Much credit to Peter Hurford, .impact, and Jason Ketola for an excellent study design. :)


I don't think that any of the points I make require urgent answers, and please know that I am not trying to be critical of MFA or any individuals involved with this study. All of these comments/questions are strictly my own personal views and not those of any of my current or future employers. I think it's really worthwhile for everyone to do their best to understand what these results imply for animal advocacy and what we can learn for future studies. I have also attempted to be polite in my tone, but this can often be a struggle when communicating online. If I come across as abrasive, rude, or confrontational at some points, that really wasn't my intention :).

It’s also probably worth noting that empirical studies are really hard to fully execute and even the most competent professionals sometimes make mistakes. For instance, this year’s winner of the Nobel Prize in economics fairly recently published a paper that seems to have data analysis errors.

I also think it’s great to see that most people in the Facebook thread are responding in a constructive manner, though it’s disappointing to see some responding with comments that may be intentionally insulting.

My comments/questions ended up becoming pretty long, so I copied and pasted them into a Google Document in case it’s easier to give feedback there. I also posted a summary of this post to the original Facebook thread discussing this study here.

 

I think that the most important points I make are:

  1. The study is really useful because we now can say that it’s likely that the effect of animal advocacy in this context is less than some threshold.

  2. I don’t think this threshold is a 10% difference between the groups. That seems to be true for the raw data, where the number of participants in each group was ~1,000, but it seems that only 684 participants in the control group and 749 in the experimental group were included in the final analysis, so I think the threshold is greater than a 10% difference between the groups.

  3. I think that the way this study has been reported has some really admirable aspects.   

  4. Some of the wording used to report the study may be slightly misleading.

  5. I think there may be some overestimation of how big a sample size is needed in a study like this to get useful results.

  6. I think the results of this study are reason to think that animal advocacy via online advertising in this context is less effective than I previously thought it to be. This is because the study suggests it’s unlikely that the effects of online advertising in this context are above a threshold to which I previously assigned some probability, and I have now lowered the probability I put on an effect like this in light of the study’s findings. As a result, I would direct slightly fewer resources to online advertising in this context relative to other techniques than I would have prior to being aware of the results of this study.


With that out of the way my extended questions/comments on the reporting of this study and what we can learn for future studies will now begin! :)


Would it be better to do a pre-analysis plan for future studies?

 

Would it be better to do pre treatment/intervention and post treatment/intervention data collection rather than just post treatment/intervention data collection for future studies? By this I mean something like a baseline survey and an endline survey which seemed to be used in a lot of social science RCTs.

 

Was it worth using Edge Research to analyze the data for this study? Will external bodies like Edge Research do data analysis for future MFA studies?

 

Why was the study so low powered? Was it originally thought that online ads were more effective or perhaps the study’s power was constrained by inadequate funding?

 

“Edge Research then “weighted” the data so the respondents from each group were identical in gender, geography, and age.” I am not totally sure what this means and it seems important. It would be great if someone could please explain more about what the “weighting” process entails.

 

“Our study was powered to detect a 10 percent difference between the groups…”

I really like how much emphasis is being put on the power of this study; too often, not enough attention is paid to this aspect of study methodology. That said, I think it might be a little misleading to say that this study was powered to detect a 10 percent difference between the groups, though I am not totally sure. The statement seems true for the raw data, which has ~1,000 participants in each group, but if we focus on the 684 participants in the control group and 749 participants in the experimental group, I think the study probably has less power than is needed to detect a 10 percent difference between the groups.
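As a rough check (a sketch under my own assumptions rather than a reconstruction of MFA's power analysis: I'm assuming the comparison is a two-sample t-test at α = 0.05 with 80% power), one can compare the minimum detectable effect size at the two sample sizes:

```python
# Minimum detectable effect size (Cohen's d) for a two-sample t-test at
# alpha = 0.05 and 80% power, at the two sample sizes discussed above.
# Translating d into a "percent difference between the groups" would require
# further assumptions about which outcome, mean, and SD the original power
# statement was based on, so only the two d values are compared here.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Raw data: roughly 1,000 participants per group.
d_raw = analysis.solve_power(nobs1=1000, ratio=1.0, alpha=0.05, power=0.80)

# Analyzed data: 684 control vs. 749 experimental participants.
d_analyzed = analysis.solve_power(nobs1=684, ratio=749 / 684, alpha=0.05, power=0.80)

print(f"Detectable effect size with ~1,000 per group: d ~ {d_raw:.3f}")
print(f"Detectable effect size with 684 vs. 749:      d ~ {d_analyzed:.3f}")
```

Whatever effect size “a 10 percent difference between the groups” corresponds to, the analyzed sample needs a somewhat larger effect than the raw sample to detect it, which is the gap I’m pointing at here.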


“... a 10 percent difference between the groups, and since the differences between the groups were much smaller than that, we can’t be confident about whether the differences between the groups were due to chance or were the true group means.“

This is a good point and I am glad it is being made here, but I worry it could be a little misleading. When I analyzed the percentage difference in animal product consumption of the experimental group relative to the control group, I seemed to find an average difference of 4.415%. It might also be worth noting that the percentage difference in pork consumption of the experimental group relative to the control group was 9.865%, and the corresponding difference in chicken consumption was 7.178%. It depends on one’s reference frame, and this might just be semantics, but in this context I probably wouldn’t characterize 4.415%, 7.178%, and 9.865% as much smaller than 10%.
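To make concrete what I mean by a percentage difference relative to the control group, here is a minimal sketch of the calculation, applied to the egg-serving means quoted in the power-calculation section further down (the pork and chicken figures above come from the same formula applied to their respective means):

```python
# Relative difference of the experimental group versus the control group,
# expressed as a percentage of the control group's mean. The egg-serving
# means below are the ones quoted later in this post.
def relative_difference(experimental_mean: float, control_mean: float) -> float:
    return (experimental_mean - control_mean) / control_mean * 100

print(f"{relative_difference(1.447, 1.389):.3f}%")  # ~4.176% for egg servings
```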


“However, since other studies have found that showing footage of farmed animal cruelty decreases meat consumption, and since the difference in our study was not close to statistically significant, we think it’s unlikely.”

It is great practice to link to other studies like this when writing up the results of a study, and I encourage people to continue doing so :) I noticed that “other studies” was hyperlinked to the abridged report of a study titled “U.S. Meat Demand: The Influence of Animal Welfare Media Coverage.” My limited understanding is that that study looks at the effect that newspaper and magazine articles regarding farm animal welfare have on meat consumption. Based on this, I am not sure what conclusions we can reach about the effect of footage of farmed animal cruelty on meat consumption.


“However, since other studies have found that showing footage of farmed animal cruelty decreases meat consumption, and since the difference in our study was not close to statistically significant, we think it’s unlikely.”

Hmm… I could be wrong, but I feel that emphasising that the increased animal product consumption results were not close to being statistically significant may slightly conflict with this later statement: “Participants 17–20 and 21–25 in the experimental group appeared to be eating slightly to dramatically more animal products than those of the same age in the control group. However, none of these differences was statistically significant. Had we not applied the Bonferroni correction, nearly half of these differences would have been statistically significant at the 85% or 95% level.” This could be a misunderstanding on my part.


“Because of the extremely low power of our study, we don’t actually know whether the two groups’ diets were the same or slightly different.”

“extremely” might be a bit of an overstatement :) I guess it depends once again on one’s reference frame and possibly semantics. It might be worth mentioning that my understanding is that this is the highest-powered study the animal advocacy movement has completed.


“Based on our study design, it appears we would have needed tens to hundreds of thousands of participants to properly answer this question.”

This may be a little misleading. Take reported egg consumption, which is perhaps the most suffering-dense animal product, so possible impacts on egg consumption are perhaps the most important thing for us to measure. Doing a quick two-tailed power calculation based on participants’ reported consumption of egg servings in the past two days, using:

Control group mean: 1.389

Experimental group mean: 1.447

Ballpark standard deviation of reported egg servings: ~1.79

gives a required sample size, to detect statistically significant results 80% of the time at the classic p = 0.05 level, of approximately 15,000 participants in total. This seems to be clearly on the lower end of the estimated sample size required, and it’s probably good that future studies take a consideration like this into account.
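For anyone who wants to check or adapt this, here is a minimal sketch of the calculation using statsmodels, with the means and ballpark standard deviation above. One caveat: solve_power returns a per-group sample size, so depending on how the calculation is framed, the total requirement may be roughly double the figure it prints.

```python
# Rough reproduction of the egg-consumption power calculation above. The means
# and standard deviation are the ballpark figures quoted in this post; treat
# the result as an order-of-magnitude estimate, not an exact requirement.
from statsmodels.stats.power import TTestIndPower

control_mean = 1.389        # mean egg servings, control group
experimental_mean = 1.447   # mean egg servings, experimental group
ballpark_sd = 1.79          # ballpark SD of reported egg servings

effect_size = abs(experimental_mean - control_mean) / ballpark_sd  # Cohen's d ~ 0.032

n_per_group = TTestIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,   # two-tailed significance level
    power=0.80,   # chance of detecting the effect if it exists
)
print(f"Required participants per group: ~{n_per_group:,.0f}")
```

Either way, the requirement comes out in the low tens of thousands rather than the hundreds of thousands, which is the point being made here.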


“The bottom line is this: We don’t know whether showing people farmed animal cruelty video in this context will cause them to slightly increase, slightly decrease, or sustain their consumption of animal products a few months later.”

Nice work on clearly communicating the bottom line of the study’s results. This is a great precedent for future studies, as it helps ensure the results aren’t inaccurately interpreted by others, which can be a significant problem with studies like this one. I do wonder, though, whether a more informative bottom line would be something like: “We think that it’s likely that showing people a farmed animal cruelty video in this context will not cause more than a [insert answer from reworked accurate power calculation] overall difference in animal product consumption a few months later compared to someone who is similar in almost all respects but didn’t watch a farmed animal cruelty video in this context.”


“We compared the groups in two ways: (1) with a Bonferroni correction to account for the fact that we were conducting multiple group comparisons, and (2) without the Bonferroni correction.”

Excellent work using the Bonferroni correction! :) I think that’s another great precedent for future studies to follow.
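For readers who haven’t run into it, here is a minimal sketch of how the correction is typically applied; the p-values are made-up placeholders, not figures from the MFA data:

```python
# Bonferroni correction with statsmodels. Each raw p-value is multiplied by
# the number of comparisons (capped at 1), which is equivalent to testing
# every comparison at alpha / m instead of alpha.
from statsmodels.stats.multitest import multipletests

raw_p_values = [0.012, 0.048, 0.20, 0.65]  # hypothetical p-values for 4 comparisons

reject, corrected_p, _, _ = multipletests(raw_p_values, alpha=0.05, method="bonferroni")
for p_raw, p_adj, rej in zip(raw_p_values, corrected_p, reject):
    print(f"raw p = {p_raw:.3f} -> Bonferroni p = {p_adj:.3f}, significant: {bool(rej)}")
```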


“If the “less meat” and “no meat” categories were combined into one general “reducer” category, we would find a statistically significant difference at the 95% level, with the experimental group more likely to intend to eat less meat four months into the future.”

I really like how transparent this is and that it was made clear that categories would have to be combined in order for there to be a statistically significant result. Less transparent reporting of this study could easily have neglected to mention that fact. That said, I feel that reporting this result without also reporting the somewhat analogous result from the previous section, “Self-Reported Dietary Change” (obtained by combining the categories “Decreased in last 4 months” and “Did not eat meat then, do not eat meat now”), may be slightly inconsistent reporting of results. This could be a misunderstanding on my part though.


“Finally, we asked three questions about attitudes. Answers to these questions have correlated with rates of vegetarian eating and meat reduction in other studies.”

This may be a silly question but what studies are being referred to here? :)


“Discouragingly, we found no statistically significant differences in reported meat, dairy, and egg consumption. But because we powered the study to detect only a 10% difference, we can’t be confident there was no difference or a modest positive or negative difference. Since even a tiny reduction (for example 0.1%) in meat consumption could make an online advertising program worthwhile, there’s no useful information we can take away from participants’ reported food choices.”

I probably wouldn’t say that there’s no useful information we can take away from participants’ reported food choices. I think you’re being much too hard on yourselves there :). To quote part of a previous comment from Harish Sethu, who made this point very well on the Facebook thread: “All studies, even ones which do not detect something statistically significant, tell us something very useful. If nothing, they tell us something about the size of the effect relative to the size of any methodological biases that may mask those effects. Studies which fail to detect something statistically significant play a crucial role in informing the methodological design of future studies (and not just the sample size).”


“Based on all of the above, we don’t feel the study provides any concrete practical guidance. Therefore, the study won’t cause us to reallocate funding for our online advertising in a positive or negative direction.”

This seems surprising to me. It seems to imply that all of the information gathered cancels out and doesn’t cause one to update positively or negatively at all. I previously had a relatively weak and fairly dispersed prior regarding the impact of animal advocacy via online advertising. I think the results of this study are reason to think that animal advocacy via online advertising in this context is less effective than I previously thought it to be. This is because the study suggests it’s unlikely that the effects of online advertising in this context are above a threshold to which I previously assigned some probability, and I have now lowered the probability I put on these larger effects in light of the study’s findings. As a result, I would direct slightly fewer resources to online advertising in this context relative to other techniques than I would have prior to being aware of the results of this study.


“Large-scale studies such as this one have found that for online advertising campaigns, the majority of impact comes from changing the behaviors of those who view the ad but never click on it. In our study, we looked only at those who actually clicked on the ad. Therefore, our study wasn’t a study of the overall impact of online pro-vegetarian ads; rather it was a study on the impact of viewing online farmed animal cruelty video after clicking on an ad.”


I am not totally sure, but I think the wrong study may have been linked here accidentally. I skimmed the linked study and couldn’t find any evidence supporting the claim being made here; instead, it seems to support the claim made in the next bullet point of MFA’s report. But maybe I missed something when skimming. Is there a specific page number or section of the linked study that has the relevant information? I would be quite interested in looking at a study that did find the results mentioned here.


“Large-scale studies of online advertising have also found that sample sizes of more than 1 million people are typically needed before statistically significant behavioral changes can be detected. We spoke to numerous data collection companies as well as Facebook and accessing that much data for a study like this is not possible.”

I am a little skeptical of the external validity of those results when applied to the context of the ads used in this MFA study, because there seem to be a couple of key relevant differences between the types of ads used in the two studies. Based on my limited understanding, MFA’s ads in this study were targeted Facebook ads which, when clicked, took participants to a landing page that played a video. In contrast, the other cited study’s ads were display ads on Yahoo’s homepage, and participants were random visitors who were assigned to treatment. Based on this, I am unsure how informative it is to mention this other study in support of the conclusion that it is difficult to draw practical conclusions from this MFA study. It might also be worth drawing attention to the power calculation I did earlier, which suggested we may be able to get quite useful results from a total sample size of 15,000 participants.


Counterfactuals are always really difficult to get at, but I can’t help wondering whether the following would have been mentioned in the initial reporting of the study if the study had found positive results: “Numerous studies have found that self-reports on dietary choices are extremely unreliable. On top of that, we’ve also found that diet-change interventions (like the one we did with the experimental group) can change how people self-report their food choices. This combination of unreliability and discrepancy suggests the self-reports of servings of animal products eaten may not be valid.” My intuition is that it wouldn’t have been, and I think that may be a little problematic.


I think it’s great that you have made the raw data from this experiment public. This is another thing that future studies would do well to emulate. I would love to do a quick re-analysis of the data to make sure we reach the correct conclusions from this study. I had a quick look, but it isn’t initially clear to me how to reach 684 participants in the control group and 749 in the experimental group. My guess is that participants who were 12 years old or younger, who didn’t fully complete the survey, or who were not female were excluded from the final analysis. Is that right? Or were there additional exclusion criteria?

 


undefined @ 2016-02-20T05:12 (+4)

Would it be better to do a pre-analysis plan for future studies?

Did you see the methodology? Or are you wanting them to have committed to something in the analysis that they didn't talk about there?

Would it be better to do pre treatment/intervention and post treatment/intervention data collection rather than just post treatment/intervention data collection for future studies?

The idea is, have a third, smaller, group that went immediately to a survey? That's a good idea, and not that expensive per survey result. That helps you see the difference between things like whether the video makes people more likely to go veg vs reduces recidivism.

Was it worth using Edge Research to analyze the data for this study? Will external bodies like Edge Research do data analysis for future MFA studies?

The Edge analysis doesn't look useful to me, since they didn't do anything that unusual and there are lots of people in the community in a position to analyze the data. Additionally, my impression is that working with them added months of delay. So I certainly wouldn't recommend this in the future!

Why was the study so low powered? Was it originally thought that online ads were more effective or perhaps the study’s power was constrained by inadequate funding?

In the methodology they write: "We need to get at minimum 3.2k people to take the survey to have any reasonable hope of finding an effect. Ideally, we'd want say 16k people or more." My guess is they just failed to bring enough people back in for followup through ads.

“Edge Research then “weighted” the data so the respondents from each group were identical in gender, geography, and age.” I am not totally sure what this means and it seems important. It would be great if someone could please explain more about what the “weighting” process entails.

I think it's something like this. You take the combined experimental and control groups and you figure out for each characteristic (gender, country, age range) what the distribution is. Then if you happened to get extra UK people in your control group compared to your experimental group, instead of concluding that you made people leave the UK you conclude that you happened to over-sample the UK in the control and under-sample in the experimental. To fix this, you assign a weight to every response based on for each demographic how over- or under-sampled it is. Then if you're, say, totalling up servings of pork, instead of straight adding them up you first multiply the number of servings each person said they had by their weight, and then add them up.
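In code, a toy version of this kind of post-stratification weighting might look something like the sketch below; the rows, countries, and serving counts are made up for illustration (they aren't MFA's data, and the real weighting also used gender and age):

```python
# Toy post-stratification weighting with pandas: weight each response so that
# each group's demographic mix matches the combined sample's mix.
import pandas as pd

df = pd.DataFrame({
    "group": ["control", "control", "experimental", "experimental"],
    "country": ["UK", "US", "US", "US"],
    "pork_servings": [2, 1, 3, 0],
})

# Target distribution: each country's share of the combined sample.
target_share = df["country"].value_counts(normalize=True)

# Observed distribution of countries within each group.
observed_share = df.groupby("group")["country"].value_counts(normalize=True)

# Weight = target share / observed share: over-sampled cells get weights
# below 1, under-sampled cells get weights above 1.
df["weight"] = [
    target_share[row.country] / observed_share[(row.group, row.country)]
    for row in df.itertuples()
]

# Weighted totals: multiply each response by its weight before summing.
print((df["pork_servings"] * df["weight"]).groupby(df["group"]).sum())
```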

undefined @ 2016-02-20T11:35 (+1)

Did you see the methodology?

Yeah, looks like this is the relevant section:

“Later, Edge Research will complete “a final data report including (a) an outline of the research methodology and rationale, (b) high level findings and takeaways, and a (c) drill downs on specific areas and audiences.

It’s currently unclear what precise methodology Edge Research will use to analyze the data, but the expectation is that they would use a Chi-Square test to compare the food frequency questionnaires between both the treatment and control groups, looking both for meat reduction and elimination.”

Or are you wanting them to have committed to something in the analysis that they didn't talk about there?

Yeah I would have liked a more detailed pre-analysis plan. I think there was perhaps too much researcher freedom in the data analysis. This probably makes questionable data analysis techniques and inaccurate interpretation of results more likely. Some things that I think could have been useful to mention in a pre-analysis plan are:

  • Information about the data weighting process.

  • How incomplete survey responses will be treated.

  • How the responses of those who aren’t females aged 13-25 will be treated.

KG: “Would it be better to do pre treatment/intervention and post treatment/intervention data collection rather than just post treatment/intervention data collection for future studies?”

JK: “The idea is, have a third, smaller, group that went immediately to a survey? That's a good idea, and not that expensive per survey result. That helps you see the difference between things like whether the video makes people more likely to go veg vs reduces recidivism.”

The idea you suggest sounds promising, but it’s not what I meant. With my initial question I intended to ask: would it be better for future studies to have both a baseline collection of data prior to the intervention and an endline collection of data sometime after the intervention, rather than just an endline collection? I ask because my general impression is that standard practice for RCTs in the social sciences is to do pre- and post-intervention data collection, and there are likely good reasons why that’s the case. I understand that pre- and post-intervention data collection may cost significantly more than post-intervention collection alone, but I wonder if the possibly increased usefulness of a study’s results outweighs these increased costs.

The Edge analysis doesn't look useful to me, since they didn't do anything that unusual and there are lots of people in the community in a position to analyze the data. Additionally, my impression is that working with them added months of delay. So I certainly wouldn't recommend this in the future!

Sounds like we probably have pretty similar views about the limited value of Edge’s collaboration. I also probably wouldn’t recommend using them in future.

My guess is they just failed to bring enough people back in for follow up through ads.

That makes sense as a likely reason why the study was low powered. I wonder if alternative options could have been explored when/if it looked like this was the case, to prevent the study from being so low powered. For instance, showing more people the initial ad in this circumstance could have led to more people completing the survey, which would likely have increased the power of the study. Although it may have been difficult to do this for a variety of reasons.

You take the combined experimental and control groups and you figure out for each characteristic (gender, country, age range) what the distribution is. Then if you happened to get extra UK people in your control group compared to your experimental group, instead of concluding that you made people leave the UK you conclude that you happened to over-sample the UK in the control and under-sample in the experimental. To fix this, you assign a weight to every response based on for each demographic how over- or under-sampled it is. Then if you're, say, totalling up servings of pork, instead of straight adding them up you first multiply the number of servings each person said they had by their weight, and then add them up.

Thanks for explaining this- it’s much clearer to me now :)

undefined @ 2016-02-20T14:54 (+1)

I wonder if alternative options could have been explored when/if it looked like this was the case, to prevent the study from being so low powered. For instance, showing more people the initial ad in this circumstance could have led to more people completing the survey, which would likely have increased the power of the study. Although it may have been difficult to do this for a variety of reasons.

The way it worked is they advertised to a bunch of people, then four months later tried to follow up with as many as they could using cookie retargeting. At that point they learned they didn't have as much power as they hoped, but to fix it you need to take another four months and increase the budget by at least 4x.

undefined @ 2016-02-21T22:57 (+3)

I think the results of this study are reason to think that animal advocacy via online advertising in this context is less effective than I previously thought it to be. This is because the study suggests it’s unlikely that the effects of online advertising in this context are above a threshold to which I previously assigned some probability, and I have now lowered the probability I put on an effect like this in light of the study’s findings.

Which threshold was that and how did you arrive at that conclusion? I don't really know one way or another yet, but upgrading or downgrading confidence seems premature without concrete numbers.

-

As a result, I would direct slightly fewer resources to online advertising in this context relative to other techniques than I would have prior to being aware of the results of this study.

It seems unfair to deallocate money from online ads where studies are potentially inconclusive to areas where studies don't exist, unless you have strong pre-existing reasons to distinguish those interventions as higher potential.

-

Would it be better to do a pre-analysis plan for future studies?

We pre-registered the methodology. It would have been nicer to also pre-register the survey questions and the data analysis plan. However, I was working with other people and limited time, so I didn't manage to make that happen.

-

Would it be better to do pre treatment/intervention and post treatment/intervention data collection rather than just post treatment/intervention data collection for future studies? By this I mean something like a baseline survey and an endline survey which seemed to be used in a lot of social science RCTs.

I think it is too unclear how the baseline survey could be effectively collected without adding another large source of declining participation.

-

Was it worth using Edge Research to analyze the data for this study?

No.

-

Will external bodies like Edge Research do data analysis for future MFA studies?

I don't know, but likely not, especially now that MFA has hired more in-house statistical expertise.

However, Edge Research in particular will not be used again.

-

Why was the study so low powered?

The conversion rate from people who got the original treatment to people who took the follow-up survey was lower than expected and lower than what had been determined via piloting the study.

Was it originally thought that online ads were more effective

No.

or perhaps the study’s power was constrained by inadequate funding?

No. The study was fully funded and was instead limited by how large of a Facebook ad campaign MFA could logistically manage at the time. I was told that if we wanted to increase the sample size further we'd likely have to either (a) relax the requirement that we target females age 13-25 or (b) go into foreign language ads.

-

“Edge Research then “weighted” the data so the respondents from each group were identical in gender, geography, and age.” I am not totally sure what this means and it seems important. It would be great if someone could please explain more about what the “weighting” process entails.

I also am clueless about this weighting process and think it should be disclosed. Though I think ACE re-analyzed the data without any weighting and came to pretty similar conclusions.

-

I had a quick look, but it isn’t initially clear to me how to reach 684 participants in the control group and 749 in the experimental group. My guess is that participants who were 12 years old or younger, who didn’t fully complete the survey, or who were not female were excluded from the final analysis. Is that right? Or were there additional exclusion criteria?

I think they just excluded all people who didn't answer every question.