Do Prof Eva Vivalt's results show 'evidence-based' development isn't all it's cracked up to be?
By Robert_Wiblin @ 2018-05-21T16:28 (+17)
I recently interviewed a member of the EA community - Prof Eva Vivalt - at length about her research into the value of using trials in medical and social science to inform development work. The benefits of 'evidence-based giving' have been a core message of both GiveWell and Giving What We Can since they started.
Vivalt's findings somewhat challenge this, and are not as well known as I think they should be. The bottom line is that results from existing studies only weakly predict the results of similar future studies. They appear to have poor 'external validity' - they don't reliably indicate the effect an intervention will appear to have in future studies. This means that developing an evidence base to figure out how well projects will work is more expensive than it otherwise would be.
Perversely, in some cases this can make further studies more informative, because we currently know less than we would if past results generalized well.
Note that Eva discussed an earlier version of this paper at EAG 2015.
Another result that conflicts with messages 80,000 Hours has used before is that experts on average are fairly good at guessing the results of trials (though you need to average many guesses). Aggregating these guesses may be a cheaper alternative to running studies, though the guesses may become worse without trial results to inform them.
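To see why many guesses are needed, here's a toy simulation (my own illustrative numbers, nothing from the paper): if each guess is the true effect plus idiosyncratic noise, individual guesses can be far off while the pooled mean lands close to the truth.

```python
# Toy simulation (illustrative numbers only): why averaging many noisy expert
# guesses can track a trial result even when individual guesses are unreliable.
import numpy as np

rng = np.random.default_rng(0)

true_effect = 0.15   # hypothetical true treatment effect
n_experts = 50       # hypothetical number of forecasters
noise_sd = 0.30      # spread of individual guesses around the truth

guesses = true_effect + rng.normal(0, noise_sd, size=n_experts)

print("typical individual error:", np.mean(np.abs(guesses - true_effect)))
print("error of the pooled mean:", abs(guesses.mean() - true_effect))
# The pooled mean's error shrinks roughly like noise_sd / sqrt(n_experts),
# which is why aggregation needs many guesses before it becomes informative.
```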
Eva's view is that there isn't much alternative to collecting evidence like this - if it's less useful we should just accept that, but continue to run and use studies of this kind.
I'm more inclined to say this should shift our approach. Here's one division of the sources of information that inform our beliefs:
1. Foundational priors
2. Trials in published papers
3. Everything else (e.g. our model of how things work based on everyday experience).
Inasmuch as 2 looks less informative, we should rely more on the alternatives (1 and 3).
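To make that concrete, here's a minimal sketch with made-up numbers: treat each of the three sources as a noisy estimate of the effect we care about and combine them by precision (inverse-variance) weighting. If poor external validity inflates the effective uncertainty of the trial evidence for our setting, its weight falls and the combined belief leans more on 1 and 3.

```python
# Minimal sketch (made-up numbers): combine three sources of belief about an
# effect by inverse-variance weighting; less reliable sources get less weight.
import numpy as np

def combine(estimates, variances):
    """Precision-weighted mean of independent noisy estimates."""
    w = 1.0 / np.asarray(variances, dtype=float)
    return float(np.sum(w * np.asarray(estimates, dtype=float)) / np.sum(w))

prior_est, prior_var = 0.0, 0.20   # (1) foundational prior: sceptical, uncertain
trial_est            = 0.3         # (2) headline effect from published trials
other_est, other_var = 0.1, 0.10   # (3) everything else (informal evidence)

# If results generalise poorly, the *effective* variance of the trial estimate
# for our setting is much larger than its reported sampling variance.
for trial_var in (0.05, 0.50):
    combined = combine([prior_est, trial_est, other_est],
                       [prior_var, trial_var, other_var])
    print(f"effective trial variance {trial_var:.2f} -> combined estimate {combined:.2f}")
```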
Of course Eva's results may also imply that 3 won't generalize between different situations either. In that case, we also have more reason to work within our local environment. It should nudge us towards thinking that useful knowledge is more local and tacit, and less universal and codified. We would then have greater reason to become intimately familiar with a particular organisation or problem and try to have most of our impact through those areas we personally understand well.
It also suggests that problems which can be tackled with published social science may not have as high tractability - relative to alternative problems we could work on - as it first seems.
You can hear me struggle to figure out how much these results actually challenge conventional wisdom in the EA community later in the episode, and I'm still unsure.
For an alternative perspective from another economist in the community, Rachel Glennerster, you can read this article: Measurement & Evaluation: The Generalizability Puzzle. Glennerster believes that generalisability is much less of an issue than Vivalt does, and is not convinced by how Vivalt has tried to measure it.
There are more useful links and a full transcript on the blog post associated with the podcast episode.
undefined @ 2018-05-22T10:20 (+3)
This video might also add to the discussion - the closing panel at CSAE this year was largely on methodology, moderated by Hilary Greaves (head of the new Global Priorities Institute at Oxford), with Michael Kremer, Justin Sandefur, Joseph Ssentongo, and myself. Some of the comments from the other panellists still stick with me today.
https://ox.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=ec3f076c-9c71-4462-9b84-a8a100f5a44c
undefined @ 2018-05-21T16:43 (+3)
I agree that limitations on RCTs are a reason to devalue them relative to other methodologies. They still add value over our priors, but I think the best use cases for RCTs are when they're cheap and can be done at scale (e.g. in the context of online surveys), or when you are randomizing an expensive intervention that would be provided anyway, such that the relative cost of the RCT is low.
When the costs of RCTs are large, I think there's reason to favor other methodologies, such as regression discontinuity designs, which have fared quite well compared to RCTs (https://onlinelibrary.wiley.com/doi/abs/10.1002/pam.22051).
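For readers unfamiliar with the design, here's a bare-bones sketch on simulated data (illustrative only - real analyses use data-driven bandwidths and robust inference): fit a local linear regression on each side of the cutoff and take the gap in fitted values at the threshold as the treatment effect.

```python
# Bare-bones sharp regression discontinuity on simulated data (illustrative
# only; real applications use robust bandwidth selection and inference).
import numpy as np

rng = np.random.default_rng(1)
n, cutoff, true_jump, bandwidth = 2000, 0.0, 0.5, 0.25

running = rng.uniform(-1, 1, n)                # running (assignment) variable
treated = (running >= cutoff).astype(float)    # sharp assignment rule
outcome = running + true_jump * treated + rng.normal(0, 0.3, n)

def fitted_value_at_cutoff(x, y, keep):
    """Local linear fit within the bandwidth on one side of the cutoff."""
    mask = (np.abs(x - cutoff) <= bandwidth) & keep
    slope, intercept = np.polyfit(x[mask] - cutoff, y[mask], 1)
    return intercept  # fitted value at the cutoff itself

left = fitted_value_at_cutoff(running, outcome, running < cutoff)
right = fitted_value_at_cutoff(running, outcome, running >= cutoff)
print("estimated jump at the cutoff:", right - left)  # should be near 0.5
```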
undefined @ 2018-05-22T00:02 (+3)
I agree that it would be important to weigh the costs and benefits - I don't think it's exclusively an issue with RCTs, though.
One thing that could help in doing this calculus is a better understanding of when our non-study-informed beliefs are likely to be accurate.
I know at least some researchers are working in this area - Stefano DellaVigna and Devin Pope are planning to follow up their excellent papers on predictions with another looking at how well people predict results based on differences in context, and Aidan Coville and I also have some work in this area using impact evaluations in development and predictions gathered from policymakers, practitioners, and researchers.
undefined @ 2018-05-21T18:42 (+1)
Would the development of a value-of-information (VoI) checklist be helpful here? Heuristics and decision criteria similar to the flowchart the Campbell Collaboration has for experimental design.
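For instance, the checklist could be built around a simple expected-value-of-information calculation; here's a toy version with made-up numbers comparing acting on current beliefs to acting after a study resolves which of two programmes is better.

```python
# Toy expected-value-of-perfect-information calculation (made-up numbers).
# Decision: fund programme A or programme B, where B's effect is uncertain.

p_b_works = 0.4                        # current belief that B outperforms A
value_a = 100                          # value of funding A (assumed known)
value_b_good, value_b_bad = 180, 40    # B's value if it works / if it doesn't

# Act now on current beliefs: pick whichever option has the higher expected value.
ev_b = p_b_works * value_b_good + (1 - p_b_works) * value_b_bad      # 96
ev_now = max(value_a, ev_b)                                          # 100

# With perfect information we pick the best option in each state of the world.
ev_perfect_info = (p_b_works * max(value_a, value_b_good)
                   + (1 - p_b_works) * max(value_a, value_b_bad))    # 132

print("upper bound on what the study is worth:", ev_perfect_info - ev_now)  # 32
```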
undefined @ 2018-05-30T18:51 (+2)
I think Evidence-Based Policy: A Practical Guide To Doing It Better is also a good source here. The blurb:
Over the last twenty or so years, it has become standard to require policy makers to base their recommendations on evidence. That is now uncontroversial to the point of triviality--of course, policy should be based on the facts. But are the methods that policy makers rely on to gather and analyze evidence the right ones? In Evidence-Based Policy, Nancy Cartwright, an eminent scholar, and Jeremy Hardie, who has had a long and successful career in both business and the economy, explain that the dominant methods which are in use now--broadly speaking, methods that imitate standard practices in medicine like randomized control trials--do not work. They fail, Cartwright and Hardie contend, because they do not enhance our ability to predict if policies will be effective.
The prevailing methods fall short not just because social science, which operates within the domain of real-world politics and deals with people, differs so much from the natural science milieu of the lab. Rather, there are principled reasons why the advice for crafting and implementing policy now on offer will lead to bad results. Current guides in use tend to rank scientific methods according to the degree of trustworthiness of the evidence they produce. That is valuable in certain respects, but such approaches offer little advice about how to think about putting such evidence to use. Evidence-Based Policy focuses on showing policymakers how to effectively use evidence, explaining what types of information are most necessary for making reliable policy, and offers lessons on how to organize that information.
undefined @ 2018-05-21T18:44 (+2)
Really happy to see this get some attention. I think this is where the biggest potential value add of EA lies. Very, very few groups are prepared to do work on methodological issues, and those that do generally seem to get bogged down in object-level implementation details quickly (see the output of METRICS, for example). Method work is hard, and connecting people and resources to advance it is neglected.
undefined @ 2018-05-22T00:28 (+3)
And when groups do work on these issues there is a tendency towards infighting.
Some things that could help:
- Workshops that bring people together. It's harder to misinterpret someone's work when they are describing it in front of you, and it's easier to make fast progress towards a common goal (and to increase the salience of the goal).
- Explicitly recognizing that the community is small and needs nurturing. It's natural for people to worry at first that someone else is working in their coveted area (career concerns), but overall I think it might be a good thing even on a personal level. It's such a neglected topic that if people work together and help bring attention to it, real progress could be made. In contrast, sometimes you see a subfield where people are so busy tearing down each other's work that nothing can get published or funded - a much worse equilibrium.
Bringing people together is hugely important to working constructively.
undefined @ 2018-05-21T17:56 (+2)
I also conduct research on the generalizability issue, but from a different perspective. In my view, any attempt to measure effect heterogeneity (and by extension, research generalizability) is scale dependent. It is very difficult to tease apart genuine effect heterogeneity from the appearance of heterogeneity due to using an inappropriate scale to measure the effects.
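A toy example of the scale problem (made-up numbers): two subgroups can share exactly the same risk ratio while showing very different risk differences, so whether the effect looks heterogeneous depends on the scale you report it on.

```python
# Toy example (made-up numbers): the same data look homogeneous on one effect
# scale and heterogeneous on another.
baseline_risks = {"low-risk group": 0.10, "high-risk group": 0.40}
risk_ratio = 2.0   # suppose treatment doubles the risk in both groups

for group, p0 in baseline_risks.items():
    p1 = risk_ratio * p0
    print(f"{group}: risk ratio = {p1 / p0:.1f}, risk difference = {p1 - p0:.2f}")

# Both groups share a risk ratio of 2.0, yet the risk differences are
# 0.10 vs 0.40 -- heterogeneity appears or vanishes with the chosen scale.
```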
In order to get around this, I have constructed a new scale for measuring effects, which I believe is more natural than the alternative measures. My work on this is available on arXiv at https://arxiv.org/abs/1610.00069 . The paper has been accepted for publication in the journal Epidemiologic Methods, and I plan to post a full explanation of the idea here and on Less Wrong when it is published (presumably, this will be a couple of weeks from now).
I would very much appreciate feedback on this work, and as always, I operate according to Crocker's Rules.
undefined @ 2018-05-21T18:39 (+2)
I think 'counterfactual outcome state transition parameters' is a bad name, in that it doesn't help people identify where and why they should use it, nor does it communicate all that well what it really is. I'd want to thesaurus each of the key terms in order to search for something punchier. You might object that essentially 'marketing' an esoteric statistics concept seems perverse, but papers with memorable titles do in fact outperform, according to the data, AFAIK. Sucks, but what can you do?
I bother to go into this because this research area seems important enough to warrant attention and I worry it won't get it.
undefined @ 2018-05-21T18:48 (+1)
Thank you! I will think about whether I can come up with a catchier name for future publications (and about whether the benefits outweigh the costs of rebranding).
If anyone has suggestions for a better name (for an effect measure that intuitively measures the probability that the exposure switches a person's outcome state), please let me know!