A Bayesian framework for interpreting impact evaluations
By Lizka @ 2022-04-28T12:27 (+23)
This is a linkpost to https://statmodeling.stat.columbia.edu/2022/04/27/hey-check-this-out-its-really-cool-a-bayesian-framework-for-interpreting-findings-from-impact-evaluations/
This is from the blog, "Statistical Modeling, Causal Inference, and Social Science." Andrew Gelman, one of the authors of the blog, has given me permission to cross-post.
This is a cross-post of a cross-post, I guess?
The post ("Hey, check this out, it’s really cool: A Bayesian framework for interpreting findings from impact evaluations.")
Chris Mead points us to a new document by John Deke, Mariel Finucane, and Daniel Thal, prepared for the Department of Education’s Institute of Education Sciences. It’s called The BASIE (BAyeSian Interpretation of Estimates) Framework for Interpreting Findings from Impact Evaluations: A Practical Guide for Education Researchers, and here’s the summary:
BASIE is a framework for interpreting impact estimates from evaluations. It is an alternative to null hypothesis significance testing. This guide walks researchers through the key steps of applying BASIE, including selecting prior evidence, reporting impact estimates, interpreting impact estimates, and conducting sensitivity analyses. The guide also provides conceptual and technical details for evaluation methodologists.
I looove this, not just all the Bayesian stuff but also the respect it shows for the traditional goals of null hypothesis significance testing. They’re offering a replacement, not just an alternative.
Also, they do well with the details. For example:
Probability is the key tool we need to assess uncertainty. By looking across multiple events, we can calculate what fraction of events had different types of outcomes and use that information to make better decisions. This fraction is an estimate of probability called a relative frequency. . . .
The prior distribution. In general, the prior distribution represents all previously available information regarding a parameter of interest. . . .
I really like that they express this in terms of “evidence” and “information” rather than “belief.”
They also discuss graphical displays and communication that is both clear and accurate; for example, recommending summaries such as, “We estimate a 75 percent probability that the intervention increased reading test scores by at least 0.15 standard deviations, given our estimates and prior evidence on the impacts of reading programs for elementary school students.”
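A statement like that falls straight out of the posterior draws. Here is a minimal sketch (not from the BASIE guide) of how you might compute such a probability, assuming you already have posterior draws of the effect in standard-deviation units from a fitted model; the draws below are simulated stand-ins, not real output:

```python
import numpy as np

# Hypothetical stand-in for posterior draws of the intervention's effect
# (in standard deviation units), e.g. extracted from a fitted Stan model.
rng = np.random.default_rng(0)
effect_draws = rng.normal(loc=0.20, scale=0.07, size=4000)

# Posterior probability that the impact is at least 0.15 standard deviations:
prob = np.mean(effect_draws >= 0.15)
print(f"Estimated probability the impact is >= 0.15 SD: {prob:.2f}")
```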
And it all runs in Stan, which is great partly because Stan is transparent and open-source and has good convergence diagnostics and a big user base and is all-around reliable for Bayesian inference, and also because Stan models are extendable: you can start with a simple hierarchical regression and then add measurement error, mixture components, and whatever else you want.
And this:
Local Stop: Why we do not recommend the flat prior
A prior that used to be very popular in Bayesian analysis is called the flat prior (also known as the improper uniform distribution). The flat prior has infinite variance (instead of a bell curve, a flat line). It was seen as objective because it assigns equal prior probability to all possible values of the impact; for example, impacts on test scores of 0, 0.1, 1, 10, and 100 percentile points are all treated as equally plausible.
When probability is defined in terms of belief rather than evidence, the flat prior might seem reasonable—one might imagine that the flat prior reflects the most impartial belief possible (Gelman et al., 2013, Section 2.8). As such, this prior was de rigueur for decades.
But when probability is based on evidence, the implausibility of the flat prior becomes apparent. For example, what evidence exists to support the notion that impacts on test scores of 0, 0.1, 1, 10, and 100 percentile points are all equally probable? No such evidence exists; in fact, quite a bit of evidence is completely inconsistent with this prior (for example, the distribution of impact estimates in the WWC [What Works Clearinghouse]). The practical implication is that the flat prior overestimates the probability of large effects. Following Gelman and Weakliem (2009), we reject the flat prior because it has no basis in evidence.
The implausibility of the flat prior also has an interesting connection to the misinterpretation of p-values. It turns out that the Bayesian posterior probability derived under a flat prior is identical (for simple models, at least) to a one-sided p-value. Therefore, if researchers switch to Bayesian methods but use a flat prior, they will likely continue to exaggerate the probability of large program effects (which is a common result when misinterpreting p-values). . . .
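To make that connection concrete, here is a small sketch (mine, not the report's) in the standard normal-normal conjugate setting: under a flat prior the posterior probability of a positive effect reproduces one minus the one-sided p-value, while an evidence-based prior shrinks both the estimate and that probability. The numbers (an estimate of 0.25 SD with standard error 0.15, and a prior centered at 0 with SD 0.10) are made up for illustration:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical study: estimated impact of 0.25 SD with standard error 0.15
# (one-sided p-value of about 0.048).
est, se = 0.25, 0.15

# Flat prior: the posterior is N(est, se^2), so the posterior probability
# of a positive effect equals 1 minus the one-sided p-value.
p_positive_flat = norm.cdf(est / se)

# Evidence-based prior: suppose prior evidence (e.g. a distribution of past
# impact estimates) suggests effects centered at 0 with SD 0.10.
mu0, tau = 0.0, 0.10
post_var = 1 / (1 / tau**2 + 1 / se**2)
post_mean = post_var * (mu0 / tau**2 + est / se**2)
p_positive_informed = norm.cdf(post_mean / np.sqrt(post_var))

print(f"P(effect > 0) under flat prior:        {p_positive_flat:.3f}")
print(f"P(effect > 0) under informative prior: {p_positive_informed:.3f}")
print(f"Posterior mean shrinks from {est:.2f} to {post_mean:.2f} SD")
```

With these made-up numbers, the flat prior gives about a 0.95 probability of a positive effect (exactly the complement of the one-sided p-value), while the informative prior pulls the estimate toward zero and gives roughly 0.82, which is the sense in which the flat prior exaggerates the probability of large effects.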
Yes, I’m happy they cite me, but the real point here is that they’re thinking in terms of modeling and evidence, also that they’re connecting to important principles in non-Bayesian inference. As the saying goes, there’s nothing so practical as a good theory.
What makes me particularly happy is the way in which Stan is enabling applied modeling.
This is not to say that all our problems are solved. Once we do cleaner inference, we realize the limitations of experimental data: with between-person studies, sample sizes are never large enough to get stable estimates of interactions of interest (recall 16), which implies the need for . . . more modeling, as well as open recognition of uncertainty in decision making. So lots more to think about going forward.
Full disclosure: My research is funded by the Institute of Education Sciences, and I know the authors of the above-linked report.