How much do you believe your results?

By Eric Neyman @ 2023-05-05T19:51 (+211)

This is a crosspost, probably from LessWrong. Try viewing it there.

Paul_Christiano @ 2023-05-06T01:53 (+21)

There was a related GiveWell post from 12 years ago, including a similar example where higher "unbiased" estimates correspond to lower posterior expectations.

That post is mostly focused on practical issues about being a human, and is much less amusing, but it speaks directly to your question #2.

(Of course, I'm most interested in question #3!)

Davidmanheim @ 2023-05-07T07:14 (+3)

Also see Why the Tails Come Apart, and regressional Goodhart.

Karthik Tadepalli @ 2023-05-07T20:05 (+18)

Fun read! A point like this gets made every so often on the Forum, and I feel like a one-trick pony because I always make the same response, which is preempted in your question 1: these results rely heavily on the true spread of intervention quality being of the same order of magnitude as your experimental noise. And when intervention quality has a fat tailed distribution, that will almost never be true. If the best intervention is 10 SD better than the mean, any normally distributed error will have a negligible effect on our estimates of its quality.

And in general, experimental noise should be normal by the central limit theorem, so I don't know what you mean by "experimental noise likely has fatter tails than a log normal distribution".
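
A minimal simulation sketch of the fat-tails point (the log-normal quality distribution and the unit-SD normal noise are illustrative assumptions, not anything from the thread): when the best intervention sits many noise standard deviations above the mean, noise almost never changes which intervention looks best, even though the two scales are within an order of magnitude of each other.

```python
# Illustrative only: fat-tailed true quality vs. normally distributed measurement noise.
import numpy as np

rng = np.random.default_rng(0)
n_interventions, n_sims = 1000, 200
hits = 0
for _ in range(n_sims):
    quality = rng.lognormal(mean=0.0, sigma=1.0, size=n_interventions)  # fat-tailed true quality
    noise = rng.normal(0.0, 1.0, size=n_interventions)                  # normal measurement error
    estimate = quality + noise
    hits += int(estimate.argmax() == quality.argmax())
print(f"Top estimate is the truly best intervention in {hits / n_sims:.0%} of runs")
```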

Eric Neyman @ 2023-05-07T20:47 (+13)

Thanks -- I should have been a bit more careful with my words when I wrote that "measurement noise likely follows a distribution with fatter tails than a log-normal distribution". The distribution I'm describing is your subjective uncertainty over the standard error of your experimental results. That is, you're (perhaps reasonably) modeling your measurement as being the true quality plus some normally distributed noise. But -- normal with what standard deviation? There's an objectively right answer that you'd know if you were omniscient, but you don't, so instead you have a subjective probability distribution over the standard deviation, and that's what I was modeling as log-normal.

I chose the log-normal distribution because it's a natural choice for the distribution of an always-positive quantity. But something more like a power law might've been reasonable too. (In general I think it's not crazy to guess that the standard error of your measurement is proportional to the size of the effect you're trying to measure -- in which case, if your uncertainty over the size of the effect follows a power law, then so would your uncertainty over the standard error.)

(I think that for something as clean as a well-set-up experiment with independent trials of a representative sample of the real world, you can estimate the standard error well, but I think the real world is sufficiently messy that this is rarely the case.)
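
As a rough sketch of the model described above (the log-normal hyperparameters and the comparison against a plain normal are illustrative choices, not Eric's): drawing the standard deviation from a log-normal distribution and then drawing normal noise with that standard deviation gives a scale mixture of normals, whose tails are far heavier than those of a single normal with the same overall spread.

```python
# Illustrative only: normal noise with a log-normally uncertain standard deviation.
import numpy as np

rng = np.random.default_rng(1)
n = 1_000_000
sigma = rng.lognormal(mean=0.0, sigma=1.0, size=n)           # subjective draw of the (unknown) SD
mixture_noise = rng.normal(0.0, sigma)                        # noise, conditional on that SD
plain_noise = rng.normal(0.0, mixture_noise.std(), size=n)    # single normal with matched overall SD

threshold = 5 * mixture_noise.std()
print("P(|noise| > 5 SD), uncertain-SD mixture:", np.mean(np.abs(mixture_noise) > threshold))
print("P(|noise| > 5 SD), plain normal        :", np.mean(np.abs(plain_noise) > threshold))
```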

Karthik Tadepalli @ 2023-05-07T22:57 (+13)

In general I think it's not crazy to guess that the standard error of your measurement is proportional to the size of the effect you're trying to measure

Take a hierarchical model for effects. Each intervention has a true effect $\theta_i$, and all the $\theta_i$ are drawn from a common distribution $G$. Now for each intervention, we run an RCT and estimate $\hat\theta_i = \theta_i + \varepsilon_i$, where $\varepsilon_i$ is experimental noise.

By the CLT, $\varepsilon_i \sim N(0, \sigma^2/n)$, where $\sigma^2$ is the inherent sampling variance in your environment and $n$ is the sample size of your RCT. What you're saying is that $\sigma^2$ has the same order of magnitude as the variance of $G$. But even if that's true, the noise variance $\sigma^2/n$ shrinks in proportion to $1/n$ as your RCT sample size grows, so $\sigma^2/n$ and the variance of $G$ should not be in the same OOM for reasonable values of $n$. I would have to do some simulations to confirm that, though.

I also don't think it's likely to be true that $\sigma^2$ has the same OOM as the variance of $G$. The factors that cause sampling variance - randomness in how people respond to the intervention, randomness in who gets selected for a trial, etc. - seem roughly comparable across interventions. But the intervention qualities are not roughly comparable - we know that the best interventions are OOMs better than the average intervention. I don't think we have any reason to believe that the noisiest interventions are OOMs noisier than the average intervention.

(I think that for something as clean as a well-set-up experiment with independent trials of a representative sample of the real world, you can estimate the standard error well, but I think the real world is sufficiently messy that this is rarely the case.)

I'm not sure what you mean by this; I think any collection of RCTs satisfies the setting I've laid out.
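
In the spirit of the "I would have to do some simulations to confirm that" remark, here is a quick sketch (the log-normal $G$, $\sigma = 1$, and $n = 400$ are purely illustrative choices): with a reasonable RCT sample size, the standard error $\sigma/\sqrt{n}$ is tiny relative to the spread of $G$, and the noisy estimates still identify the best intervention.

```python
# Illustrative only: hierarchical model theta_i ~ G, estimated with noise N(0, sigma^2/n).
import numpy as np

rng = np.random.default_rng(2)
n_interventions = 500
sigma = 1.0                      # assumed within-trial outcome SD, same for every intervention
n = 400                          # assumed RCT sample size
theta = rng.lognormal(mean=0.0, sigma=1.0, size=n_interventions)              # fat-tailed G
theta_hat = theta + rng.normal(0.0, sigma / np.sqrt(n), size=n_interventions)  # noisy RCT estimates

print("SD of true effects (spread of G):", round(theta.std(), 3))
print("standard error of each estimate :", round(sigma / np.sqrt(n), 3))
print("best estimate is the best intervention:", bool(theta_hat.argmax() == theta.argmax()))
```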

JoshuaBlake @ 2023-05-09T11:57 (+3)

I think you're assuming your conclusion here:

Now for each intervention, we run an RCT and estimate $\hat\theta_i = \theta_i + \varepsilon_i$, where $\varepsilon_i$ is experimental noise.

What if the noise is on the log scale?

Karthik Tadepalli @ 2023-05-10T05:24 (+4)

The central limit theorem is precisely what justifies what I said: the noise is not on the log scale, because of the CLT.

Now, if you transform your coefficient onto a log scale, then all bets are off. But that is not what is happening anywhere in this post. And it's not really what happens in reality either; I don't know why anyone would do it.
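
A small illustration of the CLT claim (the log-normal outcomes, sample size, and effect size are illustrative choices of mine): even when individual outcomes are heavily skewed, the error of a difference-in-means estimate from a moderately sized RCT is approximately normal on the raw scale, not the log scale.

```python
# Illustrative only: sampling error of a difference-in-means estimator with skewed outcomes.
import numpy as np

rng = np.random.default_rng(3)
true_effect, n, reps = 0.5, 500, 5_000
errors = np.empty(reps)
for r in range(reps):
    control = rng.lognormal(0.0, 1.0, size=n)                 # heavily skewed individual outcomes
    treated = rng.lognormal(0.0, 1.0, size=n) + true_effect
    errors[r] = (treated.mean() - control.mean()) - true_effect

print("share of errors within 1 SE:", np.mean(np.abs(errors) < errors.std()))      # ~0.68 if normal
print("share of errors within 2 SE:", np.mean(np.abs(errors) < 2 * errors.std()))  # ~0.95 if normal
```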

JoshuaBlake @ 2023-05-06T08:59 (+9)

Very nice explanation. I think this problem is roughly the same as the one in Noah Haber's winning entry for GiveWell's "change our mind" contest.

You can instead consider noiseless, partial measurements — ones that only consider some of the effects of an intervention, without considering others. (For the unmeasured effects you just stick with your priors.) Such interventions are “unbiased” in a different, more Bayesian sense: whatever your measurement is, your best guess for the quality of an intervention is equal to your measurement.

I'm struggling a little at what you're trying to say here. Is it the issue of combining priors in a deterministic model, where you have priors on both the parameters and the outcome? There is some literature on this, and I believe that Bayesian melding (Poole and Raftery 2000) is the standard approach, which recommends logarithmic pooling of the priors.

Eric Neyman @ 2023-05-06T18:06 (+4)

Let's take the very first scatter plot. Consider the following alternative way of labeling the x and y axes. The y-axis is now the quality of a health intervention, and it consists of two components: short-term effects and long-term effects. You do a really thorough study that perfectly measures the short-term effects, while the long-term effects remain unknown to you. The x-value is what you measured (the short-term effects); the actual quality of the intervention is the x-value plus some unknown, mean-zero, variance-1 number.

So whereas previously (i.e. in the setting I actually talk about), we have E[measurement | quality] = quality (I'm calling this the frequentist sense of "unbiased"), now we have E[quality | measurement] = measurement (what I call the Bayesian sense of "unbiased").
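
A toy numerical check of this setup (the numbers are mine, purely illustrative): quality is the sum of a perfectly measured short-term effect and an unmeasured, mean-zero long-term effect, so conditioning on the measurement leaves the expected quality equal to the measurement.

```python
# Illustrative only: a noiseless, partial measurement that is "unbiased" in the Bayesian sense.
import numpy as np

rng = np.random.default_rng(4)
n = 200_000
short_term = rng.normal(0.0, 1.0, size=n)   # perfectly measured component
long_term = rng.normal(0.0, 1.0, size=n)    # unmeasured, mean-zero component
quality = short_term + long_term
measurement = short_term                     # we observe only the short-term effects

high = measurement > 1.5                     # condition on seeing a large measurement
print("mean measurement among high measurements :", measurement[high].mean())
print("mean true quality among high measurements:", quality[high].mean())  # roughly equal: E[quality | measurement] = measurement
```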

Davidmanheim @ 2023-05-07T07:16 (+2)

Yes - though I think this is just an elaboration of what Abram wrote here.

Nayanika @ 2023-05-07T08:57 (+4)

The last three questions in this post have been areas of personal curiosity for me since my EA In-depth fellowship days. Hoping to see some answers in the upcoming posts.

Gavin Bishop @ 2023-05-06T01:03 (+4)

This was a great read, illuminating and well-paced. Thanks!

David Mears @ 2023-05-24T13:39 (+2)

I'm taking away that how much I believe results is super sensitive to how I decide to model the distribution of actual intervention quality, and how I decide to model the contribution of noise.

How would I infer how to model those things?

Oscar Delaney @ 2023-05-21T12:09 (+1)

This was enjoyable to read and I was surprised by some of the results, thanks!