External evaluation of GiveWell's research

By Aaron Gertler 🔸 @ 2020-05-22T04:09 (+14)

This is a linkpost to https://blog.givewell.org/2013/03/01/external-evaluation-of-our-research/

Aaron's note: I stumbled across this today and got lost in the links; I hadn't realized how much effort GiveWell put (at least in the early years) into hiring people to evaluate their research. Here's a full list of their external reviews, which goes into a lot more depth than this post.

Also, this post was originally written in 2013; I just recently discovered it and decided to repost.


We’ve long been interested in the idea of subjecting our research to formal external evaluation. We publish the full details of our analysis so that anyone may critique it, but we also recognize that it can take a lot of work to digest and critique our analysis, and we want to be subjecting ourselves to constant critical scrutiny (not just to the theoretical possibility of it).

A couple of years ago, we developed a formal process for external evaluations, and had several such evaluations conducted and published. However, we haven’t had any such evaluations conducted recently. This post discusses why.

In brief, formal external evaluation carries significant costs and challenges for us, and in the meantime we have been seeing more informal critical engagement with our research than ever before.

Between these two factors, we aren’t currently planning to conduct more external evaluations in the near future. However, we remain interested in external evaluation and hope eventually to make frequent use of it again. And if someone volunteered to do (or facilitate) formal external evaluation, we’d welcome this and would be happy to prominently post or link to criticism.

 

The challenges of external evaluation

The challenges of external evaluation are significant:

On the “evaluating research” front, one plausible candidate for “qualified evaluator” would be an accomplished development economist. However, in practice, many accomplished development economists (a) are extremely constrained in the time they have available; (b) have affiliations of their own (the more interested a scholar is in practical implications for aid, the more likely they are to be directly involved with a particular organization or intervention), which may bias their evaluation.

I felt that we struck a good balance with a 2011 evaluation by Prof. Tobias Pfutze, a development economist. Prof. Pfutze took ten hours to choose a charity to give to – using GiveWell’s research as well as whatever other resources he found useful – and we “paid” him by donating funds to the charity he chose. However, developing this assignment, finding someone who was both qualified and willing to do it, and providing support as the evaluation was conducted required significant capacity on our part.

Given the time investment these sorts of activities require on our part, we’re hesitant to go forward with one until we feel confident that we are working with the right person in the right way and that the research they’re evaluating will be representative of our work for some time to come.

 

Improvements in informal evaluation

Over the last year, we feel that we’ve seen substantially more deep engagement with our research than ever before, even as our investments in formal external evaluation have fallen off.

 

Where we stand

 

We continue to believe that it is important to ensure that our work is subjected to in-depth scrutiny. However, at this time, the scrutiny we’re naturally receiving – combined with the high costs and limited capacity for formal external evaluation – makes us inclined to postpone major effort on external evaluation for the time being.

That said, we remain interested in external evaluation, and we would welcome – and prominently post or link to – any formal evaluation that a volunteer was willing to conduct or facilitate.


Tsunayoshi @ 2020-05-22T12:10 (+5)

Related to external evaluations: 80000hours used to have a little box at the bottom of an article, indicating a score given to it by internal and external evaluators. Does anybody know why this is not being done anymore?

Ozzie Gooen @ 2020-05-26T11:42 (+12)

Oh man, happy to have come across this. I'm a bit surprised people remember that article. I was one of the main people who set up the system; that was a while back.

I don't know specifically why it was changed. I left 80k in 2014 or so and haven't discussed this with them since. I could imagine some reasons why they stopped it though. I recommend reaching out to them if you want a better sense.

This was done when the site was a custom Ruby/Rails setup, and the feature required a fair bit of custom coding to set up. Writing quality was more variable then than it is now; there were several newish authors and it was much earlier in the research process. I also remember that originally the scores disagreed a lot between evaluators, but over time (the first few weeks of use) they converged a fair bit.

After I left they migrated to WordPress, where I assume setting up a similar system would have required a fair bit of effort. The blog posts also seem to have become less important than they used to be, in favor of the career guide, coaching, the podcast, and other things. And the quality has become a fair bit more consistent, from what I can tell as an onlooker.

The ongoing costs of such a system are considerable. First, it just takes a fair bit of time from the reviewers. Second, unfortunately, the internet can be a hostile place for transparency: there are trolls and angry people who will actively search through details and then point them out without the proper context. I think this review system was kind of radical, and I can imagine it not being very comfortable to maintain unless it really justified the effort.

I'm of course sad it's no longer in place, but can't really blame them.

Nathan Young @ 2020-05-23T13:55 (+3)

I'm gonna be a bit of a maverick and split my comment into separate ideas so you can upvote or downvote them separately. I think this is a better way to do comments, but it looks a bit spammy.

Aaron Gertler @ 2020-06-09T20:19 (+3)

I actually think this is a better way to do comments in most cases!

Nathan Young @ 2020-06-11T18:05 (+1)

I suggest that higher-karma users be able to split your comments into multiple comments without editing any text.

Nathan Young @ 2020-06-11T18:01 (+1)

Also, while we're here, I think it would be cool if this website showed who else was currently viewing the page. It could also create more spontaneous EA connections. A friend has built something for this and I think it would be easy to implement.

Nathan Young @ 2020-06-11T17:50 (+1)

Yes, though not being able to comment more than once every X seconds is a little frustrating. (Sometimes I write all my comments, then split them up, then post them separately.)

Nathan Young @ 2020-05-23T13:59 (+2)

There should be more context on the important decision-making tools.

I could be wrong, but I think most decisions are made using Google Sheets. I've read a few of these and I think there could be more context around which numbers are the most important.

Nathan Young @ 2020-05-23T14:05 (+1)

It should be possible to give feedback on specific points.

In the future, I am confident that all articles will be able to have comments on any part of the text, like comments in a Google Doc. This means people can edit or comment on specific points. This is particularly important with Fermi models and could be implemented - people could comment on each part of an evaluation to criticise some specific bit. One wrong leap of logic in an argument makes the whole argument void, so GiveWell's models need this level of scrutiny.

Nathan Young @ 2020-05-23T14:03 (+1)

All the most important models should also have crowdsourced answers.

I *think* GiveWell uses models to make decisions. It would be possible to crowdsource numbers for each step, and I predict you would get better answers if you did this. The wisdom of crowds is a thing. It breaks down when the crowd doesn't understand the model, but if you are getting them to guess individual parts of a model, it works again.
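
As a rough sketch of what crowdsourcing "individual parts of a model" could look like: each person guesses each input separately, the guesses are aggregated per parameter, and the model itself combines the aggregates. (Everything below is hypothetical; the parameter names and figures are made up for illustration, not taken from GiveWell.)

```python
# Hypothetical sketch: aggregate crowd guesses for each step of a toy
# cost-effectiveness model, rather than asking the crowd for the final answer.
from statistics import median

# Each participant submits a guess for each input of the model (made-up numbers).
crowd_guesses = {
    "cost_per_net_usd":       [4.5, 5.0, 6.0, 4.8, 5.5],
    "nets_per_death_averted": [600, 900, 750, 1100, 800],
}

# Aggregate each parameter separately (median is robust to outliers).
aggregated = {name: median(guesses) for name, guesses in crowd_guesses.items()}

# The model's structure stays fixed; only the inputs come from the crowd.
cost_per_death_averted = (
    aggregated["cost_per_net_usd"] * aggregated["nets_per_death_averted"]
)

print(aggregated)
print(f"Estimated cost per death averted: ${cost_per_death_averted:,.0f}")
```

The point of aggregating per parameter rather than per conclusion is that each individual input is simple enough for the crowd to reason about, even when the full model isn't.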

Linked to the Stack Overflow point I made, I think there could easily be a site for crowdsourcing the answers to GiveWell's questions. I think there is a 10% chance that with 20k you could build a site that comes up with better answers, if EAs enjoyed making guesses for fun. Wikipedia is the best encyclopaedia in the world because it leverages the free time and energy of *loads* of nerds. GiveWell could do the same.

Aaron Gertler @ 2020-06-09T20:23 (+2)

Can you point to any examples of GiveWell numbers that you think a crowd would have a good chance of answering more accurately? A lot of the figures on the sheets either come from deep research/literature reviews or from subjective moral evaluation, both of which seem to resist crowdsourcing.

If you want to see what forecasting might look like around GiveWell-ish questions, you could reach out to the team at Metaculus and suggest they include some on their platform. They are, to my knowledge, the only EA-adjacent forecasting platform with a good-sized userbase. 

Overall, the amount of community participation in similar projects has historically been pretty low (e.g. no "EA wiki" has ever gotten mass participation going), and I think you'd have to find a way to change that before you made substantial progress with a crowdsourcing platform.

Aaron Gertler @ 2020-06-09T20:28 (+3)

Audience size is a big challenge here. There might be a few thousand people who are interested enough in EA to participate in the community at all (beyond donating to charity or joining an occasional dinner with their university group). Of those, only a fraction will be interested in contributing to crowdsourced intellectual work. 

By contrast, StackOverflow has a potential audience of millions, and Wikipedia's is larger still. And yet, the most active 1% of editors might account for... half, maybe, of the total content on those sites? (Couldn't quickly find reliable numbers.) 

If we extrapolate to the EA community, our most active 1% of contributors would be roughly 10 people, and I'm guessing those people already find EA-focused ways to spend their time (though I can't say how those uses compare to creating content on a website like the one you proposed).

Nathan Young @ 2020-06-11T17:51 (+1)

Has anyone ever tried making an EA Stack Exchange?

Nathan Young @ 2020-06-11T17:49 (+1)

I am not sure I can think of obvious numbers that a crowd couldn't answer with a similar level of accuracy. (There is also the question of what to compare accuracy against: future GiveWell evaluations?) Consider Metaculus's record vs. any other paid experts. I think your linked point about crowd size is the main one: how large a community could you mobilise to guess these things?

Metaculus produces world-class answers off a user base of 12,000. How many users does this forum have? I guess if you ran an experiment here you'd be pretty close. If you ran it elsewhere you might get 1-10% buy-in. I think even 3 orders of magnitude off isn't bad for an initial test, and if it worked it seems likely you could be within 1 order of magnitude pretty quickly.

I suggest the difference between this and an EA wiki would be that it would be answering questions.

Given the value that GiveWell offers, testing this seems very worthwhile.

Aaron Gertler @ 2020-06-12T05:38 (+2)

I would say "comparing the crowd's accuracy to reality" would be best, but "future GiveWell evaluations" is another reasonable option. 

Consider Metaculus's record vs any other paid experts.

Metaculus produces world class answers off a user base of 12,000.

I don't know what Metaculus's record is against "other paid experts," and I expect it would depend on which experts and which topic was up for prediction. I think the average researcher at GiveWell is probably much, much better at probabilistic reasoning than the average pundit or academic, because GiveWell's application process tests this skill and working at GiveWell requires that the skill be used frequently.

I also don't know where your claim that "Metaculus produces world-class answers" comes from. Could you link to some evidence? (In general, a lot of your comments make substantial claims without links or citations, which can make it hard to engage with them.)

Open Philanthropy has contracted with Good Judgment Inc. for COVID forecasting, so this idea is definitely on the organization's radar (and by extension, GiveWell's). Have you tried asking them why they don't ask questions on Metaculus or make more use of crowdsourcing in general? I'm sure they'd have a better explanation for you than anything I could hypothesize :-)

Nathan Young @ 2020-06-12T19:32 (+1)

Noted on the lack of citations.

I don't feel like Open Philanthropy would answer my speculative emails. Now that you point it out, they might, but in general I don't feel worthy of their time.

(Originally I wrote this beauty of a sentence: "previously I don't think I'd have thought they thought me worthy of their time.")

Aaron Gertler @ 2020-06-12T21:53 (+4)

If you really think GiveWell or Open Philanthropy is missing out on a lot of value by failing to pursue a certain strategy, it seems like you should aim to make the most convincing case you can for their sake!

(Perhaps it would be safer to write a post specifically about this topic, then send it to them; that way, even if there's no reply, you at least have the post and can get feedback from other people.)

Nathan Young @ 2020-06-13T16:00 (+1)

Also, there's possibly room for a "request citation" button. When you talk in different online communities, it's not clear how much citing you should do. An easy way to request and add citations would not require additional comments.

Nathan Young @ 2020-05-23T13:58 (+1)

I think it should be easier to give feedback to GiveWell. I would recommend not requiring a login and allowing people to give suggestions on the text of pages.

Aaron Gertler @ 2020-06-09T20:33 (+2)

I don't know what you mean by "log in"; you can give feedback on their blog posts just by leaving a name + email address, and their pages don't have comment sections to log into.

By "suggestions on the text of pages," do you mean suggestions other people can view? That seems like it would be a technical challenge, and I'd be surprised if it brought in much additional useful commentary compared to the status quo (that is, sending an email to GiveWell if you have a suggestion).

Can you think of any websites that have implemented "suggestions on the text of pages" in a way that led to their content being better, outside of wikis?

Nathan Young @ 2020-06-11T18:05 (+1)

3) I think my work on handbooks finds this to be the case, and Stack Overflow as well. I can't think of permanent sites, but I think that's cultural and technological. I'm happy to bet at 40% that one of the 3 largest blog sites will do this in 5 years' time.

I think many would baulk at the idea of others suggesting improvements to their work, but I think it fits in pretty well with the EA frame of mind.

Nathan Young @ 2020-06-11T18:00 (+1)

1) Fair point; it's been a while since I went on GiveWell's site.

2) What I'd suggest: there is a cursor in the text that people can use. If they type, it creates suggestions, like in Google Docs. Then there is a button to "see other suggestions", and you can upvote them.

Nathan Young @ 2020-05-23T13:57 (+1)

I think Stack Overflow is the gold standard for criticism. It's a question-answering website. It allows answers to be ranked and questions and answers to be edited. Not only do the best answers get upvoted, but answers and questions get clearer and higher quality over time. I suggest this should be the aim for GiveWell's analyses.

See examples of all such features on this question: https://rpg.stackexchange.com/questions/169345/can-a-druid-use-a-sending-stone-while-in-wild-shape
