External evaluation of GiveWell's research

By Aaron Gertler 🔸 @ 2020-05-22T04:09 (+14)

This is a linkpost to https://blog.givewell.org/2013/03/01/external-evaluation-of-our-research/

Aaron's note: I stumbled across this today and got lost in the links; I hadn't realized how much effort GiveWell put (at least in the early years) into hiring people to evaluate their research. Here's a full list of their external reviews, which goes into a lot more depth than this post.

Also, this post was originally written in 2013; I just recently discovered it and decided to repost.


We’ve long been interested in the idea of subjecting our research to formal external evaluation. We publish the full details of our analysis so that anyone may critique it, but we also recognize that it can take a lot of work to digest and critique our analysis, and we want to be subjecting ourselves to constant critical scrutiny (not just to the theoretical possibility of it).

A couple of years ago, we developed a formal process for external evaluations, and had several such evaluations conducted and published. However, we haven’t had any such evaluations conducted recently. This post discusses why.

In brief, formal external evaluation carries significant costs and challenges for us, and in the meantime we have been seeing more informal critical engagement with our research than ever before.

Between these two factors, we aren’t currently planning to conduct more external evaluations in the near future. However, we remain interested in external evaluation and hope eventually to make frequent use of it again. And if someone volunteered to do (or facilitate) formal external evaluation, we’d welcome this and would be happy to prominently post or link to criticism.

 

The challenges of external evaluation

The challenges of external evaluation are significant:

On the “evaluating research” front, one plausible candidate for “qualified evaluator” would be an accomplished development economist. However, in practice, many accomplished development economists (a) are extremely constrained in the time they have available; (b) have affiliations of their own (the more interested a scholar is in practical implications for aid, the more likely they are to be directly involved with a particular organization or intervention), which may bias their evaluation.

I felt that we struck a good balance with a 2011 evaluation by Prof. Tobias Pfutze, a development economist. Prof. Pfutze took ten hours to choose a charity to give to – using GiveWell’s research as well as whatever other resources he found useful – and we “paid” him by donating funds to the charity he chose. However, developing this assignment, finding someone who was both qualified and willing to do it, and providing support as the evaluation was conducted required significant capacity on our part.

Given the time investment these sorts of activities require on our part, we’re hesitant to go forward with one until we feel confident that we are working with the right person in the right way and that the research they’re evaluating will be representative of our work for some time to come.

 

Improvements in informal evaluation

Over the last year, we feel that we’ve seen substantially more deep engagement with our research than ever before, even as our investments in formal external evaluation have fallen off.

 

Where we stand

 

We continue to believe that it is important to ensure that our work is subjected to in-depth scrutiny. However, at this time, the scrutiny we’re naturally receiving – combined with the high costs and limited capacity for formal external evaluation – makes us inclined to postpone major effort on external evaluation for the time being.

That said, we remain interested in external evaluation, and we would welcome – and prominently post or link to – any formal evaluation that a volunteer was willing to conduct or facilitate.


Tsunayoshi @ 2020-05-22T12:10 (+5)

Related to external evaluations: 80000hours used to have a little box at the bottom of an article, indicating a score given to it by internal and external evaluators. Does anybody know why this is not being done anymore?

Ozzie Gooen @ 2020-05-26T11:42 (+12)

Oh man, happy to have come across this. I'm a bit surprised people remember that article. I was one of the main people who set up the system; that was a while back.

I don't know specifically why it was changed. I left 80k in 2014 or so and haven't discussed this with them since. I could imagine some reasons why they stopped it though. I recommend reaching out to them if you want a better sense.

This was done when the site was a custom Ruby/Rails setup, and the feature required a fair bit of custom coding to set up. Writing quality was more variable then than it is now; there were several newish authors and it was much earlier in the research process. I also remember that originally the scores disagreed a lot between evaluators, but over time (the first few weeks of use) they converged a fair bit.

After I left they migrated to WordPress, where I assume setting up a similar system would have required a fair bit of effort. The blog posts also seem to have become less important than they used to be, in favor of the career guide, coaching, the podcast, and other things. And the quality has become a fair bit more consistent, from what I can tell as an onlooker.

The ongoing costs of such a system are considerable. First, it just takes a fair bit of time from the reviewers. Second, unfortunately, the internet can be a hostile place for transparency: there are trolls and angry people who will actively search through details and then point them out without the proper context. I think this review system was kind of radical, and I can imagine it not being very comfortable to maintain unless it really justified the effort.

I'm of course sad it's no longer in place, but can't really blame them.

Nathan Young @ 2020-05-23T13:55 (+3)

I'm gonna be a bit of a maverick and split my comment into separate ideas so you can upvote or downvote them separately. I think this is a better way to do comments, but it looks a bit spammy.

Aaron Gertler @ 2020-06-09T20:19 (+3)

I actually think this is a better way to do comments in most cases!

Nathan Young @ 2020-06-11T18:05 (+1)

I suggest that higher-karma users be able to split your comments into multiple comments without editing any text.

Nathan Young @ 2020-06-11T18:01 (+1)

Also, while we're here, I think it would be cool if this website showed who else was currently viewing the page. It could also create more spontaneous EA connections. A friend has built something for this and I think it would be easy to implement.

Nathan Young @ 2020-06-11T17:50 (+1)

Yes, though not being able to comment more than once every X seconds is a little frustrating. (Sometimes I write all my comments, then split them up, then post them separately.)

Nathan Young @ 2020-05-23T13:59 (+2)

There should be more context on the important decision-making tools.

I could be wrong, but I think most decisions are made using Google Sheets. I've read a few of these and I think there could be more context around which numbers are the most important.

Nathan Young @ 2020-05-23T14:05 (+1)

It should be possible to give feedback on specific points.

In the future, I am confident that all articles will be able to have comments on any part of the text, like comments in a Google Doc. This means people can edit or comment on specific points. This is particularly important with Fermi models and could be implemented - people could comment on each part of an evaluation to criticise some specific bit. One wrong leap of logic in an argument makes the whole argument void, so GiveWell's models need this level of scrutiny.

Nathan Young @ 2020-05-23T14:03 (+1)

All the most important models should also have crowdsourced answers.

I *think* GiveWell uses models to make decisions. It would be possible to crowdsource numbers for each step, and I predict you would get better answers if you did this. The wisdom of crowds is a thing. It breaks down when the crowd doesn't understand the model, but if you are getting them to guess individual parts of a model, it works again.
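
As a rough sketch of what crowdsourcing "individual parts of a model" could look like: each person guesses each input separately, the guesses are aggregated per parameter, and the model itself combines the aggregates. (Everything below is hypothetical; the parameter names and figures are made up for illustration, not taken from GiveWell.)

```python
# Hypothetical sketch: aggregate crowd guesses for each step of a toy
# cost-effectiveness model, rather than asking the crowd for the final answer.
from statistics import median

# Each participant submits a guess for each input of the model (made-up numbers).
crowd_guesses = {
    "cost_per_net_usd":       [4.5, 5.0, 6.0, 4.8, 5.5],
    "nets_per_death_averted": [600, 900, 750, 1100, 800],
}

# Aggregate each parameter separately (median is robust to outliers).
aggregated = {name: median(guesses) for name, guesses in crowd_guesses.items()}

# The model's structure stays fixed; only the inputs come from the crowd.
cost_per_death_averted = (
    aggregated["cost_per_net_usd"] * aggregated["nets_per_death_averted"]
)

print(aggregated)
print(f"Estimated cost per death averted: ${cost_per_death_averted:,.0f}")
```

The point of aggregating per parameter rather than per conclusion is that each individual input is simple enough for the crowd to reason about, even when the full model isn't.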

Linked to the Stack Overflow point I made, I think there could easily be a site for crowdsourcing the answers to GiveWell's questions. I think there is a 10% chance that with 20k you could build a site that comes up with better answers, if EAs enjoyed making guesses for fun. Wikipedia is the best encyclopaedia in the world because it leverages the free time and energy of *loads* of nerds. GiveWell could do the same.

Aaron Gertler @ 2020-06-09T20:23 (+2)

Can you point to any examples of GiveWell numbers that you think a crowd would have a good chance of answering more accurately? A lot of the figures on the sheets either come from deep research/literature reviews or from subjective moral evaluation, both of which seem to resist crowdsourcing.

If you want to see what forecasting might look like around GiveWell-ish questions, you could reach out to the team at Metaculus and suggest they include some on their platform. They are, to my knowledge, the only EA-adjacent forecasting platform with a good-sized userbase. 

Overall, the amount of community participation in similar projects has historically been pretty low (e.g. no "EA wiki" has ever gotten mass participation going), and I think you'd have to find a way to change that before you made substantial progress with a crowdsourcing platform.

Aaron Gertler @ 2020-06-09T20:28 (+3)

Audience size is a big challenge here. There might be a few thousand people who are interested enough in EA to participate in the community at all (beyond donating to charity or joining an occasional dinner with their university group). Of those, only a fraction will be interested in contributing to crowdsourced intellectual work. 

By contrast, StackOverflow has a potential audience of millions, and Wikipedia's is larger still. And yet, the most active 1% of editors might account for... half, maybe, of the total content on those sites? (Couldn't quickly find reliable numbers.) 

If we extrapolate to the EA community, our most active 1% of contributors would be roughly 10 people, and I'm guessing those people already find EA-focused ways to spend their time (though I can't say how those uses compare to creating content on a website like the one you proposed).

Nathan Young @ 2020-06-11T17:51 (+1)

Has anyone ever tried making an EA Stack Exchange?

Nathan Young @ 2020-06-11T17:49 (+1)

I am not sure I can think of obvious numbers that a crowd couldn't answer with a similar level of accuracy. (There is also the question of what to compare accuracy against: future GiveWell evaluations?) Consider Metaculus's record vs. any other paid experts. I think your linked point about crowd size is the main one: how large a community could you mobilise to guess these things?

Metaculus produces world-class answers off a user base of 12,000. How many users does this forum have? I guess if you ran an experiment here you'd be pretty close. If you ran it elsewhere you might get 1-10% buy-in. I think even 3 orders of magnitude off isn't bad for an initial test, and if it worked it seems likely you could be within 1 order of magnitude pretty quickly.

I suggest the difference between this and an EA wiki would be that it would be answering questions.

Given the value that GiveWell offers, testing this seems very worthwhile.

Aaron Gertler @ 2020-06-12T05:38 (+2)

I would say "comparing the crowd's accuracy to reality" would be best, but "future GiveWell evaluations" is another reasonable option. 

Consider Metaculus's record vs any other paid experts.

Metaculus produces world class answers off a user base of 12,000.

I don't know what Metaculus's record is against "other paid experts," and I expect it would depend on which experts and which topic was up for prediction. I think the average researcher at GiveWell is probably much, much better at probabilistic reasoning than the average pundit or academic, because GiveWell's application process tests this skill and working at GiveWell requires that the skill be used frequently.

I also don't know where your claim that "Metaculus produces world-class answers" comes from. Could you link to some evidence? (In general, a lot of your comments make substantial claims without links or citations, which can make it hard to engage with them.)

Open Philanthropy has contracted with Good Judgment Inc. for COVID forecasting, so this idea is definitely on the organization's radar (and by extension, GiveWell's). Have you tried asking them why they don't ask questions on Metaculus or make more use of crowdsourcing in general? I'm sure they'd have a better explanation for you than anything I could hypothesize :-)

Nathan Young @ 2020-06-12T19:32 (+1)

Noted on the lack of citations.

I don't feel like Open Philanthropy would answer my speculative emails. Now that you point it out, they might, but in general I don't feel worthy of their time.

(Originally I wrote this beauty of a sentence: "previously I don't think I'd have thought they thought me worthy of their time.")

Aaron Gertler @ 2020-06-12T21:53 (+4)

If you really think GiveWell or Open Philanthropy is missing out on a lot of value by failing to pursue a certain strategy, it seems like you should aim to make the most convincing case you can for their sake!

(Perhaps it would be safer to write a post specifically about this topic, then send it to them; that way, even if there's no reply, you at least have the post and can get feedback from other people.)

Nathan Young @ 2020-06-13T16:00 (+1)

Also, there's possibly room for a "request citation" button. When you talk in different online communities, it's not clear how much citing you should do. An easy way to request and add citations would not require additional comments.

Nathan Young @ 2020-05-23T13:58 (+1)

I think it should be easier to give feedback to GiveWell. I would recommend not requiring a login and allowing people to give suggestions on the text of pages.

Aaron Gertler @ 2020-06-09T20:33 (+2)

I don't know what you mean by "log in"; you can give feedback on their blog posts just by leaving a name + email address, and their pages don't have comment sections to log into.

By "suggestions on the text of pages," do you mean suggestions other people can view? That seems like it would be a technical challenge, and I'd be surprised if it brought in much additional useful commentary compared to the status quo (that is, sending an email to GiveWell if you have a suggestion).

Can you think of any websites that have implemented "suggestions on the text of pages" in a way that led to their content being better, outside of wikis?

Nathan Young @ 2020-06-11T18:05 (+1)

3) I think my work on handbooks finds this to be the case, and Stack Overflow as well. I can't think of permanent sites, but I think that's cultural and technological. I'm happy to bet at 40% that one of the 3 largest blog sites will do this in 5 years' time.

I think many would baulk at the idea of others suggesting improvements to their work, but I think it fits in pretty well with the EA frame of mind.

Nathan Young @ 2020-06-11T18:00 (+1)

1) Fair point; it's been a while since I went on GiveWell's site.

2) What I'd suggest: there is a cursor in the text that people can use. If they type, it creates suggestions, like in Google Docs. Then there is a button to "see other suggestions", and you can upvote them.

Nathan Young @ 2020-05-23T13:57 (+1)

I think Stack Overflow is the gold standard for criticism. It's a question-answering website. It allows answers to be ranked and questions and answers to be edited. Not only do the best answers get upvoted, but answers and questions get clearer and higher quality over time. I suggest this should be the aim for GiveWell's analyses.

See examples of all such features on this question: https://rpg.stackexchange.com/questions/169345/can-a-druid-use-a-sending-stone-while-in-wild-shape
