AI Red Teaming at GiveWell: What We've Learned (and Where We'd Welcome Your Input)

By Brendan Phillips @ 2026-01-14T19:47 (+36)

At GiveWell, we've been experimenting with using AI to red team our global health intervention research—searching for weaknesses, blind spots, or alternative interpretations that might significantly affect our conclusions. We've just published a write-up on what we’ve learned, both about the programs we fund through donor support and about how to use AI in our research.

We're sharing this to invite critiques of our approach and to see if others have found methods for critiquing research with AI that work better. Specifically, we'd love to see people try their own AI red teaming approaches on our published intervention reports or grant pages. If you generate critiques we haven't considered or find prompting strategies that work better than ours, please share them in the comments—we'd be interested to see both your methodology and the specific critiques you uncover.

Our process

Our research team spends more than 70,000 hours each year reviewing academic evidence and investigating programs to determine how much good they accomplish per dollar spent. This in-depth analysis informs our grantmaking, directing hundreds of millions in funding annually to highly cost-effective, evidence-backed programs.

Our current approach for supplementing that research with AI red teaming:

  1. Literature review stage: An AI using "Deep Research" mode synthesizes recent academic literature on the intervention[1]

  2. Critique stage: A second AI reviews both our internal analysis and the literature summary to identify gaps in our analysis[2]
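
For illustration only, here's a minimal sketch of what a scripted version of this two-stage shape could look like, assuming the Anthropic Python SDK. In practice we run both stages through standard chat interfaces rather than code, and the model name, prompts, and helper function below are placeholders rather than what we actually use.

```python
# Minimal sketch of the two-stage workflow, assuming the Anthropic Python SDK.
# Our actual process runs through standard chat interfaces, not code; the
# model name, prompts, and helper below are illustrative placeholders.
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set in the environment
MODEL = "claude-sonnet-4-20250514"  # placeholder; substitute whichever model you use


def ask(prompt: str) -> str:
    """Send one prompt to the model and return its text response."""
    response = client.messages.create(
        model=MODEL,
        max_tokens=4000,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text


def red_team(intervention: str, internal_analysis: str) -> str:
    # Stage 1: synthesize recent academic literature on the intervention.
    # (We use a "Deep Research" mode for this step, which searches the web;
    # a plain completion like this one relies only on the model's training data.)
    literature_summary = ask(
        f"Summarize the recent academic literature on {intervention}, "
        "citing specific studies wherever possible."
    )

    # Stage 2: a second model pass critiques the internal analysis in light
    # of the literature summary.
    return ask(
        "You are red teaming a charity evaluator's analysis of "
        f"{intervention}. Identify gaps, blind spots, or alternative "
        "interpretations that could materially change its conclusions.\n\n"
        f"Internal analysis:\n{internal_analysis}\n\n"
        f"Literature summary:\n{literature_summary}"
    )
```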

We applied this approach to six grantmaking areas, and it generated several critiques worth investigating per intervention, including:

  - Reinfection risks in syphilis programs
  - Natural recovery bias in malnutrition treatment
  - Strain mismatch in malaria vaccines

For more on our current approach and the critiques it identified, see our public write-up.

Our prompting approach

Our red teaming prompt (example here) has a few key features:

  - It asks for many candidate critiques rather than a few polished ones
  - It asks the model to check each critique for novelty against the report itself
  - It uses structured critique categories
  - It includes prompts aimed at surfacing less obvious perspectives

We arrived at this through trial and error rather than systematic testing. We're uncertain which elements actually drive the useful output and which are counterproductive.
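
Purely to make that structure concrete, a hypothetical prompt skeleton encoding features like these might look as follows. This is not our actual prompt (that's linked above); the categories, counts, and wording are placeholders.

```python
# Hypothetical prompt skeleton; not our actual red teaming prompt.
# Categories and counts are placeholders for illustration.
CRITIQUE_PROMPT = """\
You are red teaming the attached intervention report.

1. Generate at least {n_candidates} candidate critiques; favor breadth over polish.
2. For each critique, check whether the report already addresses it, and drop any
   that it does.
3. Label each remaining critique with one category: evidence quality, causal chain,
   cost estimate, or implementation.
4. Include several critiques from less obvious perspectives (e.g., program
   recipients, local implementers, skeptical outside reviewers).

For each critique, state the claim, why it could matter for cost-effectiveness,
and what evidence would confirm or refute it.
"""

print(CRITIQUE_PROMPT.format(n_candidates=20))
```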

What we learned about using AI for research critiques

A few initial lessons:

  - The AI was most useful for identifying relevant academic literature we hadn't yet incorporated, and least useful for interventions we had already reviewed extensively.
  - AI-generated quantitative estimates of how much a critique would affect our conclusions were often unsupported.
  - Roughly 85% of critiques were filtered out as irrelevant or based on misunderstandings, so substantial human filtering was still required.

A note on timing: This evaluation was conducted 4-5 months ago. While we haven't done systematic retesting with the same prompts and context, our impression is that critique relevance has improved, primarily through better alignment with the types of critiques we're looking for. Our rough guess is that the rate of relevant critiques may now be closer to ~30%, a meaningful improvement but not enough to change our research workflows.

Improvements we've considered but not pursued

We've deliberately kept our approach simple—running prompts through standard chat interfaces (Claude, ChatGPT, Gemini) that our researchers are already comfortable with. We've considered, but chosen not to pursue, more complex workflows and custom tooling.

We suspect the gains from adding complexity to this workflow would be marginal and unlikely to outweigh the friction of adopting less familiar tools. But we hold this view loosely—if someone has achieved meaningfully better results with more sophisticated approaches, we'd consider spending more time on them.

Why we're sharing this

We think we're still in the early stages of learning how to use AI well, but we've developed preliminary views about what works and what doesn't, and we'd appreciate input from others thinking about similar problems.

Specifically, we’d welcome hearing about:

  - Critiques of our published intervention reports or grant pages that we haven't considered
  - Prompting strategies or AI critique methods that have worked better than ours

If you're interested in trying your own approach on one of our published intervention reports, we'd be curious to see what you get–both methodology and output.

  1. ^

     Typically, ChatGPT Pro with Deep Research enabled or similar.

  2. ^

     This step typically uses whichever of Claude, ChatGPT, or Gemini is considered the best for research at that moment.


titotal @ 2026-01-15T12:52 (+10)

Overall this seems like a sensible, and appropriately skeptical, way of using LLMs in this sort of work.

Regarding improving the actual AI output, it looks like there is insufficient sourcing of claims in what it puts out, which is going to slow you down when you actually try to check the output. I'm looking at the red team output here on water turbidity. This was highlighted as a real contribution by the AI, but the output has zero sourcing for its claims, which presumably made it much harder to check for validity. If you were to get this critique from a real, human red-teamer, they would make it significantly easier to check that the critique was valid and sourced.

One question I have to ask is whether you are measuring how much time and effort is being expended on managing the output of these LLMs and sifting out the actually useful recommendations. When assessing whether the technique is a success, you have to consider the counterfactual case where that time was replaced by human research time looking more closely at the literature, for example.

titotal @ 2026-01-16T11:52 (+8)

One other issue I thought of since my other comment: you list several valid critiques the AI made that you'd already identified but which were not in the provided source materials. You state that this gives additional credence to the helpfulness of the models:

three we were already planning to look into but weren't in the source materials we provided (which gives us some additional confidence in AI’s ability to generate meaningful critiques of our work in the future—especially those we’ve looked at in less depth).

However, just because a critique is not in the provided source materials, that doesn't mean it's not in the wider training data of the LLM. So, for example, if GiveWell discussed the identified issue of "optimal chlorine doses" in a blog comment or something, and that blog got scraped into the LLM's training data, then the critique is not a sign of LLM usefulness: the model may just be parroting your own findings back to you.

Yarrow Bouchard 🔸 @ 2026-01-15T15:27 (+2)

My experience is similar. LLMs are powerful search engines but nearly completely incapable of thinking for themselves. I use these custom instructions for ChatGPT to make it much more useful for my purposes:

When asked for information, focus on citing sources, providing links, and giving direct quotes. Avoid editorializing or doing original synthesis, or giving opinions. Act like a search engine. Act like Google.

There are still limitations:

The most one-to-one analogy for LLMs in this use case is Google. Google is amazingly useful for finding webpages. But when you Google something (or search on Google Scholar), you get a list of results, many of which are not what you’re looking for, and you have to pick which results to click on. And then, of course, you actually have to read the webpages or PDFs. Google doesn’t think for you; it’s just an intermediary between you and the sources.

I call LLMs SuperGoogle because they can do semantic search on hundreds of webpages and PDFs in a few minutes while you're doing something else. LLMs as search engines are a genuine innovation.

On the other hand, when I’ve asked LLMs to respond to the reasoning or argument in a piece of writing or even just do proofreading, they have given incoherent responses, e.g. making hallucinatory "corrections" to words or sentences that aren’t in the text they’ve been asked to review. Run the same text by the same LLM twice and it will often give the opposite opinion of the reasoning or argument. The output is also often self-contradictory, incoherent, incomprehensibly vague, or absurd.

SummaryBot @ 2026-01-14T21:51 (+2)

Executive summary: GiveWell reports that using AI to red team its global health research has surfaced some worthwhile critiques—especially by filling literature gaps—but remains limited by low relevance rates, unreliable quantitative claims, and the need for substantial human filtering, and the team invites others to test alternative AI critique methods.

Key points:

  1. GiveWell piloted a two-stage AI red teaming process—AI literature synthesis followed by AI critique of internal analysis—across six grantmaking areas.
  2. The approach generated several critiques worth investigating, such as reinfection risks in syphilis programs, natural recovery bias in malnutrition treatment, and strain mismatch in malaria vaccines.
  3. The prompting strategy emphasized generating many candidate critiques, checking for novelty against the report, using structured categories, and including prompts aimed at less obvious perspectives.
  4. The authors found AI most useful for identifying relevant academic literature they had not yet incorporated, but least useful for interventions already extensively reviewed.
  5. AI-generated quantitative impact estimates were often unsupported, and roughly 85% of critiques were filtered out as irrelevant or based on misunderstandings.
  6. GiveWell chose not to pursue more complex workflows or custom tooling, judging that expected gains would likely be marginal relative to added friction, while remaining open to contrary evidence from others.

 

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.