Global Health Charity Founders on GiveWell, Evidence Action, and M+E

By anonfounder @ 2026-05-17T19:13 (+93)

The following is a lightly edited and anonymized transcript of a discussion among charity founders, researchers, and funders in Ambitious Impact’s Slack workspace about GiveWell's decision to stop funding Evidence Action's Dispensers for Safe Water program. The conversation surfaced themes that we felt were worth sharing more broadly. Everyone involved consented to their thoughts being shared in this way. An LLM was used to process and anonymize the transcript.

For context/credibility, several of the organizations involved were GiveWell grantees.

Core Takeaways

I drew out a few core themes from the conversation and placed them here at the top as takeaways. The full anonymized discussion is reproduced below. Names have been redacted, and some messages have been rephrased to remove identifying details.

1. Independent verification has to happen. M+E should, at minimum, validate core program inputs and outputs at the start and end of a theory of change (e.g., in a basic healthcare program, you’d like to know whether the child was actually sick at the start of treatment, and whether they were actually healthy at the end of treatment). It was surprising that this seemed to come late to Dispensers for Safe Water.

2. GiveWell probably underestimates implementation difficulty across its portfolio. Multiple participants, including GiveWell grantees themselves, argued that far more analytical rigor goes into assessing whether an intervention works under controlled conditions than into assessing whether a given organization can actually deliver it at scale. In general, doing things well in low-resource environments is really hard and costs money. If other organizations cut corners and GiveWell isn’t monitoring them closely enough to notice, it ends up harming the orgs that are investing the time required.

3. Cost estimation in cost-effectiveness analyses deserves much more attention. Currency fluctuations, inconsistent overhead allocation, and differing cost methodologies across organizations can shift cost-effectiveness estimates dramatically — sometimes more than the effect-size parameters that receive far more analytical scrutiny.

4. Leadership's proximity to the intervention, and org specialization, matter. Several participants argued that the Evidence Action program difficulties reflect the challenge of multi-program organizations where senior leadership cannot deeply understand every intervention. What distinguishes strong implementers may be whether leadership actually lives the intervention day-to-day and understands it well enough to iterate.

The Discussion

Context: GiveWell published a lookback on its ~$65 million grant to Evidence Action's Dispensers for Safe Water program. An independent survey found that internal monitoring had substantially overstated chlorine dispenser usage: when surveyors were required to photograph test results, reported chlorination rates dropped by about 20 percentage points.

The initial discussion focused on how the transparency from Evidence Action and GiveWell on this issue was commendable.

Kevin Starr, a prominent critic, later published a post arguing that Evidence Action has a pattern of implementation failures masked by "follow the evidence" messaging. The discussion below followed.

Donor Advisor A: I would be glad if others shared their views on this, because I am wondering how much this should make me re-evaluate Evidence Action negatively. Should it move my assessment by 1% or 25%?

Researcher B: I'm still not sure what to think of the dispensers case, but I do agree with the critique of No Lean Season. So many people I've met now hold the belief that the intervention "doesn't work" or "doesn't scale" — I used to think so too. It was only when I decided to dig into why exactly it didn't work that I learned it was an implementation failure. I think it's quite possible that the way Evidence Action talked about their experience discouraged other implementers from giving migration incentives a go.

Founder C: I'm probably on a small negative update — maybe -3%. I agree that implementation failures are avoidable failures, but I also think that some non-zero failure rate is a signal of being ambitious and fast enough. Overall I still think of Evidence Action as not dissimilar to the GiveWell All Grants Fund, but slightly more willing to take risks to scale quickly, which I think is net a good thing.

The slight negative update for me is more like "this failure seems relatively predictable and avoidable, so it says something about their implementation competence." I probably agree that the virtue signaling is a bit overdone.

But I also think they are still a rare example of an org publicly saying "we got this wrong so we are stopping it" — even if they are implying more transparency about their reasoning than is perhaps truly the case.

Researcher B: I'm not confident that Evidence Action is worse than the typical GiveWell grantee. But they've received more funding than most — $65 million for dispensers alone — which creates extra responsibility when it comes to M&E.

On whether they are "a rare example of an org publicly stopping a program": I don't think they had another option. GiveWell concluded, after independent studies funded in 2020 and 2025, that Evidence Action's internal data was overestimating reach, and GiveWell decided to stop funding it. Evidence Action openly collaborated throughout, but it doesn't seem like a case of them discontinuing based on their own reasoning.

Founder D: Really interesting discussion. A slightly different take: an implementation "failure" does provide at least some evidence that an intervention is hard to implement and scale. If a really committed, resourced, and impact-focused org "fails" to implement, doesn't that say something about how easy the intervention is to replicate?

I also tend to believe that the near best-case implementation will happen in the RCT because of the attention and money involved. And it can be very hard to hide the measurement itself from recipients and implementers, so the measurement that happens in the RCT should basically be considered part of the intervention.

Researcher B: Great points. A few thoughts:

Yes, implementation failures do point to programs being difficult to implement. Sometimes it might make ideas unpromising. But sometimes not. I could easily see a charity doing app-based training for healthcare workers that fails to stop participants from dropping out and finds no improvement in health outcomes. Does that mean "app-based training for healthcare workers is too hard," or that it's a challenge that needs dedication and innovativeness to crack?

Concerning No Lean Season, a comparison case in my mind is New Incentives. Both programs offer conditional cash transfers — one for migration, the other for vaccines. And both programs' experience suggests that CCTs are hard to pull off. But my sense is that New Incentives has cracked the problem whereas Evidence Action gave up.

So maybe what you need in those cases is extra inputs, checks, M&E, etc. to make the programs work. A question this raises is whether the intervention still looks cost-effective even with all the extra resources required for high-fidelity implementation. In the case of New Incentives, the answer seems to be yes. For rural-urban migration incentives, it might be no.

While I agree that RCTs will typically observe a bigger effect size than scaled-up implementations, I don't think it's always necessarily the case. One weakness of RCTs is that they need to design their program upfront and stick to it throughout. You can't really tweak, innovate, or iterate while your RCT is live. This might be a problem for complex behavior-change interventions where, with good M&E, you might keep discovering better ways of delivering the intervention. So I suspect there are cases where charity programs are more effective than the original RCTs that inspired them.

All of this points to the need for strong M&E, measurement of what is and isn't happening, and trying to figure out why behavior change is or isn't occurring.

At this point, Tony Senanayake had written up a post on the EA Forum discussing some of these same issues. Discussion then started in response to some of the ideas raised there. The general argument in that piece is that funders (the piece focused on GiveWell) pay extraordinary attention to whether an intervention works under controlled conditions but far less attention to whether implementation is actually delivering impact. It proposes six steps funders could take to evaluate implementation fidelity more seriously.

Founder F: Really agree with the point about cost pressure in budgets [militating against strong investment in M+E] when you're trying to keep costs down for your top-line cost-effectiveness number. It creates a race-to-the-bottom architecture, because you're competing in allocation against another org that is willing to cut costs.

Founder C: Great piece. One counter could be that raising monitoring scrutiny might add excessive burden — one solution could be for monitoring scrutiny to go up while internal validity scrutiny comes down a bit alongside it.

Then again, I'm also sympathetic to the claim that the sector needs to just level up its implementation quality and validation, and everyone needs to accept this will cost money. The audit analogy makes a lot of sense to me. The value-of-information argument is solid. Perhaps the "real" cost of confidently improving wellbeing is just higher than GiveWell currently assumes, because lots of its grantees are currently doing so less effectively than is modeled. More monitoring investment across the board would match actual expenses to this slightly higher true cost while allocating resources more efficiently to the best projects.

Founder F: I think the framing that relative cost-effectiveness stays the same if GiveWell imposes a higher monitoring burden on all of its grantees makes a lot of sense. And I agree that GiveWell probably underestimates implementation difficulty and is assuming that its grantees are better implementers across the board than they actually are. I say this as a GiveWell grantee.
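To illustrate the framing above, here is a minimal sketch of why a uniform monitoring burden leaves relative cost-effectiveness unchanged. The two grantees, their impact and cost figures, and the 10% uplift are all hypothetical, and the sketch assumes the burden is the same percentage of cost for every grantee.

```python
def cost_effectiveness(impact_units: float, cost_usd: float) -> float:
    """Impact delivered per dollar spent."""
    return impact_units / cost_usd


# Hypothetical grantees: (impact units delivered, program cost in USD).
GRANTEES = {
    "Org A": (1_000.0, 50_000.0),
    "Org B": (600.0, 40_000.0),
}

# Assumption: every grantee pays the same percentage uplift for monitoring.
MONITORING_UPLIFT = 0.10


def ratio_a_to_b(uplift: float) -> float:
    """Cost-effectiveness of Org A relative to Org B under a given cost uplift."""
    impact_a, cost_a = GRANTEES["Org A"]
    impact_b, cost_b = GRANTEES["Org B"]
    ce_a = cost_effectiveness(impact_a, cost_a * (1 + uplift))
    ce_b = cost_effectiveness(impact_b, cost_b * (1 + uplift))
    return ce_a / ce_b


ratio_before = ratio_a_to_b(0.0)
ratio_after = ratio_a_to_b(MONITORING_UPLIFT)
print(f"A vs. B before: {ratio_before:.3f}, after: {ratio_after:.3f}")
```

Each org's absolute cost-effectiveness falls, but the ratio between them does not move. If the burden were instead a fixed dollar amount per grantee, smaller-budget orgs would be hit proportionally harder and the ratio would shift.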

Founder G: I agree with most of the analysis above. Here are my thoughts.

This was an implementation failure first. The chlorine wasn't in the water. In multiple countries. After tens of millions of dollars. This wasn't an aberration — it was a normal failure state, probably for years. This is very poor implementation by any standard. It bewilders me that neither their implementation team nor their M&E team picked this up in any country.

I 100% agree that GiveWell hugely underrates the difficulty of doing real things in low-income countries.

On the implementation side, I think much of the poor implementation can be explained by the do-many-things approach of large, tens-of-millions-of-dollar, multi-intervention orgs. These orgs are jack-of-all-trades, do many things, and are unlikely to become master of any. Perhaps the fundamental failure was that Evidence Action just wasn't very good at getting dispensers out there, training communities, and doing basic follow-up. They are an org run by program managers and NGO workers who are mostly not experts on water dispensers.

I think with a few exceptions, most orgs should do one thing and get very, very good at it over a long period of time. It takes many years to see blind spots, incrementally improve, and actually get good at doing something.

I would probably rather that GiveWell focused on funding orgs that already do something very well, rather than put out RFPs for interventions they've calculated are "high expected value" and then hand out multi-million dollar contracts for orgs to build an intervention from scratch. This means those orgs are doing 3-5 completely different things at any one time, and their top management can't realistically be expected to understand all of these deeply.

That said, it's interesting that this "failed" intervention — which reached maybe 1/4 as many people as originally thought — was still probably helping people more than 90% of NGO work out there.

Founder C: Good points. I also think that multi-program orgs can sometimes claim expertise on a subject to raise money, when the actual team implementing the project doesn't contain those experts and is basically made up of beginners.

A counter to the specialization argument: maybe there isn't the time or ingredients available to start many new specialist orgs to solve all these problems at scale, so maybe it's still additive to have Evidence Action-type orgs that can quickly spin up programs. But it seems reasonable to argue that those non-specialist orgs need particular scrutiny on their implementation.

Founder F: I'll chime back in with some thoughts on these threads.

Generally, people don't spend enough time on M&E — the incentive structure is bad.

I actually don't know that specialist vs. non-specialist org is the right frame, though it is probably part of the problem. I think it comes down to: where is your leadership team, and how much hands-on interface does your program team actually have with the intervention itself?

Is your program team a bunch of career NGO people in an office who basically just make grants to people who actually do the implementation thousands of miles away? Or is your program team hands-on with the intervention — the actual senior people — seeing problems, iterating?

I'd note that this is a problem even specialist orgs can fall into. If you implement through an implementing partner with occasional field visits, can we really say you're going to catch intervention issues better? Maybe you gain more specific subject matter expertise over time, so it's easier. But you're still not there, you're still not iterating on the problem.

I'd argue that the reason certain successful orgs work is not wholly (or even mostly) that they are specialized, but that senior leadership lives the intervention, actually understands it, and holds people to a standard, which you can't do if you aren't there.

Basically, I worry a lot about this happening to other charities with similar implementation structures, even if they are more specialized — ticking time bombs of unknown unknowns.

Founder C: One random thought: in GiveWell's lookback on New Incentives, actual cost-effectiveness came out at roughly double the estimate, largely because cost per child reached was roughly half the estimate. A large driver of this seems to have been devaluation of the Naira: they modeled that NI would get 1250 Naira/USD and in practice they got 670-790.

How much time does GiveWell spend thinking about exchange rates?

Given uncertainties in parameters like this, it does seem quite apparent that you can spend a lot of effort trying to be precise in one corner of your CEA while there's another broad — and possibly pretty irreducible — uncertainty sitting somewhere else that wipes out that precision.
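A minimal sketch of the mechanism being described: when program costs are paid in local currency, the USD cost per child is the local cost divided by the exchange rate, so a swing in the rate feeds straight into cost-effectiveness. All figures below are hypothetical and are not taken from the lookback. The sketch also assumes local-currency costs stay fixed (no inflation pass-through), which is a simplification.

```python
def usd_cost_per_child(local_cost_per_child: float, local_per_usd: float) -> float:
    """Convert a local-currency cost per child into USD at a given exchange rate."""
    return local_cost_per_child / local_per_usd


# Hypothetical inputs, chosen only to make the arithmetic easy to follow.
LOCAL_COST_PER_CHILD = 16_000.0  # local currency units per child reached
MODELED_RATE = 800.0             # local units per USD assumed in the model
REALIZED_RATE = 1_600.0          # local units per USD actually obtained

modeled_usd = usd_cost_per_child(LOCAL_COST_PER_CHILD, MODELED_RATE)
realized_usd = usd_cost_per_child(LOCAL_COST_PER_CHILD, REALIZED_RATE)

# With impact per child held fixed, cost-effectiveness scales with 1 / cost.
fx_multiplier = modeled_usd / realized_usd

# For comparison: a 5% upward adjustment to the effect size.
effect_multiplier = 1.05

print(f"Modeled: ${modeled_usd:.2f} per child, realized: ${realized_usd:.2f} per child")
print(f"Exchange-rate multiplier: {fx_multiplier:.2f}x vs. effect adjustment: {effect_multiplier:.2f}x")
```

In this illustration the exchange-rate swing alone doubles cost-effectiveness, far more than the 5% effect-size adjustment does.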

Founder E: From some direct experiences — very little focus is placed on financial forecasts including conversion rates. Strongly agree: the time spent on a 5% upward or downward adjustment from literature can easily be wiped out by changes in other aspects.

Researcher B: Just to flag, this conversation touches on a broader point I've noticed: across EA cost-effectiveness analyses, much more effort typically goes toward estimating the effect than estimating the costs. Sometimes you'll see a GiveWell CEA that has one line for costs (something like "average per-person cost reported by the grantee") and then a hundred lines for effects.

You could argue that makes sense, since the costs of live programs are roughly known whereas effects need to be estimated. But if you're projecting costs into the future, you run into issues that deserve attention: inflation, rising wages, varying exchange rates. This is important, for example, in AIM’s case, since they typically model the costs of hypothetical programs based on the costs of similar programs in other countries at some point in the past.

Founder F: No one does inflation or exchange rates well. I say this as someone who has put at least some work into it, and we've still seen crazy shifts in our CE number based on [African country currency] changes over the last 24 months. We should get more private sector input into this instead of just trying to figure it out ourselves.

There are real differences between cost-modeling methodologies — EA tends toward all-in costs, big INGOs split costs in very complex ways. I don't think GiveWell or other evaluators take this seriously enough, and I've seen grants go to people quoting frankly insane cost numbers where there is no way that is their actual cost — they just calculated it in a weird way.

That said, I also think there are downsides to all-in costing approaches, where it discourages investment in the future or in R&D, because that might make you not cost-effective now if you include all those costs.

Founder E: Another thing I have noticed is which program costs are considered and which are not, particularly in multi-program organizations. For example, some single-intervention orgs report all costs including all overheads. However, orgs implementing many different interventions do not include certain aspects of overheads when sharing cost per beneficiary. These differences can lead to substantially different estimates.

Founder G: I'm pretty comfortable saying that certain external data collection organizations are too expensive out of proportion to what they provide. They have a bit of a monopoly in some areas. Some orgs have tried to be a "lean" alternative, but I think there's a big opportunity to halve the cost of external monitoring or more.

Founder D: While GiveWell's level of analytic rigor on the evidence base is undeniably high, I personally suspect that many of the RCTs that make up the evidence base have similar issues to those identified in this case study. A few examples:

In many RCTs, the implementing organization is very involved in data collection, and sometimes the academics running the RCT are also developing and leading implementation of the intervention itself. For many global health interventions, blinding isn't feasible. Observer effects and desirability effects are likely present and likely to be higher in intervention arms. We've found examples of large RCTs that don't follow CONSORT guidelines, and when we dig into the details, we uncover issues like baseline imbalances that threaten the validity of conclusions.

I believe there's a kind of halo effect from pharmaceutical clinical trials through which cluster RCTs of complex interventions are put in the same "gold standard" category. When we look at the implementation details, an individual RCT that goes through FDA review has more differences from most global health cluster RCTs than it has similarities.

Founder F: Great point — the evidence all kind of sucks. Though I'd still think probably better to at least try than give up on it entirely. Not a ton of better options unfortunately.

Founder G: [Switching back to costs/budget discussion] As boring as it is, I think it's probably reasonable for GiveWell to expect quite a detailed budget for any specific "project" over, say, 500k (or even a million) dollars. Sure, it's some admin time, but that's also a !!#@!ton of money.

This should include all costs, including management costs and government assistance costs, which the BINGOs are at risk of leaving out. With AI, both producing and analysing this kind of budget takes half the time it used to, or less.

I think as intervention size grows, founder salaries become less of a big deal. I think management costs generally scale with implementation costs in a fairly linear-ish way (might be 2x or .5x, but that's not going to change cost-effectiveness by more than 20%-30%). Costs don't always go down as scale happens, though; that's a whole nother discussion, but it can easily go either way.
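A rough sketch of the arithmetic behind that bound: how much cost-effectiveness moves when management costs are doubled or halved depends on what share of total cost they make up. The 25% management share below is a hypothetical assumption, not a figure from the discussion, and impact is held fixed.

```python
def ce_change(mgmt_share: float, mgmt_multiplier: float) -> float:
    """Fractional change in cost-effectiveness when management costs are
    multiplied by `mgmt_multiplier`, with impact and other costs held fixed.

    `mgmt_share` is management cost as a fraction of original total cost.
    """
    new_total_cost = (1 - mgmt_share) + mgmt_share * mgmt_multiplier
    return 1 / new_total_cost - 1


# Hypothetical: management makes up a quarter of total program cost.
MGMT_SHARE = 0.25

change_if_doubled = ce_change(MGMT_SHARE, 2.0)
change_if_halved = ce_change(MGMT_SHARE, 0.5)

print(f"Management costs 2x:   {change_if_doubled:+.1%} cost-effectiveness")
print(f"Management costs 0.5x: {change_if_halved:+.1%} cost-effectiveness")
```

Under this assumption, doubling management costs lowers cost-effectiveness by 20% and halving them raises it by about 14%. A larger management share would produce larger swings.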

James Herbert @ 2026-05-18T10:36 (+8)

Thanks for taking the time to anonymise and share this - good insights here so it’s great to have it in the public domain

Toby Tremlett🔹 @ 2026-05-18T10:40 (+4)

+1, so many of my wishlist conversations to get on the Forum are in slacks somewhere. This is a great example of how we can make those convos public.

Calum @ 2026-05-18T18:14 (+5)

Cool discussion, thanks for sharing it!!

ryancbriggs @ 2026-05-22T15:16 (+4)

This was an interesting read. Thank you for sharing it. This reflects a lot of my own views, especially the parts about RCT evidence in development being weaker than its reputation suggests (given the medicine halo) and the challenges in estimating costs (and how much that matters). Whenever I dig in for teaching or research I nearly always leave with more uncertainty, which is a bit of a scary feeling.

SummaryBot @ 2026-05-18T16:09 (+2)

Executive summary: The discussion argues that the Evidence Action case reflects broader weaknesses in GiveWell-style evaluation around implementation fidelity, monitoring incentives, and cost modeling, while also highlighting disagreements about how much these failures should update views of Evidence Action specifically.

Key points:

Multiple participants argued that GiveWell and the broader EA ecosystem focus much more on proving interventions work in RCTs than on verifying whether organizations can actually implement them effectively at scale, especially in difficult low-resource environments.
Several contributors said the Dispensers for Safe Water case showed serious failures in implementation and monitoring, since independent verification found chlorine usage had been overstated for years despite tens of millions of dollars in funding.
Participants debated how negatively to update on Evidence Action specifically, with views ranging from small negative updates to claims that the organization’s multi-program structure and limited intervention-specific expertise likely contributed to predictable implementation failures.
Many commenters argued that incentives around cost-effectiveness create underinvestment in monitoring and evaluation, because organizations that spend more on rigorous M&E can appear less cost-effective than competitors cutting corners.
Several participants claimed that cost estimation in EA CEAs receives too little scrutiny relative to effect estimation, despite exchange rates, inflation, overhead allocation, and differing accounting methodologies sometimes shifting cost-effectiveness estimates more than disputed effect-size assumptions.
The discussion also questioned the reliability of the underlying evidence base itself, with some participants arguing that many global health RCTs suffer from observer effects, weak blinding, implementation involvement by researchers, and methodological weaknesses that are often overlooked because RCTs inherit a “gold standard” reputation from pharmaceutical trials.

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.