My Model of EA and AI Safety

By Eva Lu @ 2025-06-24T06:23 (+9)

Being in the AI safety community, sometimes I get the suspicion that we all just assume AI safety is the most important thing ever, without actually having a good model to justify that. Especially now that we're lobbying governments and asking others to make sacrifices for AI safety, it seems important to have a good model and to not lobby for things that sound good but actually aren't.

I also feel like my model of AI risk comes from many different places, and there hasn't been a single document that captures the whole thing, so I wrote this partially to clarify my own thoughts. You may see the big picture a different way than I do, and I'm not sure I could justify why I like my way better than others, so reader beware!

What's the AI doom?

Everyone dies.

I think this is actually not obvious. There are many other risks that people care about, but to me they seem more contentious. AI could lock in bad values, but value lock-in might also be the only way to defeat Moloch and end scarcity. AI might create S-risks, but some people are non-utilitarians and don't care about that. Those other risks are definitely worth researching and I'm glad some EAs are doing that, but I don't have strong opinions on these questions. Notkilleveryoneism is a great compromise.

Where does the doom come from?

In one of his talks, Yoshua Bengio made a nice dissection of AI risk into intelligence, affordances, and goals. Under each category, I've added some agendas that I think mitigate the specific risk.

Intelligence: AIs can't cause catastrophic harm if they don't know how. Agendas for this:

Stop/Pause AI

Affordances: Make it hard for AIs to fight against us, and keep it easy for AIs to help us. Agendas on this:

Applied Interpretability, like lie detectors and ELK.
AI control.
Evals & some of the governance things.
Biosecurity and real world security.

Goals: Make the AI want to help us. Agendas on this:

RLHF and similar prosaic Alignment (possibly using mechinterp).
Superalignment and other training techniques.
Agent foundations.
Cyborgism and other things to keep humans in the loop.

Why might the doom be likely?

There are much better writeups of why AI risk is plausible, such as Eliezer's List of Lethalities or Richard's Risks from Learned Optimization, but I find it useful to also think in terms of markets and market failures.

I think following the market is often a good way to maximize EV because markets have really great ground-truth feedback mechanisms, and maximizing profit often aligns with maximizing value. The big companies have huge demand for prosaic alignment research right now, and anyone who helped that market reach equilibrium faster has probably created lots of value.

Addressing market failures is also a good way to maximize EV because market failures often indicate work that's underrated and neglected. I like to dissect the AI risk into different market failures. Again, under each one I've listed some agendas that I think mitigate the risk.

Lemon markets: We might not know how aligned our AIs are, and then underestimate the risk. Examples here are insurance and fast food.

Evals to improve information.
Governance to improve information.

Positive and negative externalities: Sam Altman might be ok with increasing his personal risk of dying to AI in exchange for a great consumer product, but he might not consider the negative value of everyone else dying. An example here is climate change.

Governance to include externalities.
Social norms against misalignment.

Cognitive biases: Humans are pretty bad at reasoning with small probabilities, and that includes the small probability that the next AI model causes catastrophic harm. Examples here are gambling and terrorism.

Targeting worst-case alignment like AI control, agent foundations, and interpretability.
Targeting long-term alignment

After writing up this big-picture model, I realized I accidentally organized it into importance, tractability, and neglectedness. First, the AI doom. Then, ways to reduce AI doom. Then, why other people won't work on reducing AI doom. I have been a good EA. 😊

SummaryBot @ 2025-06-24T13:21 (+1)

Executive summary: This exploratory post outlines the author’s personal model of AI risk and Effective Altruism’s role in addressing it, emphasizing a structured, cause-neutral approach to AI safety grounded in a mix of high-level doom scenarios, potential mitigations, and systemic market failures, while acknowledging uncertainty and inviting alternative perspectives.

Key points:

The author sees existential AI risk—especially scenarios where “everyone dies”—as the most salient concern, though they recognize alternative AI risks (e.g., value lock-in, S-risks) are also worth investigating.
They categorize AI risk using Yoshua Bengio’s framework of intelligence, affordances, and goals, mapping each to specific mitigation agendas such as interpretability, alignment, and governance.
A core rationale for AI risk plausibility is framed in economic terms: systemic market failures like lemon markets, externalities, and cognitive biases may prevent actors from internalizing catastrophic AI risks.
Practical agendas include pausing AI development, improving evaluations, incentivizing safety research, and developing pro-human social norms to counteract these failures.
The author reflects that their model unintentionally mirrors the EA framework of importance, tractability, and neglectedness—starting from doom, identifying mitigations, and explaining why others don’t prioritize them.
The post is cautious and reflective in tone, aiming more to clarify personal reasoning than to assert universal conclusions, and encourages readers to critique or build on the model.

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.