Why would AI companies use human-level AI to do alignment research?
By MichaelDickens @ 2025-04-25T19:12 (+16)
Cross-posted from my website.
Many plans for how to safely build superintelligent AI have a critical section that goes like this:
- Develop AI that's powerful enough to do AI research, but not yet powerful enough to pose an existential threat.
- Use it to assist with alignment research, thus greatly accelerating the pace of work—hopefully enough to solve all alignment problems.
You could call this process "alignment bootstrapping".
This is a central feature of DeepMind's plan (see "Amplified oversight"), Anthropic's plan (see "Scalable Oversight"), and independent plans written by Sam Bowman (an AI safety manager at Anthropic), Joshua Clymer (a researcher at Redwood Research), and Marius Hobbhahn (CEO of Apollo Research).
There are various reasons why alignment bootstrapping could fail[1] even if implemented well, and some of those plans acknowledge this. But I'm also concerned about whether alignment bootstrapping will be implemented at all.
When the time comes, will AI companies actually spend their resources on alignment bootstrapping?
When AI companies have human-level AI systems, will they use them for alignment research, or will they use them (mostly) to advance capabilities instead?
AI companies currently employ many human-level humans, and use a small percentage of them to do alignment research. If it makes sense for them to use most of their human-level AIs to do alignment research, wouldn't it also make sense to use most of their human researchers to do alignment research?
But they don't do that. Most of their human researchers work on advancing AI capabilities.
It's more likely that they will use human-level AIs the same way they use human researchers: almost all of them working on accelerating capabilities, with a small minority working on safety. Which probably means capabilities outpace safety, which probably means we die.
Some companies argue that they need to advance capabilities right now to stay competitive. Perhaps that's true. But consider what the world will look like once the first company develops human-level AI. At that point, the #2 company will be at most a few months behind. So the leading company will once again say, "Sorry, we can't use our human-level AI to work on alignment; we have to keep advancing capabilities to stay ahead." And they will continue saying this right up until their AI is powerful enough to kill everyone.
Counterpoint: AI companies would probably argue that present-day AIs are far from being dangerous, so pushing capabilities now is acceptable; human-level AIs, by contrast, will be nearly dangerous, so at that point it will be too risky to keep advancing capabilities.
I would be more inclined to believe this if AI companies weren't already behaving so recklessly.[2] If you're going to prioritize safety over capabilities when the tradeoff becomes more critical, you should prove it to the world by prioritizing safety over capabilities right now.
Perhaps the ideal perfectly-altruistic AI company would indeed push capabilities right now and then switch to safety at the critical time,[3] but I see little reason to believe that that's what any of the real-life AI companies are going to do.
By my reading, none of the plans put probabilities on how serious these failure modes are. My guess is that, if alignment bootstrapping is implemented the way these plans typically describe, there's a greater than 50% chance that we die.
1. The purpose of this essay isn't to talk about the implementation problems with alignment bootstrapping, but in brief: if your alignment-researcher AI is smarter than you, and you don't know how to align AI yet, then you can't trust that your AI is doing good work. People who propose bootstrapping are usually aware of this problem. They have preliminary ideas for how they will evaluate the work of an AI that's smarter than them, coupled with bafflingly high confidence that their ideas will work. (Zvi proposed a test: "Can you get a method whereby the Man On The Street can use AI help to code and evaluate graduate level economics outputs and the quality of poetry and so on in ways that would translate to this future parallel situation?") ↩︎
2. I wanted to provide a link to a well-sourced and well-reasoned list of reckless behaviors by AI companies. I found no such list, so instead this is a link to a section of a post I wrote that includes numerous examples of reckless behavior. ↩︎
3. I don't actually think this is what safety-minded AI companies should do. I think they should spend less on capabilities and more on safety. But I am sympathetic to the position that they should temporarily focus on advancing capabilities. ↩︎
Ryan Greenblatt @ 2025-04-25T20:32 (+8)
> When AI companies have human-level AI systems, will they use them for alignment research, or will they use them (mostly) to advance capabilities instead?
It's not clear this is a crux for the automating alignment research plan to work out.
In particular, suppose an AI company currently spends 5% of its resources on alignment research and will continue spending 5% once it has human-level systems. You might think this suffices for alignment to keep pace with capabilities, since the alignment labor force will get more powerful just as alignment gets more difficult (and more important) at higher levels of capability.
This doesn't mean the plan will necessarily work; that depends on the relative difficulty of advancing capabilities vs. alignment. I'd naively guess that the probability of success just keeps going up the more resources you use for alignment.
There are some reasons for thinking automation of labor is particularly compelling in the alignment case relative to the capabilities case:
- There might be scalable solutions to alignment that effectively resolve the research problem indefinitely, while I expect that capabilities work looks more like continuously making better and better algorithms.
- Safety research might benefit relatively more from labor (rather than compute) when compared to capabilities. Two reasons for this:
  - Safety currently seems relatively more labor-bottlenecked.
  - We can in principle solve a large fraction of safety/alignment with fully theoretical safety research that requires no compute, while it seems harder to do purely theoretical capabilities research.
I do think that pausing further capabilities progress for even just a few years once we have human-ish-level AIs, and focusing on safety during that time, would massively improve the situation. This currently seems unlikely to happen.
Another way to put this is that automating alignment research is a response in the following dialogue:
Bob: We won't have enough time to solve alignment because AI takeoff will go very fast due to AIs automating AI R&D (and AI labor generally accelerating AI progress through other mechanisms).
Alice: Actually, as AIs accelerate AI R&D, they could also be accelerating alignment work, so it's not clear that faster AI progress due to AI R&D acceleration makes the situation very different. As AI progress speeds up, alignment progress might speed up by a similar amount. Or it could speed up by a greater amount, since compute bottlenecks hit capabilities harder.