How do we solve the alignment problem?
By Joe_Carlsmith @ 2025-02-13T18:27 (+28)
This is a linkpost to https://joecarlsmith.substack.com/p/how-do-we-solve-the-alignment-problem
SummaryBot @ 2025-02-14T15:54 (+1)
Executive summary: Solving the AI alignment problem requires developing superintelligent AI that is both beneficial and controllable, avoiding catastrophic loss of human control; this series explores possible paths to achieving that goal, emphasizing the use of AI for AI safety.
Key points:
- Superintelligent AI could bring immense benefits but poses existential risks if it becomes uncontrollable, potentially sidelining or destroying humanity.
- The "alignment problem" is the challenge of ensuring that superintelligent AI remains safe and aligned with human values despite competitive pressures to accelerate its development.
- The author categorizes approaches into "solving" the problem (achieving full safety), "avoiding" it (not developing superintelligent AI), and "handling" it (restricting how such AI is used), arguing that all three deserve consideration.
- A critical factor in safety is the effective use of "AI for AI safety"—leveraging AI for risk evaluation, oversight, and governance to ensure alignment.
- Despite efforts to outline solutions, the author remains deeply concerned about the current trajectory, fearing a lack of adequate control mechanisms and political will.
- The stakes are existential: failure in alignment could lead to the irreversible destruction or subjugation of humanity, making urgent action imperative.
This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.