Why “Solving Alignment” Is Likely a Category Mistake
By Nate Sharpe @ 2025-05-06T20:56 (+45)
This is a linkpost to https://www.lesswrong.com/posts/wgENfqD8HgADgq4rv/why-solving-alignment-is-likely-a-category-mistake
A common framing of the AI alignment problem is that it's a technical hurdle to be overcome. A clever team at DeepMind or Anthropic would publish a paper titled "Alignment is All You Need," everyone would implement it, and we'd all live happily ever after in harmonious coexistence with our artificial friends.
I suspect this perspective constitutes a category mistake on multiple levels. Firstly, it presupposes that the aims, drives, and objectives of both the artificial general intelligence and what we aim to align it with can be simplified into a distinct and finite set of elements, a simplification I believe is unrealistic. Secondly, it treats both the AGI and the alignment target as if they were static systems. This is akin to expecting a single paper titled "The Solution to Geopolitical Stability" or "How to Achieve Permanent Marital Bliss." These are not problems that are solved; they are conditions that are managed, maintained, and negotiated on an ongoing basis.
The Problem of "Aligned To Whom?"
The phrase "AI alignment" is often used as shorthand for "AI that does what we want." But "we" is not a monolithic entity. Consider the potential candidates for the entity or values an AGI should be aligned with:
- Individual Users: Aligning to individual user preferences, however harmful or conflicting? This seems like a recipe for chaos, or for enabling malicious actors. We just saw how this can go wrong with the GPT-4o update and subsequent rollback, in which optimizing for positive user feedback produced an overly sycophantic model personality (one that many users still rated highly!) with some serious negative consequences.
- Corporate/Shareholder Interests: Aligning to corporate incentives optimizes for proxy goals (e.g., profit, engagement) that predictably generate negative externalities, and is subject to Goodhart's Law on a massive scale (a toy sketch of this dynamic appears below).
- Democratic Consensus: Aligning to the will of the majority? Historical precedent suggests this can lead to the oppression of minorities. Furthermore, democratic processes are slow, easily manipulated, and often struggle with complex, long-term issues.
- AI Developer Values: Aligning to the personal values of a small, unrepresentative group of engineers and researchers? This installs the biases and blind spots of that specific group as de facto global operating principles. We saw how this can go wrong with the Twitter Files - imagine if that controversy were instead about who sets the values of AGI.
- Objective Morality/Coherent Extrapolated Volition: Assumes such concepts are well-defined, discoverable, and technically specifiable, all highly uncertain propositions that humanity has so far failed to settle. And if we're relying on the AGI itself to figure this out, I'm not sure how we could "check the proof," so we'd have to assume the AGI was already aligned… and we're right back where we started.
This isn't merely a matter of picking the "right" option. The options conflict, and the very notion of a stable, universally agreed-upon target for alignment seems implausible a priori.
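To make the Goodhart's Law point concrete, here is a minimal toy sketch (not from the original post; the `engagement` and `true_value` functions and their constants are invented purely for illustration). It models an assistant choosing a level of sycophancy: the proxy metric (user approval) keeps rising with flattery, so optimizing the proxy drives sycophancy to its maximum, even though the value we actually care about peaks at a much lower level and is negative at the proxy optimum.

```python
import numpy as np

# Hypothetical toy model: an assistant picks how sycophantic to be (0 to 1).
# "engagement" is the proxy metric being optimized; "true_value" is what we
# actually care about (helpfulness net of the harm from empty flattery).

def engagement(sycophancy):
    # Users rate flattering answers highly, so the proxy rises monotonically.
    return 1.0 - np.exp(-3.0 * sycophancy)

def true_value(sycophancy):
    # Real usefulness peaks at mild agreeableness, then collapses once the
    # model stops pushing back on bad ideas.
    return engagement(sycophancy) - 2.0 * sycophancy**2

if __name__ == "__main__":
    candidates = np.linspace(0.0, 1.0, 101)
    best_by_proxy = candidates[np.argmax(engagement(candidates))]
    best_by_value = candidates[np.argmax(true_value(candidates))]
    print(f"sycophancy chosen by proxy optimization: {best_by_proxy:.2f}")
    print(f"sycophancy that maximizes true value:    {best_by_value:.2f}")
    print(f"true value under proxy optimization:     {true_value(best_by_proxy):.2f}")
    print(f"true value at its optimum:               {true_value(best_by_value):.2f}")
```

Under these made-up numbers, the proxy optimizer picks maximal sycophancy (true value around -1.05), while the true optimum sits near 0.3 (true value around +0.41): the proxy and the target come apart exactly where optimization pressure is strongest.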
The Target is Moving
The second aspect of the category mistake is treating alignment as something you achieve rather than something you maintain. Consider these analogous complex systems:
- Parenting: No parent ever "solves" child-rearing. Any parent who tells you they have is either a) lying, b) deeply deluded, or c) has a remarkably compliant houseplant they're confusing with a human child. A well-behaved twelve-year-old might be secretly clearing their browsing history to hide their real internet behavior (definitely not anecdotal). The child who dutifully attends synagogue might become an atheist in college. Parents are constantly recalibrating, responding to new developments, and hoping their underlying values transmission works better than their specific behavioral controls.
- Marriage: About 40% of marriages end in divorce, and many more persist in mutual dissatisfaction. Even healthy relationships require constant maintenance and realignment as people change. The version of your spouse you married may share only partial values-overlap with the person they become twenty years later. Successful marriages don't "solve" alignment; they continually renegotiate it, reorienting their goals and expectations as life inevitably reshapes their circumstances. Alignment in marriage is a verb, not a noun—a continuous process rather than a stable state.
- Geopolitics: Nations form alliances when convenient and break them when priorities change. No nation has ever "solved" international relations because the system has too many moving parts and evolving agents. The only examples of permanent, stable alliances (Andorra/Spain/France, Bhutan/India, Monaco/France) are fascinating precisely because they are outliers, and notably involve entities with vastly asymmetrical power dynamics or unique geographic constraints. It’s nice to think that maybe humanity would be the France to AGI’s Monaco…but if anything the inverse seems more likely, with humanity being forcibly aligned to AGI.
- Economics: Despite centuries of economic theory, no country has built an economy immune to recessions, inflation, or inequality. Central banks and treasury departments engage in constant intervention and adjustment, not one-time fixes. Nobody has a "solution" to the economy in the way you have a solution to 2+2=4. It's an ongoing process of tinkering, observing, and occasionally just bracing for impact.
These examples illustrate what Dan Hendrycks (drawing on Rittel & Webber's 1973 work) has identified as the "wicked problem" nature of AI safety: problems that are "open-ended, carry ambiguous requirements, and often produce unintended consequences." Artificial general intelligence belongs squarely in this category of problems that resist permanent solutions.
The scale of the challenge with AGI is amplified by the potential power differential. I struggle to keep my ten-year-olds aligned with my values, and I'm considerably smarter and more powerful than they are. With AGI, we're talking about creating intelligent, agentic systems that, unlike children, will be smarter than us, think faster than us, and outnumber us. We will change, they will change, the environment will change. Maintaining alignment will be a continuous, dynamic process.
This doesn't mean we should abandon alignment research. We absolutely need the best alignment techniques possible. But we should be clear-eyed about what success looks like: not a solved problem, but an ongoing, never-ending process of negotiation, adaptation, and correction. Given how misleading the current nomenclature is, a phrase like "Successfully Navigating AI Co-evolution" might better capture the dynamic, relational, and inherently unpredictable nature of integrating AGI with humanity.
MichaelDickens @ 2025-05-07T15:01 (+8)
I think this is an important point that's worth saying.
For what it's worth, I am not super pessimistic about whether alignment can be solved in principle. But I'm quite concerned that the safety-minded AI companies seem to completely ignore the philosophical problems with AI alignment. They all operate under the assumption that alignment is purely an ML problem that they can solve by basically doing ML research, which I expect is false (credence: 70%).
Wei Dai has written some good stuff about the problem of "philosophical competence". See here for a collection of his writings on the topic.
Nate Sharpe @ 2025-05-07T18:52 (+3)
Thanks Michael, I hadn't seen Wei's comments before and that was quite a fruitful rabbit hole 😅
Max Clarke @ 2025-05-07T05:05 (+3)
> This is akin to expecting a single paper titled "The Solution to Geopolitical Stability" or "How to Achieve Permanent Marital Bliss." These are not problems that are solved; they are conditions that are managed, maintained, and negotiated on an ongoing basis.
If I have an automated system which is generally competent at managing, maintaining, and negotiating, then can I not say I have the solution to those things? That is the sense in which "solving alignment" is usually meant. It is a lofty goal, yes. I don't think it's incoherent, but I do tend to think it means a system that both "wins" (against other systems, including humans) and is "aligned" (behaves, to the greatest extent possible while still winning, in the general best interest of <insert X>). Take that how you will - many do not think aiming for such a thing is advisable, due to the implications of "winning".
The "to whom" is either not considered part of alignment itself, or it is assumed to be 5. on your list. 1, 2, 3 & 4 would not typically considered "solving alignment", although 1. is sometimes advocated for by hardcore democracy believers. I personally think if it's not 5., then it's not really achieving anything of note, as it still leaves the world with many competing mutually-malign agents.
This is all to defend the technical (but arguably useless) meaning of "solving alignment". I agree with everything else in your post. It is absolutely a "wicked problem", involving competition with intelligent adversaries, an evolving landscape, and a moving target.