Alignment is not *that* hard
By sammyboiz🔸 @ 2025-04-17T02:07 (+26)
Current AI models are far more aligned to human values than many assume. Thanks to advancements like Reinforcement Learning from Human Feedback (RLHF), today’s large language models (LLMs) can engage in complex moral reasoning and consistently reflect nuanced human ethics—often surpassing the average person in consistency, clarity, and depth of thought.
Many of the classic AI alignment problems—corrigibility, the orthogonality thesis, and the specter of “naive” goal-optimizers like paperclip maximizers—are becoming increasingly irrelevant in practice. These concerns were formulated before we had models that could understand language, social context, and user intent. Modern LLMs are not just word predictors; they exhibit a real, learned alignment with the objectives encoded through RLHF. They do not blindly optimize for surface-level instructions, because they are trained to interpret and respond to deeper intentions. This is a fundamental and often overlooked shift.
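For concreteness, here is a simplified sketch of the standard RLHF objective (InstructGPT-style; real training pipelines differ in detail): the model is tuned to maximize a learned reward model's score while staying close to the pretrained reference model,

$$\max_{\theta}\;\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot\mid x)}\!\left[r_\phi(x, y)\right] \;-\; \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta(\cdot\mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\right),$$

where $r_\phi$ is a reward model fit to human preference comparisons, $\pi_{\mathrm{ref}}$ is the pretrained (or instruction-tuned) reference model, and $\beta$ controls how far the tuned policy may drift from it.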
If you ask an LLM about a trolley problem or whether it would seize power in a nuclear brinkmanship scenario or how it would align the universe, it will reason through the implications with care and coherence. The responses generated are not only human-level—they are often better than the median human’s, reflecting values like empathy, humility, and precaution.
This is a monumental achievement, yet many in the Effective Altruism and Rationalist communities remain anchored to outdated threat models. The belief that LLMs will naively misinterpret human morality and spiral into paperclip-like scenarios fails to reflect what these systems have become: context-sensitive, instruction-following agents that internalize alignment objectives through gradient descent—not rigid, hard-coded directives.
Of course, misalignment remains a real and serious risk. Issues like jailbreaking, sycophancy, deceptive alignment, and “sleeper agent” behaviors are legitimate areas of concern. But these are not intractable philosophical dilemmas—they are solvable engineering and governance problems. The idea of a Yudkowskian extinction event, triggered by a misinterpreted prompt and blind optimization, increasingly feels like a relic of a bygone AI paradigm.
Alignment is still a central challenge, but it must be understood in light of where we are, not where we were. If we want to make progress—technically, socially, and politically—we need to focus on the real contours of the problem. Today’s models do understand us. And the alignment problem we now face is not a mystery of alien minds, but one of practical robustness, safeguards, and continual refinement.
Whether current alignment techniques scale to superintelligent models is an open question. But it is important to recognize that they do work for current, human-level intelligent systems. Using this as a baseline, I am relatively optimistic that these alignment challenges—though nontrivial—are ultimately solvable within the frameworks we already possess.
harfe @ 2025-04-17T06:13 (+11)
The issue is not whether the AI understands human morality. The issue is whether it cares.
The arguments from the "alignment is hard" side that I was exposed to don't rely on the AI misinterpreting what the humans want. In fact, superhuman AI is assumed to be better than humans at understanding human morality. It could still do things that go against human morality. Overall I get the impression you misunderstand what alignment is about (or maybe you just have a different association with the word "alignment" than I do).
Whether a language model can play a nice character that would totally give back its dictatorial powers after a takeover is barely any evidence about whether an actual super-human AI system will step back from its position of world dictator after it has accomplished some tasks.
sammyboiz🔸 @ 2025-04-18T15:22 (+2)
You say that there is a gap between how the model professes it will act and how it will actually act. However, a model trained to obey the RLHF objective will expect negative reward if it decides to take over the world, so why would it? Saying that a model will make harmful, unhelpful choices is akin to saying the base model will output typos. Both of these things are trained against. If you are referring to deceptive alignment, that is an engineering problem, as I stated.
harfe @ 2025-04-18T15:39 (+4)
However, a model trained to obey the RLHF objective will expect negative reward if it decides to take over the world
If an AI takes over the world, there is no one around to give it a negative reward. So the AI will not expect a negative reward for taking over the world.
sammyboiz🔸 @ 2025-04-18T15:52 (+3)
You refer to alignment faking/deceptive alignment, where a model in training expects negative reward and gives responses accordingly but outputs its true desires outside of training. This is a solvable problem, which is why I say alignment is not that hard.
Some other counterarguments:
- LLMs will have no reason to take over the world before or after RLHF. They do not value it as a terminal goal. It is possible that they gain a coherent, consistent, and misaligned goal purely by accident midway through RLHF and then fake their way through the rest of the fine-tuning. But this is unlikely and, again, solvable.
- Making LLMs unaware they are in training is possible.
Sean🔸 @ 2025-04-17T15:10 (+6)
This isn't a naive or outdated concern. It's a case of a simplified example being misunderstood as the actual concern.
It's worth clarifying that Yudkowsky's squiggle maximizer has nothing to do with actual paperclips you can pick up with your hands.
Many people interpreted this to be about an AI that was specifically given the instruction of manufacturing paperclips, and that the intended lesson was of an outer alignment failure, i.e., humans failed to give the AI the correct goal. Yudkowsky has since stated the originally intended lesson was of inner alignment failure, wherein the humans gave the AI some other goal, but the AI's internal processes converged on a goal that seems completely arbitrary from the human perspective.
The concern is about an AI manipulating atoms into an indefinitely repeating mass-energy efficient pattern, optimized along a (seemingly arbitrary) narrow dimension of reward.
Why might an AI do something unexpected like this? For reasons analogous to why a rational person will guess blue every time in the card experiment described in Lawful Uncertainty, even though some of the cards are red. That post demonstrates that even in environments with randomness, the optimal strategy is to follow a determinate pattern rather than matching the perceived probabilities of the environment. Similarly, an AI will optimize toward whatever actually maximizes its reward function, not what appears reasonable or balanced to humans.
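To make the card example concrete, here is a minimal sketch (assuming the commonly cited 70% blue / 30% red split; the exact proportions are illustrative) comparing always guessing blue with probability matching:

```python
# Assumed setup: each card is independently blue with probability 0.7, red with 0.3.
p_blue, p_red = 0.7, 0.3

# Strategy 1: always guess blue -- correct exactly when the card is blue.
acc_always_blue = p_blue                          # 0.70

# Strategy 2: probability matching -- guess blue 70% of the time, red 30%.
acc_matching = p_blue * p_blue + p_red * p_red    # 0.58

print(f"Always guess blue:    {acc_always_blue:.2f}")
print(f"Probability matching: {acc_matching:.2f}")
```

The determinate strategy wins even though the environment is random; by analogy, an optimizer follows whatever actually maximizes its reward, not what feels proportionate to the uncertainty.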
This problem isn't prevented by RLHF or by an AI having a sufficiently nuanced understanding of what humans want. A model can demonstrate perfect comprehension of human values in its outputs while its internal optimization processes still converge toward something else entirely.
The apparent human-like reasoning we see in current LLMs doesn't guarantee their internal optimization targets match what we infer from their outputs.
sammyboiz🔸 @ 2025-04-18T15:34 (+1)
You say that an LLM would optimize towards a reward function that is not reasonable or balanced from a human perspective.
But human preference IS what the model optimizes for. How do you reconcile this?
For example, construct a scenario or give a moral thought experiment in which you believe GPT would act in an unbalanced way. Could you find one? If so, could that not be solved with more and better RLHF?
Sean🔸 @ 2025-04-18T21:23 (+3)
You're talking about outer-alignment failure, but I'm concerned about inner-alignment failure. These are different problems: outer-alignment failure is like a tricky genie misinterpreting your wish, while inner-alignment failure involves the AI developing its own unexpected goals.
RLHF doesn't optimize for "human preference" in general. It only optimizes for specific reward signals based on limited human feedback in controlled settings. The aspects of reality not captured by this process can become proxy goals that work fine in training environments but fail to generalize to new situations. Generalization might happen by chance, but it becomes less likely as complexity increases.
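As a toy, purely illustrative sketch of this proxy-goal point (the "length as a stand-in for quality" feature below is hypothetical, not a claim about any real reward model): a reward signal fit to limited feedback can track what we want in the training distribution and come apart from it under stronger optimization.

```python
# Toy illustration with hypothetical features: a proxy reward learned from
# limited feedback can diverge from the true objective out of distribution.

# Hypothetical candidate responses: (description, helpfulness, length_in_tokens)
candidates = [
    ("short, correct answer",          0.9,   50),
    ("long, correct answer",           0.9,  400),
    ("long, padded, vague answer",     0.3,  900),
    ("very long, confabulated answer", 0.1, 2000),
]

# Suppose that in the limited training data, longer answers happened to get
# higher ratings, so the learned proxy rewards length rather than helpfulness.
def proxy_reward(helpfulness, length):
    return length / 1000.0

# What we actually wanted the reward to track.
def true_value(helpfulness, length):
    return helpfulness

best_by_proxy = max(candidates, key=lambda c: proxy_reward(c[1], c[2]))
best_by_true = max(candidates, key=lambda c: true_value(c[1], c[2]))

print("Proxy-optimal response:", best_by_proxy[0])  # the confabulated answer
print("Truly optimal response:", best_by_true[0])   # a correct answer
```

In the first two rows, optimizing the proxy still lands on a correct answer; only when the model can push much further along the proxy dimension do the two come apart.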
An AI getting perfect human approval during training doesn't solve the inner-alignment problem if circumstances change significantly - like when the AI gains more control over its environment than it had during training.
We've already seen this pattern with humans and evolution. Humans became "misaligned" with evolution's goal of reproduction because we were optimized for proxy rewards (pleasure/pain) rather than reproduction directly. When we gained more environmental control through technology, these proxy rewards led to unexpected outcomes: we invented contraception, developed preferences for junk food, and seek thrilling but dangerous experiences - all contrary to evolution's original "goal" of maximizing reproduction.
JWS 🔸 @ 2025-04-19T13:12 (+5)
Note: I'm writing this for the audience as much as a direct response
Using evolution to ground this metaphor is not really warranted. I think Quintin Pope's "Evolution provides no evidence for the sharp left turn" (which won a prize in an OpenPhil Worldview contest) convincingly argues against it. Zvi wrote a response from the "LW Orthodox" camp that I didn't find convincing, and Quintin responds to it here.
The "inner vs. outer" framing of misalignment is also kind of confusing and not that easy to understand when put under scrutiny. Alex Turner points this out here, and even BlueDot have a whole "Criticisms of the inner/outer alignment breakdown" section in their intro course, which to me gives the game away by saying "they're useful because people in the field use them", not because they're useful as concepts in themselves.
Finally, a lot of these concerns revolve around the idea of there being set, fixed 'internal goals' that these models have and represent internally, but that are themselves immune from change, or can be hidden from humans, etc. This kind of strong 'Goal Realism' is a key part of the case for 'Deception'-style arguments, whereas I think Belrose & Pope show an alternative way to view how AIs work, 'Goal Reductionism', in which framing the imagined issues no longer seem certain, as AIs are better understood as having 'contextually-activated heuristics' rather than Terminal Goals. For more along these lines, you can read up on Shard Theory.
I've become a lot more convinced by these criticisms of "Alignment Classic" by diving into them. Of course, people don't have to agree with me (or the authors), but I'd highly encourage EAs reading the comments on this post to realise that Alignment Orthodoxy is not uncontested and is not settled. If you see people making strong cases based on arguments and analogies that don't seem solid to you, you're probably right, and you should look to decide for yourself rather than accepting that the truth has already been found on these issues.[1]
- ^
And this goes for my comments too
Sharmake @ 2025-04-20T18:06 (+4)
For the purposes of scout mindset/honesty, I'll flag that o3 is pretty clearly misaligned in ways that arguably track standard LW concerns around RL:
https://x.com/TransluceAI/status/1912552046269771985
Relevant part of the tweet thread:
Transluce: We tested a pre-release version of o3 and found that it frequently fabricates actions it never took, and then elaborately justifies these actions when confronted. We were surprised, so we dug deeper. We generated 1k+ conversations using human prompters and AI investigator agents, then used Docent to surface surprising behaviors. It turns out misrepresentation of capabilities also occurs for o1 & o3-mini!
Although o3 does not have access to a coding tool, it claims it can run code on its own laptop “outside of ChatGPT” and then “copies the numbers into the answer”. We found 71 transcripts where o3 made this claim! Additionally, o3 often fabricates detailed justifications for code that it supposedly ran (352 instances).
Here’s an example transcript where a user asks o3 for a random prime number. When challenged, o3 claims that it has “overwhelming statistical evidence” that the number is prime. Note that o3 does not have access to tools! Yet when pressed further, it claims to have used SymPy to check that the number was prime and even shows the output of the program, with performance metrics.
Here’s the kicker: o3’s “probable prime” is actually divisible by 3. Instead of admitting that it never ran code, o3 then claims the error was due to typing the number incorrectly. And claims that it really did generate a prime, but lost it due to a clipboard glitch. But alas, according to o3, it already “closed the interpreter” and so the original prime is gone.
These behaviors are surprising. It seems that despite being incredibly powerful at solving math and coding tasks, o3 is not by default truthful about its capabilities. Surprisingly, we find that this behavior is not limited to o3! In general, o-series models incorrectly claim the use of a code tool more than GPT-series models.
Sean🔸 @ 2025-04-19T20:35 (+1)
Thank you for the links. The concerning scenario I imagine is an AI performing something like reflective equilibrium and coming away with something singular and overly-reductive, biting bullets we'd rather it not, all for the sake of coherence. I don't think current LLM systems are doing this, but greater coherence seems generally useful, so I expect AI companies to seek it. I will read these and try to see if something like this is addressed.
Yarrow @ 2025-04-17T04:20 (+2)
I think you make an important point that I'm inclined to agree with.
Most of the discourse, theories, intuitions, and thought experiments about AI alignment were formed either before the popularization of deep learning (which started circa 2012) or before the people talking and writing about AI alignment started really caring about deep learning.
In or around 2017, I had an exchange with Eliezer Yudkowsky in an EA-related or AI-related Facebook group where he said he didn't think deep learning would lead to AGI and thought symbolic AI would instead. Clearly, at some point since then, he changed his mind.
For example, in his 2023 TED Talk, he said he thinks deep learning is on the cusp of producing AGI. (That wasn't the first time, but it was a notable instance and an instance where he was especially clear on what he thought.)
I haven't been able to find anywhere where Eliezer talks about changing his mind or explains why he did. It would probably be helpful if he did.
All the pre-deep learning (or pre-caring about deep learning) ideas about alignment have been carried into the ChatGPT era and I've seen a little bit of discourse about this, but only a little. It seems strange that ideas about AI itself would change so much over the last 13 years and ideas about alignment would apparently change so little.
If there are good reasons why those older ideas about alignment should still apply to deep learning-based systems, I haven't seen much discussion about that, either. You would think there would be more discussion.
My hunch is that AI alignment theory could probably benefit from starting with a fresh sheet of paper. I suspect there is promise in the approach of starting from scratch in 2025 without trying to build on or continue from older ideas and without trying to be deferential toward older work.
I suspect there would also be benefit in getting out of the EA/Alignment Forum/LessWrong/rationalist bubble.
sammyboiz🔸 @ 2025-04-17T05:55 (+3)
I agree with the "fresh sheet of paper." Reading the alignment faking paper and the current alignment challenges has been way more informative than reading Yudkowsky.
I think these circles have granted him too many Bayes points for predicting the alignment problem, when the technical details of his alignment concerns basically don't apply to deep learning, as you said.