AI’s goals may not match ours

By Vishakha Agrawal, Algon, StevenKaas @ 2025-05-28T12:07 (+2)

Context: This is a linkpost for https://aisafety.info/questions/NM3I/6:-AI%E2%80%99s-goals-may-not-match-ours 

This is an article in the new intro to AI safety series from AISafety.info. We'd appreciate any feedback. The most up-to-date version of this article is on our website.


Making an AI’s goals match our intentions is called the alignment problem.

There’s some ambiguity in the term “alignment”. For example, when people talk about “AI alignment” in the context of present-day AI systems, they generally mean controlling observable behaviors: Can we make it impossible for the AI to say ethnic slurs? Or to advise you on how to secretly dispose of a corpse? Although such restrictions are sometimes circumvented with “jailbreaks”, on the whole companies manage to avoid AI outputs that could harm people or threaten their brand reputation.

But "alignment" in smarter-than-human systems is a different question. For such systems to remain safe in extreme cases — if they become so smart that we can’t check their work and maybe can’t even keep them in our control — they'll have to value the right things at a deep level, based on well-grounded concepts that don’t lose their intended meanings even far outside the circumstances they were trained for.

Making that happen is an unsolved problem. Arguments about possible solutions to alignment quickly become complex and technical. But as we’ll see later in this introduction, many people who have studied AI and AI alignment in depth think we may fail to find a solution, and that failure could result in catastrophe.

Some of the main difficulties are:

Finally, on a higher level, the problem is hard because of features of the strategic landscape, which we’ll discuss further at the end of this introduction. One such feature is that we may get only one chance to align a powerful AI, rather than trying over and over until we get it right. That’s because superintelligent systems that end up with goals different from ours may work against us to achieve those goals.
