AI’s goals may not match ours

By Vishakha Agrawal, Algon, StevenKaas @ 2025-05-28T12:07 (+2)

Context: This is a linkpost for https://aisafety.info/questions/NM3I/6:-AI%E2%80%99s-goals-may-not-match-ours 

This is an article in the new intro to AI safety series from AISafety.info. We'd appreciate any feedback. The most up-to-date version of this article is on our website.


Making an AI’s goals match our intentions is called the alignment problem.

There’s some ambiguity in the term “alignment”. For example, when people talk about “AI alignment” in the context of present-day AI systems, they generally mean controlling observable behaviors: Can we make it impossible for the AI to say ethnic slurs? Or to advise you on how to secretly dispose of a corpse? Although such restrictions are sometimes circumvented with “jailbreaks”, on the whole companies manage to avoid AI outputs that could harm people or threaten their brand reputation.

But "alignment" in smarter-than-human systems is a different question. For such systems to remain safe in extreme cases — if they become so smart that we can’t check their work and maybe can’t even keep them in our control — they'll have to value the right things at a deep level, based on well-grounded concepts that don’t lose their intended meanings even far outside the circumstances they were trained for.

Making that happen is an unsolved problem. Arguments about possible solutions to alignment quickly become complex and technical. But as we’ll see later in this introduction, many people who have studied AI and AI alignment in depth think we may fail to find a solution, and that failure could result in catastrophe.

Some of the main difficulties are:

Finally, on a higher level, the problem is hard because of features of the strategic landscape, which we’ll discuss further at the end of this introduction. One such feature is that we may get only one chance to align a powerful AI, rather than trying over and over until we get it right. That’s because superintelligent systems that end up with goals different from ours may work against us to achieve those goals.
