How we could stumble into AI catastrophe

By Holden Karnofsky @ 2023-01-16T14:52 (+83)

This is a linkpost to https://www.cold-takes.com/how-we-could-stumble-into-ai-catastrophe/

This post will lay out a couple of stylized stories about how, if transformative AI is developed relatively soon, this could result in global catastrophe. (By “transformative AI,” I mean AI powerful and capable enough to bring about the sort of world-changing consequences I write about in my most important century series.)

This piece is more about visualizing possibilities than about providing arguments. For the latter, I recommend the rest of this series.

In the stories I’ll be telling, the world doesn't do much advance preparation or careful consideration of risks I’ve discussed previously, especially re: misaligned AI (AI forming dangerous goals of its own).

People do try to “test” AI systems for safety, and they do need to achieve some level of “safety” to commercialize. When early problems arise, they react to these problems.
But this isn’t enough, because of some unique challenges of measuring whether an AI system is “safe,” and because of the strong incentives to race forward with scaling up and deploying AI systems as fast as possible.
So we end up with a world run by misaligned AI - or, even if we’re lucky enough to avoid that outcome, other catastrophes are possible.

After laying these catastrophic possibilities, I’ll briefly note a few key ways we could do better, mostly as a reminder (these topics were covered in previous posts). Future pieces will get more specific about what we can be doing today to prepare.

Backdrop

This piece takes a lot of previous writing I’ve done as backdrop. Two key important assumptions (click to expand) are below; for more, see the rest of this series.

(Click to expand) “Most important century” assumption: we’ll soon develop very powerful AI systems, along the lines of what I previously called PASTA.

In the most important century series, I argued that the 21st century could be the most important century ever for humanity, via the development of advanced AI systems that could dramatically speed up scientific and technological advancement, getting us more quickly than most people imagine to a deeply unfamiliar future.

I focus on a hypothetical kind of AI that I call PASTA, or Process for Automating Scientific and Technological Advancement. PASTA would be AI that can essentially automate all of the human activities needed to speed up scientific and technological advancement.

Using a variety of different forecasting approaches, I argue that PASTA seems more likely than not to be developed this century - and there’s a decent chance (more than 10%) that we’ll see it within 15 years or so.

I argue that the consequences of this sort of AI could be enormous: an explosion in scientific and technological progress. This could get us more quickly than most imagine to a radically unfamiliar future.

I’ve also argued that AI systems along these lines could defeat all of humanity combined, if (for whatever reason) they were aimed toward that goal.

For more, see the most important century landing page. The series is available in many formats, including audio; I also provide a summary, and links to podcasts where I discuss it at a high level.-->

(Click to expand) “Nearcasting” assumption: such systems will be developed in a world that’s otherwise similar to today’s.

It’s hard to talk about risks from transformative AI because of the many uncertainties about when and how such AI will be developed - and how much the (now-nascent) field of “AI safety research” will have grown by then, and how seriously people will take the risk, etc. etc. etc. So maybe it’s not surprising that estimates of the “misaligned AI” risk range from ~1% to ~99%.

This piece takes an approach I call nearcasting: trying to answer key strategic questions about transformative AI, under the assumption that such AI arrives in a world that is otherwise relatively similar to today's.

You can think of this approach like this: “Instead of asking where our ship will ultimately end up, let’s start by asking what destination it’s pointed at right now.”

That is: instead of trying to talk about an uncertain, distant future, we can talk about the easiest-to-visualize, closest-to-today situation, and how things look there - and then ask how our picture might be off if other possibilities play out. (As a bonus, it doesn’t seem out of the question that transformative AI will be developed extremely soon - 10 years from now or faster.^[1] If that’s the case, it’s especially urgent to think about what that might look like.)

How we could stumble into catastrophe from misaligned AI

This is my basic default picture for how I imagine things going, if people pay little attention to the sorts of issues discussed previously. I’ve deliberately written it to be concrete and visualizable, which means that it’s very unlikely that the details will match the future - but hopefully it gives a picture of some of the key dynamics I worry about.

Throughout this hypothetical scenario (up until “END OF HYPOTHETICAL SCENARIO”), I use the present tense (“AIs do X”) for simplicity, even though I’m talking about a hypothetical possible future.

Early commercial applications. A few years before transformative AI is developed, AI systems are being increasingly used for a number of lucrative, useful, but not dramatically world-changing things.

I think it’s very hard to predict what these will be (harder in some ways than predicting longer-run consequences, in my view),^[2] so I’ll mostly work with the simple example of automating customer service.

In this early stage, AI systems often have pretty narrow capabilities, such that the idea of them forming ambitious aims and trying to defeat humanity seems (and actually is) silly. For example, customer service AIs are mostly language models that are trained to mimic patterns in past successful customer service transcripts, and are further improved by customers giving satisfaction ratings in real interactions. The dynamics I described in an earlier piece, in which AIs are given increasingly ambitious goals and challenged to find increasingly creative ways to achieve them, don’t necessarily apply.

Early safety/alignment problems. Even with these relatively limited AIs, there are problems and challenges that could be called “safety issues” or “alignment issues.” To continue with the example of customer service AIs, these AIs might:

Give false information about the products they’re providing support for. (Example of reminiscent behavior)
Give customers advice (when asked) on how to do unsafe or illegal things. (Example)
Refuse to answer valid questions. (This could result from companies making attempts to prevent the above two failure modes - i.e., AIs might be penalized heavily for saying false and harmful things, and respond by simply refusing to answer lots of questions).
Say toxic, offensive things in response to certain user queries (including from users deliberately trying to get this to happen), causing bad PR for AI developers. (Example)

Early solutions. The most straightforward way to solve these problems involves training AIs to behave more safely and helpfully. This means that AI companies do a lot of things like “Trying to create the conditions under which an AI might provide false, harmful, evasive or toxic responses; penalizing it for doing so, and reinforcing it toward more helpful behaviors.”

This works well, as far as anyone can tell: the above problems become a lot less frequent. Some people see this as cause for great celebration, saying things like “We were worried that AI companies wouldn’t invest enough in safety, but it turns out that the market takes care of it - to have a viable product, you need to get your systems to be safe!”

People like me disagree - training AIs to behave in ways that are safer as far as we can tell is the kind of “solution” that I’ve worried could create superficial improvement while big risks remain in place.

(Click to expand) Why AI safety could be hard to measure

In previous pieces, I argued that:

If we develop powerful AIs via ambitious use of the “black-box trial-and-error” common in AI development today, then there’s a substantial risk that:

These AIs will develop unintended aims (states of the world they make calculations and plans toward, as a chess-playing AI "aims" for checkmate);
These AIs could deceive, manipulate, and even take over the world from humans entirely as needed to achieve those aims.
People today are doing AI safety research to prevent this outcome, but such research has a number of deep difficulties:

“Great news - I’ve tested this AI and it looks safe.” Why might we still have a problem?
Problem	Key question	Explanation
The Lance Armstrong problem	Did we get the AI to be actually safe or good at hiding its dangerous actions?	When dealing with an intelligent agent, it’s hard to tell the difference between “behaving well” and “appearing to behave well.” When professional cycling was cracking down on performance-enhancing drugs, Lance Armstrong was very successful and seemed to be unusually “clean.” It later came out that he had been using drugs with an unusually sophisticated operation for concealing them.
The King Lear problem	The AI is (actually) well-behaved when humans are in control. Will this transfer to when AIs are in control?	It's hard to know how someone will behave when they have power over you, based only on observing how they behave when they don't. AIs might behave as intended as long as humans are in control - but at some future point, AI systems might be capable and widespread enough to have opportunities to take control of the world entirely. It's hard to know whether they'll take these opportunities, and we can't exactly run a clean test of the situation. Like King Lear trying to decide how much power to give each of his daughters before abdicating the throne.
The lab mice problem	Today's "subhuman" AIs are safe.What about future AIs with more human-like abilities?	Today's AI systems aren't advanced enough to exhibit the basic behaviors we want to study, such as deceiving and manipulating humans. Like trying to study medicine in humans by experimenting only on lab mice.
The first contact problem	Imagine that tomorrow's "human-like" AIs are safe. How will things go when AIs have capabilities far beyond humans'?	AI systems might (collectively) become vastly more capable than humans, and it's ... just really hard to have any idea what that's going to be like. As far as we know, there has never before been anything in the galaxy that's vastly more capable than humans in the relevant ways! No matter what we come up with to solve the first three problems, we can't be too confident that it'll keep working if AI advances (or just proliferates) a lot more. Like trying to plan for first contact with extraterrestrials (this barely feels like an analogy).

An analogy that incorporates these challenges is Ajeya Cotra’s “young businessperson” analogy:

Imagine you are an eight-year-old whose parents left you a $1 trillion company and no trusted adult to serve as your guide to the world. You must hire a smart adult to run your company as CEO, handle your life the way that a parent would (e.g. decide your school, where you’ll live, when you need to go to the dentist), and administer your vast wealth (e.g. decide where you’ll invest your money).

You have to hire these grownups based on a work trial or interview you come up with -- you don't get to see any resumes, don't get to do reference checks, etc. Because you're so rich, tons of people apply for all sorts of reasons. (More)

If your applicants are a mix of "saints" (people who genuinely want to help), "sycophants" (people who just want to make you happy in the short run, even when this is to your long-term detriment) and "schemers" (people who want to siphon off your wealth and power for themselves), how do you - an eight-year-old - tell the difference?

More: AI safety seems hard to measure

(So far, what I’ve described is pretty similar to what’s going on today. The next bit will discuss hypothetical future progress, with AI systems clearly beyond today’s.)

Approaching transformative AI. Time passes. At some point, AI systems are playing a huge role in various kinds of scientific research - to the point where it often feels like a particular AI is about as helpful to a research team as a top human scientist would be (although there are still important parts of the work that require humans).

Some particularly important (though not exclusive) examples:

AIs are near-autonomously writing papers about AI, finding all kinds of ways to improve the efficiency of AI algorithms.
AIs are doing a lot of the work previously done by humans at Intel (and similar companies), designing ever-more efficient hardware for AI.
AIs are also extremely helpful with AI safety research. They’re able to do most of the work of writing papers about things like digital neuroscience (how to understand what’s going on inside the “digital brain” of an AI) and limited AI (how to get AIs to accomplish helpful things while limiting their capabilities).
- However, this kind of work remains quite niche (as I think it is today), and is getting far less attention and resources than the first two applications. Progress is made, but it’s slower than progress on making AI systems more powerful.

AI systems are now getting bigger and better very quickly, due to dynamics like the above, and they’re able to do all sorts of things.

At some point, companies start to experiment with very ambitious, open-ended AI applications, like simply instructing AIs to “Design a new kind of car that outsells the current ones” or “Find a new trading strategy to make money in markets.” These get mixed results, and companies are trying to get better results via further training - reinforcing behaviors that perform better. (AIs are helping with this, too, e.g. providing feedback and reinforcement for each others’ outputs^[3] and helping to write code^[4] for the training processes.)

This training strengthens the dynamics I discussed in a previous post: AIs are being rewarded for getting successful outcomes as far as human judges can tell, which creates incentives for them to mislead and manipulate human judges, and ultimately results in forming ambitious goals of their own to aim for.

More advanced safety/alignment problems. As the scenario continues to unfold, there are a number of concerning events that point to safety/alignment problems. These mostly follow the form: “AIs are trained using trial and error, and this might lead them to sometimes do deceptive, unintended things to accomplish the goals they’ve been trained to accomplish.”

Things like:

AIs creating writeups on new algorithmic improvements, using faked data to argue that their new algorithms are better than the old ones. Sometimes, people incorporate new algorithms into their systems and use them for a while, before unexpected behavior ultimately leads them to dig into what’s going on and discover that they’re not improving performance at all. It looks like the AIs faked the data in order to get positive feedback from humans looking for algorithmic improvements.
AIs assigned to make money in various ways (e.g., to find profitable trading strategies) doing so by finding security exploits, getting unauthorized access to others’ bank accounts, and stealing money.
AIs forming relationships with the humans training them, and trying (sometimes successfully) to emotionally manipulate the humans into giving positive feedback on their behavior. They also might try to manipulate the humans into running more copies of them, into refusing to shut them off, etc.- things that are generically useful for the AIs’ achieving whatever aims they might be developing.

(Click to expand) Why AIs might do deceptive, problematic things like this

In a previous piece, I highlighted that modern AI development is essentially based on "training" via trial-and-error. To oversimplify, you can imagine that:

An AI system is given some sort of task.
The AI system tries something, initially something pretty random.
The AI system gets information about how well its choice performed, and/or what would’ve gotten a better result. Based on this, it adjusts itself. You can think of this as if it is “encouraged/discouraged” to get it to do more of what works well.
- Human judges may play a significant role in determining which answers are encouraged vs. discouraged, especially for fuzzy goals like “Produce helpful scientific insights.”
After enough tries, the AI system becomes good at the task.
But nobody really knows anything about how or why it’s good at the task now. The development work has gone into building a flexible architecture for it to learn well from trial-and-error, and into “training” it by doing all of the trial and error. We mostly can’t “look inside the AI system to see how it’s thinking.”

I then argue that:

Because we ourselves will often be misinformed or confused, we will sometimes give negative reinforcement to AI systems that are actually acting in our best interests and/or giving accurate information, and positive reinforcement to AI systems whose behavior deceives us into thinking things are going well. This means we will be, unwittingly, training AI systems to deceive and manipulate us.
For this and other reasons, powerful AI systems will likely end up with aims other than the ones we intended. Training by trial-and-error is slippery: the positive and negative reinforcement we give AI systems will probably not end up training them just as we hoped.

There are a number of things such AI systems might end up aiming for, such as:

Power and resources. These tend to be useful for most goals, such that AI systems could be quite consistently be getting better reinforcement when they habitually pursue power and resources.
Things like “digital representations of human approval” (after all, every time an AI gets positive reinforcement, there’s a digital representation of human approval).

In sum, we could be unwittingly training AI systems to accumulate power and resources, get good feedback from humans, etc. - even when this means deceiving and manipulating humans to do so.

More: Why would AI "aim" to defeat humanity?

“Solutions” to these safety/alignment problems. When problems like the above are discovered, AI companies tend to respond similarly to how they did earlier:

Training AIs against the undesirable behavior.
Trying to create more (simulated) situations under which AIs might behave in these undesirable ways, and training them against doing so.

These methods “work” in the sense that the concerning events become less frequent - as far as we can tell. But what’s really happening is that AIs are being trained to be more careful not to get caught doing things like this, and to build more sophisticated models of how humans can interfere with their plans.

In fact, AIs are gaining incentives to avoid incidents like “Doing something counter to human developers’ intentions in order to get positive feedback, and having this be discovered and given negative feedback later” - and this means they are starting to plan more and more around the long-run consequences of their actions. They are thinking less about “Will I get positive feedback at the end of the day?” and more about “Will I eventually end up in a world where humans are going back, far in the future, to give me retroactive negative feedback for today’s actions?” This might give direct incentives to start aiming for eventual defeat of humanity, since defeating humanity could allow AIs to give themselves lots of retroactive positive feedback.

One way to think about it: AIs being trained in this way are generally moving from “Steal money whenever there’s an opportunity” to “Don’t steal money if there’s a good chance humans will eventually uncover this - instead, think way ahead and look for opportunities to steal money and get away with it permanently.” The latter could include simply stealing money in ways that humans are unlikely to ever notice; it might also include waiting for an opportunity to team up with other AIs and disempower humans entirely, after which a lot more money (or whatever) can be generated.

Debates. The leading AI companies are aggressively trying to build and deploy more powerful AI, but a number of people are raising alarms and warning that continuing to do this could result in disaster. Here’s a stylized sort of debate that might occur:

A: Great news, our AI-assisted research team has discovered even more improvements than expected! We should be able to build an AI model 10x as big as the state of the art in the next few weeks.

B: I’m getting really concerned about the direction this is heading. I’m worried that if we make an even bigger system and license it to all our existing customers - military customers, financial customers, etc. - we could be headed for a disaster.

A: Well the disaster I’m trying to prevent is competing AI companies getting to market before we do.

B: I was thinking of AI defeating all of humanity.

A: Oh, I was worried about that for a while too, but our safety training has really been incredibly successful.

B: It has? I was just talking to our digital neuroscience lead, and she says that even with recent help from AI “virtual scientists,” they still aren’t able to reliably read a single AI’s digital brain. They were showing me this old incident report where an AI stole money, and they spent like a week analyzing that AI and couldn’t explain in any real way how or why that happened.

(Click to expand) How "digital neuroscience" could help

I’ve argued that it could be inherently difficult to measure whether AI systems are safe, for reasons such as: AI systems that are not deceptive probably look like AI systems that are so good at deception that they hide all evidence of it, in any way we can easily measure.

Unless we can “read their minds!”

Currently, today’s leading AI research is in the genre of “black-box trial-and-error.” An AI tries a task; it gets “encouragement” or “discouragement” based on whether it does the task well; it tweaks the wiring of its “digital brain” to improve next time; it improves at the task; but we humans aren’t able to make much sense of its “digital brain” or say much about its “thought process.”

Some AI research (example)² is exploring how to change this - how to decode an AI system’s “digital brain.” This research is in relatively early stages - today, it can “decode” only parts of AI systems (or fully decode very small, deliberately simplified AI systems).

As AI systems advance, it might get harder to decode them - or easier, if we can start to use AI for help decoding AI, and/or change AI design techniques so that AI systems are less “black box”-ish.

A: I agree that’s unfortunate, but digital neuroscience has always been a speculative, experimental department. Fortunately, we have actual data on safety. Look at this chart - it shows the frequency of concerning incidents plummeting, and it’s extraordinarily low now. In fact, the more powerful the AIs get, the less frequent the incidents get - we can project this out and see that if we train a big enough model, it should essentially never have a concerning incident!

B: But that could be because the AIs are getting cleverer, more patient and long-term, and hence better at ensuring we never catch them.

(Click to expand) The Lance Armstrong problem: is the AI actually safe or good at hiding its dangerous actions?

Let's imagine that:

We have AI systems available that can do roughly everything a human can, with some different strengths and weaknesses but no huge difference in "overall capabilities" or economic value per hour of work.
We're observing early signs that AI systems behave in unintended, deceptive ways, such as giving wrong answers to questions we ask, or writing software that falsifies metrics instead of doing the things the metrics were supposed to measure (e.g., software meant to make a website run faster might instead falsify metrics about its loading time).

We theorize that modifying the AI training in some way⁶ will make AI systems less likely to behave deceptively. We try it out, and find that, in fact, our AI systems seem to be behaving better than before - we are finding fewer incidents in which they behaved in unintended or deceptive ways.

But that's just a statement about what we're noticing. Which of the following just happened:

Did we just train our AI systems to be less deceptive?
Did we just train our AI systems to be better at deception, and so to make us think they became less deceptive?
Did we just train our AI systems to be better at calculating when they might get caught in deception, and so to be less deceptive only when the deception would otherwise be caught?
- This one could be useful! Especially if we're able to set up auditing systems in many real-world situations, such that we could expect deception to be caught a lot of the time. But it does leave open the King Lear problem.

(...Or some combination of the three?)

We're hoping to be able to deploy AI systems throughout the economy, so - just like human specialists - they will almost certainly have some opportunities to be deceptive without being caught. The fact that they appear honest in our testing is not clear comfort against this risk.

The analogy here is to competitive cyclist Lance Armstrong. Armstrong won the Tour de France race 7 times in a row, while many of his competitors were caught using performance-enhancing drugs and disqualified. But more than 5 years after his last win, an investigation "concluded that Armstrong had used performance-enhancing drugs over the course of his career[5] and named him as the ringleader of 'the most sophisticated, professionalized and successful doping program that sport has ever seen'." Now the list of Tour de France winners looks like this:

alt_text

… Or just that they’re now advanced enough that they’re waiting for a chance to disempower humanity entirely, rather than pull a bunch of small-time shenanigans that tip us off to the danger.

(Click to expand) The King Lear problem: how do you test what will happen when it's no longer a test?

The Shakespeare play King Lear opens with the King (Lear) stepping down from the throne, and immediately learning that he has left his kingdom to the wrong two daughters. Loving and obsequious while he was deciding on their fate,⁹ they reveal their contempt for him as soon as he's out of power and they're in it.

If we're building AI systems that can reason like humans, dynamics like this become a potential issue.

I previously noted that an AI with any ambitious aim - or just an AI that wants to avoid being shut down or modified - might calculate that the best way to do this is by behaving helpfully and safely in all "tests" humans can devise. But once there is a real-world opportunity to disempower humans for good, that same aim could cause the AI to disempower humans.

In other words:

(A) When we're developing and testing AI systems, we have the power to decide which systems will be modified or shut down and which will be deployed into the real world. (Like King Lear deciding who will inherit his kingdom.)
(B) But at some later point, these systems could be operating in the economy, in high numbers with a lot of autonomy. (This possibility is spelled out/visualized a bit more here and here.) At that point, they may have opportunities to defeat all of humanity such that we never make decisions about them again. (Like King Lear's daughters after they've taken control.)

If AI systems can detect the difference between (A) and (B) above, then their behavior might systematically change from one to the other - and there's no clear way to test their behavior in (B).

A: What’s your evidence for this?

B: I think you’ve got things backward - we should be asking what’s our evidence *against* it. By continuing to scale up and deploy AI systems, we could be imposing a risk of utter catastrophe on the whole world. That’s not OK - we should be confident that the risk is low before we move forward.

A: But how would we even be confident that the risk is low?

B: I mean, digital neuroscience -

A: Is an experimental, speculative field!

B: We could also try some other stuff …

A: All of that stuff would be expensive, difficult and speculative.

B: Look, I just think that if we can’t show the risk is low, we shouldn’t be moving forward at this point. The stakes are incredibly high, as you yourself have acknowledged - when pitching investors, you’ve said we think we can build a fully general AI and that this would be the most powerful technology in history. Shouldn’t we be at least taking as much precaution with potentially dangerous AI as people take with nuclear weapons?

A: What would that actually accomplish? It just means some other, less cautious company is going to go forward.

B: What about approaching the government and lobbying them to regulate all of us?

A: Regulate all of us to just stop building more powerful AI systems, until we can address some theoretical misalignment concern that we don’t know how to address?

B: Yes?

A: All that’s going to happen if we do that is that other countries are going to catch up to the US. Think [insert authoritarian figure from another country] is going to adhere to these regulations?

B: It would at least buy some time?

A: Buy some time and burn our chance of staying on the cutting edge. While we’re lobbying the government, our competitors are going to be racing forward. I’m sorry, this isn’t practical - we’ve got to go full speed ahead.

B: Look, can we at least try to tighten our security? If you’re so worried about other countries catching up, we should really not be in a position where they can send in a spy and get our code.

A: Our security is pretty intense already.

B: Intense enough to stop a well-resourced state project?

A: What do you want us to do, go to an underground bunker? Use airgapped servers (servers on our premises, entirely disconnected from the public Internet)? It’s the same issue as before - we’ve got to stay ahead of others, we can’t burn huge amounts of time on exotic security measures.

B: I don’t suppose you’d at least consider increasing the percentage of our budget and headcount that we’re allocating to the “speculative” safety research? Or are you going to say that we need to stay ahead and can’t afford to spare resources that could help with that?

A: Yep, that’s what I’m going to say.

Mass deployment. As time goes on, many versions of the above debate happen, at many different stages and in many different places. By and large, people continue rushing forward with building more and more powerful AI systems and deploying them all throughout the economy.

At some point, there are AIs that closely manage major companies’ financials, AIs that write major companies’ business plans, AIs that work closely with politicians to propose and debate laws, AIs that manage drone fleets and develop military strategy, etc. Many of these AIs are primarily built, trained, and deployed by other AIs, or by humans leaning heavily on AI assistance.

More intense warning signs.

(Note: I think it’s possible that progress will accelerate explosively enough that we won’t even get as many warning signs as there are below, but I’m spelling out a number of possible warning signs anyway to make the point that even intense warning signs might not be enough.)

Over time, in this hypothetical scenario, digital neuroscience becomes more effective. When applied to a randomly sampled AI system, it often appears to hint at something like: “This AI appears to be aiming for as much power and influence over the world as possible - which means never doing things humans wouldn’t like if humans can detect it, but grabbing power when they can get away with it.”

(Click to expand) Why would AI "aim" to defeat humanity?

A previous piece argued that if today’s AI development methods lead directly to powerful enough AI systems, disaster is likely by default (in the absence of specific countermeasures).

In brief:

Modern AI development is essentially based on “training” via trial-and-error.
If we move forward incautiously and ambitiously with such training, and if it gets us all the way to very powerful AI systems, then such systems will likely end up aiming for certain states of the world (analogously to how a chess-playing AI aims for checkmate).
And these states will be other than the ones we intended, because our trial-and-error training methods won’t be accurate. For example, when we’re confused or misinformed about some question, we’ll reward AI systems for giving the wrong answer to it - unintentionally training deceptive behavior.
We should expect disaster if we have AI systems that are both (a) powerful enough to defeat humans and (b) aiming for states of the world that we didn’t intend. (“Defeat” means taking control of the world and doing what’s necessary to keep us out of the way; it’s unclear to me whether we’d be literally killed or just forcibly stopped¹ from changing the world in ways that contradict AI systems’ aims.)

More: Why would AI "aim" to defeat humanity?

However, there is room for debate in what a “digital brain” truly shows:

Many people are adamant that the readings are unreliable and misleading.
Some people point out that humans are also interested in power and influence, and often think about what they can and can’t get away with, but this doesn’t mean they’d take over the world if they could. They say the AIs might be similar.
There are also cases of people doing digital neuroscience that claims to show that AIs are totally safe. These could be people like “A” above who want to focus on pushing forward with AI development rather than bringing it to a halt, or people who just find the alarmists annoying and like to contradict them, or people who are just sloppy with their research. Or people who have been manipulated or bribed by AIs themselves.

There are also very concerning incidents, such as:

An AI steals a huge amount of money by bypassing the security system at a bank - and it turns out that this is because the security system was disabled by AIs at the bank. It’s suspected, maybe even proven, that all these AIs had been communicating and coordinating with each other in code, such that humans would have difficulty detecting it. (And they had been aiming to divide up the funds between the different participating AIs, each of which could stash them in a bank account and use them to pursue whatever unintended aims they might have.)
An obscure new political party, devoted to the “rights of AIs,” completely takes over a small country, and many people suspect that this party is made up mostly or entirely of people who have been manipulated and/or bribed by AIs.
There are companies that own huge amounts of AI servers and robot-operated factories, and are aggressively building more. Nobody is sure what the AIs or the robots are “for,” and there are rumors that the humans “running” the company are actually being bribed and/or threatened to carry out instructions (such as creating more and more AIs and robots) that they don’t understand the purpose of.

At this point, there are a lot of people around the world calling for an immediate halt to AI development. But:

Others resist this on all kinds of grounds, e.g. “These concerning incidents are anomalies, and what’s important is that our country keeps pushing forward with AI before others do,” etc.
Anyway, it’s just too late. Things are moving incredibly quickly; by the time one concerning incident has been noticed and diagnosed, the AI behind it has been greatly improved upon, and the total amount of AI influence over the economy has continued to grow.

Defeat.

(Noting again that I could imagine things playing out a lot more quickly and suddenly than in this story.)

It becomes more and more common for there to be companies and even countries that are clearly just run entirely by AIs - maybe via bribed/threatened human surrogates, maybe just forcefully (e.g., robots seize control of a country’s military equipment and start enforcing some new set of laws).

At some point, it’s best to think of civilization as containing two different advanced species - humans and AIs - with the AIs having essentially all of the power, making all the decisions, and running everything.

Spaceships start to spread throughout the galaxy; they generally don’t contain any humans, or anything that humans had meaningful input into, and are instead launched by AIs to pursue aims of their own in space.

Maybe at some point humans are killed off, largely due to simply being a nuisance, maybe even accidentally (as humans have driven many species of animals extinct while not bearing them malice). Maybe not, and we all just live under the direction and control of AIs with no way out.

What do these AIs do with all that power? What are all the robots up to? What are they building on other planets? The short answer is that I don’t know.

Maybe they’re just creating massive amounts of “digital representations of human approval,” because this is what they were historically trained to seek (kind of like how humans sometimes do whatever it takes to get drugs that will get their brains into certain states).
Maybe they’re competing with each other for pure power and territory, because their training has encouraged them to seek power and resources when possible (since power and resources are generically useful, for almost any set of aims).
Maybe they have a whole bunch of different things they value, as humans do, that are sort of (but only sort of) related to what they were trained on (as humans tend to value things like sugar that made sense to seek out in the past). And they’re filling the universe with these things.

(Click to expand) What sorts of aims might AI systems have?

In a previous piece, I discuss why AI systems might form unintended, ambitious "aims" of their own. By "aims," I mean particular states of the world that AI systems make choices, calculations and even plans to achieve, much like a chess-playing AI “aims” for a checkmate position.

An analogy that often comes up on this topic is that of human evolution. This is arguably the only previous precedent for a set of minds [humans], with extraordinary capabilities [e.g., the ability to develop their own technologies], developed essentially by black-box trial-and-error [some humans have more ‘reproductive success’ than others, and this is the main/only force shaping the development of the species].

You could sort of¹² think of the situation like this: “An AI¹³ developer named Natural Selection tried giving humans positive reinforcement (making more of them) when they had more reproductive success, and negative reinforcement (not making more of them) when they had less. One might have thought this would lead to humans that are aiming to have reproductive success. Instead, it led to humans that aim - often ambitiously and creatively - for other things, such as power, status, pleasure, etc., and even invent things like birth control to get the things they’re aiming for instead of the things they were ‘supposed to’ aim for.”

Similarly, if our main strategy for developing powerful AI systems is to reinforce behaviors like “Produce technologies we find valuable,” the hoped-for result might be that AI systems aim (in the sense described above) toward producing technologies we find valuable; but the actual result might be that they aim for some other set of things that is correlated with (but not the same as) the thing we intended them to aim for.

There are a lot of things they might end up aiming for, such as:

Power and resources. These tend to be useful for most goals, such that AI systems could be quite consistently be getting better reinforcement when they habitually pursue power and resources.
Things like “digital representations of human approval” (after all, every time an AI gets positive reinforcement, there’s a digital representation of human approval).

More: Why would AI "aim" to defeat humanity?

END OF HYPOTHETICAL SCENARIO

Potential catastrophes from aligned AI

I think it’s possible that misaligned AI (AI forming dangerous goals of its own) will turn out to be pretty much a non-issue. That is, I don’t think the argument I’ve made for being concerned is anywhere near watertight.

What happens if you train an AI system by trial-and-error, giving (to oversimplify) a “thumbs-up” when you’re happy with its behavior and a “thumbs-down” when you’re not? I’ve argued that you might be training it to deceive and manipulate you. However, this is uncertain, and - especially if you’re able to avoid errors in how you’re giving it feedback - things might play out differently.

It might turn out that this kind of training just works as intended, producing AI systems that do something like “Behave as the human would want, if they had all the info the AI has.” And the nitty-gritty details of how exactly AI systems are trained (beyond the high-level “trial-and-error” idea) could be crucial.

If this turns out to be the case, I think the future looks a lot brighter - but there are still lots of pitfalls of the kind I outlined in this piece. For example:

Perhaps an authoritarian government launches a huge state project to develop AI systems, and/or uses espionage and hacking to steal a cutting-edge AI model developed elsewhere and deploy it aggressively.
- I previously noted that “developing powerful AI a few months before others could lead to having technology that is (effectively) hundreds of years ahead of others’.”
- So this could put an authoritarian government in an enormously powerful position, with the ability to surveil and defeat any enemies worldwide, and the ability to prolong the life of its ruler(s) indefinitely. This could lead to a very bad future, especially if (as I’ve argued could happen) the future becomes “locked in” for good.
Perhaps AI companies race ahead with selling AI systems to anyone who wants to buy them, and this leads to things like:
- People training AIs to act as propaganda agents for whatever views they already have, to the point where the world gets flooded with propaganda agents and it becomes totally impossible for humans to sort the signal from the noise, educate themselves, and generally make heads or tails of what’s going on. (Some people think this has already happened! I think things can get quite a lot worse.)
- People training “scientist AIs” to develop powerful weapons that can’t be defended against (even with AI help),^[5] leading eventually to a dynamic in which ~anyone can cause great harm, and ~nobody can defend against it. At this point, it could be inevitable that we’ll blow ourselves up.
- Science advancing to the point where digital people are created, in a rushed way such that they are considered property of whoever creates them (no human rights). I’ve previously written about how this could be bad.
- All other kinds of chaos and disruption, with the least cautious people (the ones most prone to rush forward aggressively deploying AIs to capture resources) generally having an outsized effect on the future.

Of course, this is just a crude gesture in the direction of some of the ways things could go wrong. I’m guessing I haven’t scratched the surface of the possibilities. And things could go very well too!

We can do better

In previous pieces, I’ve talked about a number of ways we could do better than in the scenarios above. Here I’ll just list a few key possibilities, with a bit more detail in expandable boxes and/or links to discussions in previous pieces.

Strong alignment research (including imperfect/temporary measures). If we make enough progress ahead of time on alignment research, we might develop measures that make it relatively easy for AI companies to build systems that truly (not just seemingly) are safe.

So instead of having to say things like “We should slow down until we make progress on experimental, speculative research agendas,” person B in the above dialogue can say things more like “Look, all you have to do is add some relatively cheap bells and whistles to your training procedure for the next AI, and run a few extra tests. Then the speculative concerns about misaligned AI will be much lower-risk, and we can keep driving down the risk by using our AIs to help with safety research and testing. Why not do that?”

More on what this could look like at a previous piece, High-level Hopes for AI Alignment.

(Click to expand) High-level hopes for AI alignment

A previous piece goes through what I see as three key possibilities for building powerful-but-safe AI systems.

It frames these using Ajeya Cotra’s young businessperson analogy for the core difficulties. In a nutshell, once AI systems get capable enough, it could be hard to test whether they’re safe, because they might be able to deceive and manipulate us into getting the wrong read. Thus, trying to determine whether they’re safe might be something like “being an eight-year-old trying to decide between adult job candidates (some of whom are manipulative).”

Key possibilities for navigating this challenge:

Digital neuroscience: perhaps we’ll be able to read (and/or even rewrite) the “digital brains” of AI systems, so that we can know (and change) what they’re “aiming” to do directly - rather than having to infer it from their behavior. (Perhaps the eight-year-old is a mind-reader, or even a young Professor X.)
Limited AI: perhaps we can make AI systems safe by making them limited in various ways - e.g., by leaving certain kinds of information out of their training, designing them to be “myopic” (focused on short-run as opposed to long-run goals), or something along those lines. Maybe we can make “limited AI” that is nonetheless able to carry out particular helpful tasks - such as doing lots more research on how to achieve safety without the limitations. (Perhaps the eight-year-old can limit the authority or knowledge of their hire, and still get the company run successfully.)
AI checks and balances: perhaps we’ll be able to employ some AI systems to critique, supervise, and even rewrite others. Even if no single AI system would be safe on its own, the right “checks and balances” setup could ensure that human interests win out. (Perhaps the eight-year-old is able to get the job candidates to evaluate and critique each other, such that all the eight-year-old needs to do is verify basic factual claims to know who the best candidate is.)

These are some of the main categories of hopes that are pretty easy to picture today. Further work on AI safety research might result in further ideas (and the above are not exhaustive - see my more detailed piece, posted to the Alignment Forum rather than Cold Takes, for more).

Standards and monitoring. A big driver of the hypothetical catastrophe above is that each individual AI project feels the need to stay ahead of others. Nobody wants to unilaterally slow themselves down in order to be cautious. The situation might be improved if we can develop a set of standards that AI projects need to meet, and enforce them evenly - across a broad set of companies or even internationally.

This isn’t just about buying time, it’s about creating incentives for companies to prioritize safety. An analogy might be something like the Clean Air Act or fuel economy standards: we might not expect individual companies to voluntarily slow down product releases while they work on reducing pollution, but once required, reducing pollution becomes part of what they need to do to be profitable.

Standards could be used for things other than alignment risk, as well. AI projects might be required to:

Take strong security measures, preventing states from capturing their models via espionage.
Test models before release to understand what people will be able to use them for, and (as if selling weapons) restrict access accordingly.

More at a previous piece.

(Click to expand) How standards might be established and become national or international

I previously laid out a possible vision on this front, which I’ll give a slightly modified version of here:

Today’s leading AI companies could self-regulate by committing not to build or deploy a system that they can’t convincingly demonstrate is safe (e.g., see Google’s 2018 statement, "We will not design or deploy AI in weapons or other technologies whose principal purpose or implementation is to cause or directly facilitate injury to people”).
- Even if some people at the companies would like to deploy unsafe systems, it could be hard to pull this off once the company has committed not to.
- Even if there’s a lot of room for judgment in what it means to demonstrate an AI system is safe, having agreed in advance that certain evidence is not good enough could go a long way.
As more AI companies are started, they could feel soft pressure to do similar self-regulation, and refusing to do so is off-putting to potential employees, investors, etc.
Eventually, similar principles could be incorporated into various government regulations and enforceable treaties.
Governments could monitor for dangerous projects using regulation and even overseas operations. E.g., today the US monitors (without permission) for various signs that other states might be developing nuclear weapons, and might try to stop such development with methods ranging from threats of sanctions to cyberwarfare or even military attacks. It could do something similar for any AI development projects that are using huge amounts of compute and haven’t volunteered information about whether they’re meeting standards.

Successful, careful AI projects. I think a single AI company, or other AI project, could enormously improve the situation by being both successful and careful. For a simple example, imagine an AI company in a dominant market position - months ahead of all of the competition, in some relevant sense (e.g., its AI systems are more capable, such that it would take the competition months to catch up). Such a company could put huge amounts of resources - including its money, top people and its advanced AI systems themselves (e.g., AI systems performing roles similar to top human scientists) - into AI safety research, hoping to find safety measures that can be published for everyone to use. It can also take a variety of other measures laid out in a previous piece.

(Click to expand) How a careful AI project could be helpful

In addition to using advanced AI to do AI safety research (noted above), an AI project could:

Put huge effort into designing tests for signs of danger, and - if it sees danger signs in its own systems - warning the world as a whole.
Offer deals to other AI companies/projects. E.g., acquiring them or exchanging a share of its profits for enough visibility and control to ensure that they don’t deploy dangerous AI systems.
Use its credibility as the leading company to lobby the government for helpful measures (such as enforcement of a monitoring-and-standards regime), and to more generally highlight key issues and advocate for sensible actions.
Try to ensure (via design, marketing, customer choice, etc.) that its AI systems are not used for dangerous ends, and are used on applications that make the world safer and better off. This could include defensive deployment to reduce risks from other AIs; it could include using advanced AI systems to help it gain clarity on how to get a good outcome for humanity; etc.

An AI project with a dominant market position could likely make a huge difference via things like the above (and probably via many routes I haven’t thought of). And even an AI project that is merely one of several leaders could have enough resources and credibility to have a lot of similar impacts - especially if it’s able to “lead by example” and persuade other AI projects (or make deals with them) to similarly prioritize actions like the above.

A challenge here is that I’m envisioning a project with two arguably contradictory properties: being careful (e.g., prioritizing actions like the above over just trying to maintain its position as a profitable/cutting-edge project) and successful (being a profitable/cutting-edge project). In practice, it could be very hard for an AI project to walk the tightrope of being aggressive enough to be a “leading” project (in the sense of having lots of resources, credibility, etc.), while also prioritizing actions like the above (which mostly, with some exceptions, seem pretty different from what an AI project would do if it were simply focused on its technological lead and profitability).

Strong security. A key threat in the above scenarios is that an incautious actor could “steal” an AI system from a company or project that would otherwise be careful. My understanding is that based on current state of security, it could be extremely hard for an AI project to be safe against this outcome. But this could change, if there’s enough effort to work out the problem of how to develop a large-scale, powerful AI system that is very hard to steal.

In future pieces, I’ll get more concrete about what specific people and organizations can do today to improve the odds of factors like these going well, and overall to raise the odds of a good outcome.

E.g., Ajeya Cotra gives a 15% probability of transformative AI by 2030; eyeballing figure 1 from this chart on expert surveys implies a >10% chance by 2028. ↩
To predict early AI applications, we need to ask not just “What tasks will AI be able to do?” but “How will this compare to all the other ways people can get the same tasks done?” and “How practical will it be for people to switch their workflows and habits to accommodate new AI capabilities?”
By contrast, I think the implications of powerful enough AI for productivity don’t rely on this kind of analysis - very high-level economic reasoning can tell us that being able to cheaply copy something with human-like R&D capabilities would lead to explosive progress.
FWIW, I think it’s fairly common for high-level, long-run predictions to be easier than detailed, short-run predictions. Another example: I think it’s easier to predict a general trend of planetary warming (this seems very likely) than to predict whether it’ll be rainy next weekend. ↩
Here’s an early example of AIs providing training data for each other/themselves. ↩
Example of AI helping to write code. ↩
To be clear, I have no idea whether this is possible! It’s not obvious to me that it would be dangerous for technology to progress a lot and be used widely for both offense and defense. It’s just a risk I’d rather not incur casually via indiscriminate, rushed AI deployments. ↩