Racing through a minefield: the AI deployment problem

By Holden Karnofsky @ 2022-12-31T21:44 (+79)

This is a linkpost to https://www.cold-takes.com/racing-through-a-minefield-the-ai-deployment-problem/

In previous pieces, I argued that there's a real and large risk of AI systems' developing dangerous goals of their own and defeating all of humanity - at least in the absence of specific efforts to prevent this from happening. I discussed why it could be hard to build AI systems without this risk and how it might be doable.

The “AI alignment problem” refers[1] to a technical problem: how can we design a powerful AI system that behaves as intended, rather than forming its own dangerous aims? This post is going to outline a broader political/strategic problem, the “deployment problem”: if you’re someone who might be on the cusp of developing extremely powerful (and maybe dangerous) AI systems, what should you … do?

The basic challenge is this:

My current analogy for the deployment problem is racing through a minefield: each player is hoping to be ahead of others, but anyone moving too quickly can cause a disaster. (In this minefield, a single mine is big enough to endanger all the racers.)

This post gives a high-level overview of how I see the kinds of developments that can lead to a good outcome, despite the “racing through a minefield” dynamic. It is distilled from a more detailed post on the Alignment Forum.

First, I’ll flesh out how I see the challenge we’re contending with, based on the premises above.

Next, I’ll list a number of things I hope that “cautious actors” (AI companies, governments, etc.) might do in order to prevent catastrophe.

Many of the actions I’m picturing are not the kind of things normal market and commercial incentives would push toward, and as such, I think there’s room for a ton of variation in whether the “racing through a minefield” challenge is handled well. Whether key decision-makers understand things like the case for misalignment risk (and in particular, why it might be hard to measure) - and are willing to lower their own chances of “winning the race” to improve the odds of a good outcome for everyone - could be crucial.

The basic premises of “racing through a minefield”

This piece is going to lean on previous pieces and assume all of the following things:

So, one can imagine a scenario where some company is in the following situation:

That seems like a tough enough, high-stakes-enough, and likely enough situation that it’s worth thinking about how one is supposed to handle it.

One simplified way of thinking about this problem:

In this setup, cautious actors need to move fast enough that they can’t be overpowered by others’ AI systems, but slowly enough that they don’t cause disaster themselves. Hence the “racing through a minefield” analogy.

What success looks like

In a non-Cold-Takes piece, I explore the possible actions available to cautious actors to win the race through a minefield. This section will summarize the general categories - and, crucially, why we shouldn’t expect that companies, governments, etc. will do the right thing simply from natural (commercial and other) incentives.

I’ll be going through each of the following:

Alignment (charting a safe path through the minefield[2])

I previously wrote about some of the ways we might reduce the dangers of advanced AI systems. Broadly speaking:

A key point here is that making AI systems safe enough to commercialize (with some initial success and profits) could be much less (and different) effort than making them robustly safe (no lurking risk of global catastrophe). The basic reasons for this are covered in my previous post on difficulties with AI safety research. In brief:

Well-meaning AI companies with active ethics boards might do a lot of AI safety work, by training AIs not to behave in unhelpful or dangerous ways. But if they want to address the risks I’m focused on here, this could require safety measures that look very different - e.g., measures more reliant on “checks and balances” and “digital neuroscience.”

Threat assessment (alerting others about the mines)

In addition to making AI systems safer, cautious actors can also put effort into measuring and demonstrating how dangerous they are (or aren’t).

For the same reasons given in the previous section, it could take special efforts to find and demonstrate the kinds of dangers I’ve been discussing. Simply monitoring AI systems in the real world for bad behavior might not do it. It may be necessary to examine (or manipulate) their digital brains;[3] design AI systems specifically to audit other AI systems for signs of danger; deliberately train AI systems to demonstrate particular dangerous patterns (while not being too dangerous!); etc.

Learning and demonstrating that the danger is high could help convince many actors to move more slowly and cautiously. Learning that the danger is low could lessen some of the tough tradeoffs here and allow cautious actors to move forward more decisively with developing advanced AI systems; I think this could be a good thing in terms of what sorts of actors lead the way on transformative AI.

Avoiding races (to move more cautiously through the minefield)

Here’s a dynamic I’d be sad about:

(Similar dynamics could apply to Country A and B, with national AI development projects.)

If Companies A and B would both “love to move slowly and be careful” if they could, it’s a shame that they’re both racing to beat each other. Maybe there’s a way to avoid this dynamic. For example, perhaps Companies A and B could strike a deal - anything from “collaboration and safety-related information sharing” to a merger. This could allow both to focus more on precautionary measures rather than on beating the other. Another way to avoid this dynamic is discussed below, under standards and monitoring.

“Finding ways to avoid a furious race” is not the kind of dynamic that emerges naturally from markets! In fact, working together along these lines would have to be well-designed to avoid running afoul of antitrust regulation.

Selective information sharing - including security (so the incautious don’t catch up)

Cautious actors might want to share certain kinds of information quite widely:

At the same time, as long as there are incautious actors out there, information can be dangerous too:

The lines between these categories of information might end up fuzzy. Some information might be useful for demonstrating the dangers and capabilities of cutting-edge systems, or useful for making systems safer and for building them in the first place. So there could be a lot of hard judgment calls here.

This is another area where I worry that commercial incentives might not be enough on their own. For example, it is usually important for a commercial project to have some reasonable level of security against hackers, but not necessarily for it to be able to resist well-resourced attempts by states to steal its intellectual property.

Global monitoring (noticing people about to step on mines, and stopping them)

Ideally, cautious actors would learn of every case where someone is building a dangerous AI system (whether purposefully or unwittingly), and be able to stop the project. If this were done reliably enough, it could take the teeth out of the threat; a partial version could buy time.

Here’s one vision for how this sort of thing could come about:

If the situation becomes very dire - i.e., it seems that there’s a high risk of dangerous AI being deployed imminently - I see the latter bullet point as one of the main potential hopes. In this case, governments might have to take drastic actions to monitor and stop dangerous projects, based on limited information.

Defensive deployment (staying ahead in the race)

I’ve emphasized the importance of caution: not deploying AI systems when we can’t be confident enough that they’re safe.

But when confidence can be achieved (how much confidence? See footnote 5), powerful-and-safe AI can help reduce risks from other actors in many possible ways.

Some of this would be by helping with all of the above. Once AI systems can do a significant fraction of the things humans can do today, they might be able to contribute to each of the activities I’ve listed so far:

Additionally, if safe AI systems are in wide use, it could be harder for dangerous (similarly powerful) AI systems to do harm. This could be via a wide variety of mechanisms. For example:

So?

I’ve gone into some detail about why we might have a challenging situation (“racing through a minefield”) if powerful AI systems (a) are developed fairly soon; (b) present significant risk of misalignment leading to humanity being defeated; (c) are not particularly easy to assess for safety.

I’ve also talked about what I see as some of the key ways that “cautious actors” concerned about misaligned AI might navigate this situation.

I talk about some of the implications in my more detailed piece. Here I’m just going to name a couple of observations that jump out at me from this analysis:

This seems hard. If we end up in the future envisioned in this piece, I imagine this being extremely stressful and difficult. I’m picturing a world in which many companies, and even governments, can see the huge power and profit they might reap from deploying powerful AI systems before others - but we’re hoping that they instead move with caution (but not too much caution!) and take the kinds of actions described above, and that ultimately cautious actors “win the race” against less cautious ones.

Even if AI alignment ends up being relatively easy - such that a given AI project can make safe, powerful systems with about 10% more effort than making dangerous, powerful systems - the situation still looks pretty nerve-wracking, because of how many different players could end up trying to build systems of their own without putting in that 10%.
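To make the "how many different players" point concrete, here is a minimal sketch under assumptions that are mine, not the post's: each project independently skips the extra safety effort with the same probability. The function name `prob_any_skips` and the parameter `p_skip` are hypothetical, and `p_skip` is a made-up chance of cutting corners (it is not the "10% more effort" figure above, which describes cost rather than probability). Real projects are unlikely to be independent or identical, so treat this only as an illustration of why the number of players matters.

```python
def prob_any_skips(num_players: int, p_skip: float) -> float:
    """Chance that at least one of `num_players` projects skips the extra
    safety effort, ASSUMING each skips independently with probability
    `p_skip` (an illustrative assumption, not a claim from the post)."""
    if num_players < 0:
        raise ValueError("num_players must be non-negative")
    if not 0.0 <= p_skip <= 1.0:
        raise ValueError("p_skip must be between 0 and 1")
    # Probability that every project puts in the effort is (1 - p_skip)^n;
    # the complement is the chance that at least one does not.
    return 1.0 - (1.0 - p_skip) ** num_players
```

Under these assumptions, calling `prob_any_skips(n, p)` for growing `n` shows the chance of at least one incautious project rising toward 1 even when each individual project is fairly likely to be careful.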

A lot of the most helpful actions might be “out of the ordinary.” When racing through a minefield, I hope key actors will:

As such, it could be very important whether key decision-makers (at both companies and governments) understand the risks and are prepared to act on them. Currently, I think we’re unfortunately very far from a world where this is true.

Additionally, I think AI projects can and should be taking steps today to make unusual-but-important measures more practical in the future. This could include things like:

Footnotes

  1. Generally, or at least, this is what I’d like it to refer to. 

  2. Thanks to beta reader Ted Sanders for suggesting this analogy in place of the older one, “removing mines from the minefield.”  

  3. One genre of testing that might be interesting: manipulating an AI system’s “digital brain” in order to simulate circumstances in which it has an opportunity to take over the world, and seeing whether it does so. This could be a way of dealing with the King Lear problem. More here

  4. Modern AI systems tend to be trained with lots of trial-and-error. The actual code that is used to train them might be fairly simple and not very valuable on its own; but an expensive training process then generates a set of “weights” which are ~all one needs to make a fully functioning, relatively cheap copy of the AI system. 

  5. I mean, this is part of the challenge. In theory, you should deploy an AI system if the risks of not doing so are greater than the risks of doing so. That’s going to depend on hard-to-assess information about how safe your system is and how dangerous and imminent others’ are, and it’s going to be easy to be biased in favor of “My systems are safer than others’; I should go for it.” Seems hard. 


Aaron Chang @ 2022-12-31T18:02 (+8)

I wonder whether leading groups holding equity stakes in each other could be one way of aligning incentives. An imperfect analogy is the auto industry, where, for example, Toyota, Subaru, and others hold equity in each other and share best practices in safety and hybrid tech. https://www.reuters.com/article/us-toyota-subaru/toyota-strengthens-japan-partnerships-with-bigger-subaru-stake-idUSKBN1WC04E