A Playbook for AI Risk Reduction (focused on misaligned AI)

By Holden Karnofsky @ 2023-06-06T18:05 (+81)

I sometimes hear people asking: “What is the plan for avoiding a catastrophe from misaligned AI?”

This post gives my working answer to that question - sort of. Rather than a plan, I tend to think of a playbook.1

Below I’ll briefly recap my overall picture of what success might look like (with links to other things I’ve written on this), then discuss four key categories of interventions: alignment research, standards and monitoring, successful-but-careful AI projects, and information security. For each, I’ll lay out:

  1. How a small improvement from the status quo could nontrivially improve our odds.
  2. How a big enough success at the intervention could put us in a very good position, even if the other three interventions are going poorly.
  3. Concerns and reservations about the intervention.

Overall, I feel like there is a pretty solid playbook of helpful interventions - any and all of which can improve our odds of success - and that working on those interventions is about as much of a “plan” as we need for the time being.

The content in this post isn’t novel, but I don’t think it’s already-consensus: two of the four interventions (standards and monitoring; information security) seem to get little emphasis from existential risk reduction communities today, and one (successful-but-careful AI projects) is highly controversial and seems often (by this audience) assumed to be net negative.

Many people think most of the above interventions are doomed, irrelevant or sure to be net harmful, and/or that our baseline odds of avoiding a catastrophe are so low that we need something more like a “plan” to have any hope. I have some sense of the arguments for why this is, but in most cases not a great sense (at least, I can’t see where many folks’ level of confidence is coming from). A lot of my goal in posting this piece is to give such people a chance to see where I’m making steps that they disagree with, and to push back pointedly on my views, which could change my picture and my decisions.

As with many of my posts, I don’t claim personal credit for any new ground here. I’m leaning heavily on conversations with others, especially Paul Christiano and Carl Shulman.

My basic picture of what success could look like

I’ve written a number of nearcast-based stories of what it might look like to suffer or avoid an AI catastrophe. I’ve written a hypothetical “failure story” (How we might stumble into AI catastrophe); two “success stories” that assume good decision-making by key actors; and an outline of how we might succeed with “minimal dignity.”

The essence of my picture has two phases:

  1. Navigating the initial alignment problem:2 getting to the first point of having very powerful (human-level-ish), yet safe, AI systems. For human-level-ish AIs, I think it’s plausible that the alignment problem is easy, trivial or nonexistent. It’s also plausible that it’s fiendishly hard.
  2. Navigating the deployment problem:3 reducing the risk that someone in the world will deploy dangerous systems, even though the basic technology exists to make powerful (human-level-ish) AIs safe. (This is often discussed through the lens of “pivotal acts,” though that’s not my preferred framing.4)
    1. You can think of this as containing two challenges: stopping misaligned human-level-ish AI, and maintaining alignment as AI goes beyond human level.
    2. The basic hope (discussed here) is that “safe actors”5 team up to the point where they outnumber and slow/stop “unsafe actors,” via measures like standards and monitoring - as well as alignment research (to make it easier for all actors to be effectively “cautious”), threat assessment research (to turn incautious actors cautious), and more.
    3. If we can get aligned human-level-ish AI, it could be used to help with all of these things, and a small lead for “cautious actors” could turn into a big and compounding advantage. More broadly, the world will probably be transformed enormously, to the point where we should consider ~all outcomes in play.

4 key categories of interventions

Here I’ll discuss the potential impact of both small and huge progress on each of 4 major categories of interventions.

For more detail on interventions, see Jobs that can help with the most important century; What AI companies can do today to help with the most important century; and How major governments can help with the most important century.

Alignment research

How a small improvement from the status quo could nontrivially improve our odds. I think there are various ways we could “get lucky” such that aligning at least the first human-level-ish AIs is relatively easy, and such that relatively small amounts of progress make the crucial difference.

How a big enough success at the intervention could put us in a very good position, even if the other three interventions are going poorly. The big win here would be some alignment (or perhaps threat assessment) technique that is both scalable (works even for systems with far-beyond-human capabilities) and cheap (can be used by a given AI lab without having to pay a large “alignment tax”). This seems pretty unlikely to be imminent, but not impossible, and it could lead to a world where aligned AIs heavily outnumber misaligned AIs (a key hope).

Concerns and reservations. Quoting from a previous piece, three key reasons people give for expecting alignment to be very hard are:

Standards and monitoring

How a small improvement from the status quo could nontrivially improve our odds. Imagine that:

I think this kind of situation would be a major improvement over the status quo, if only via incentives for top AI labs to move more carefully and invest more energy in alignment. Even a squishy, gameable standard, accompanied by mostly-theoretical possibilities of future regulation and media attention, could add to the risks (bad PR, employee dissatisfaction, etc.) and general pain of scaling up and releasing models that can’t be shown to be safe.

This could make it more attractive for companies to do their best with less capable models while making serious investments in alignment work (including putting more of the “results-oriented leadership effort” into safety - e.g., “We really need to make better alignment progress, where are we on that?” as opposed to “We have a big safety team, what more do you want?”). And it could create a big financial “prize” for anyone (including outside of AI companies) who comes up with an attractive approach to alignment.

How a big enough success at the intervention could put us in a very good position, even if the other three interventions are going poorly. A big potential win is something like:

Concerns and reservations. A common class of concerns is along the lines of, “Any plausible standards would be squishy/gameable”; I think this is significantly true, but squishy/gameable regulations can still affect behavior a lot.9

Another concern: standards could end up with a dynamic like “Slowing down relatively cautious, high-integrity and/or law-abiding players, allowing less cautious players to overtake them.” I do think this is a serious risk, but I also think we could easily end up in a world where the “less cautious” players have trouble getting top talent and customers, which does some combination of slowing them down and getting them to adopt standards of their own (perhaps weaker ones, but which still affect their speed and incentives). And I think the hope of affecting regulation is significant here.

I think there’s a pretty common misconception that standards are hopeless internationally because international cooperation (especially via treaty) is so hard. But there is precedent for the US enforcing various things on other countries via soft power, threats, cyberwarfare, etc. without treaties or permission, and in a high-stakes scenario, it could do quite a lot of this.

Successful, careful AI lab

Conflict of interest disclosure: my wife is co-founder and President of Anthropic and owns significant equity in both Anthropic and OpenAI. This may affect my views, though I don't think it is safe to assume specific things about my takes on specific AI labs due to this.10

How a small improvement from the status quo could nontrivially improve our odds. If we just imagine an AI lab that is even moderately competitive on capabilities while being substantially more concerned about alignment than its peers, such a lab could:

How a big enough success at the intervention could put us in a very good position, even if the other three interventions are going poorly. If an AI lab ends up with a several-month “lead” on everyone else, this could enable huge amounts of automated alignment research, threat assessment (which could create very strong demonstrations of risk in the event that automated alignment research isn’t feasible), and other useful tasks with initial human-level-ish systems.

Concerns and reservations. This is a tough one. AI labs can do ~unlimited amounts of harm, and it currently seems hard to get a reliable signal from a given lab’s leadership that it won’t. (Up until AI systems are actually existentially dangerous, there’s ~always an argument along the lines of “We need to move as fast as possible and prioritize fundraising success today, to stay relevant so we can do good later.”) If you’re helping an AI lab “stay in the race,” you had better have done a good job deciding how much you trust leadership, and I don’t see any failsafe way to do that.

That said, it doesn’t seem impossible to me to get this right-ish (e.g., I think today’s conventional wisdom about which major AI labs are “good actors” on a relative basis is neither uninformative (in the sense of rating all labs about the same) nor wildly off), and if you can, it seems like there is a lot of good that can be done by an AI lab.

I’m aware that many people think something like “Working at an AI lab = speeding up the development of transformative AI = definitely bad, regardless of potential benefits,” but I’ve never seen this take spelled out in what seems like a convincing way, especially since it’s pretty easy for a lab’s marginal impact on speeding up timelines to be small (see above).

I do recognize a sense in which helping an AI lab move forward with AI development amounts to “being part of the problem”: a world in which lots of people are taking this action seems worse than a world in which few-to-none are. But the latter seems off the table, not because of Molochian dynamics or other game-theoretic challenges, but because most of the people working to push forward AI simply don’t believe in and/or care about existential risk ~at all (and so their actions don’t seem responsive in any sense, including acausally, to how x-risk-concerned folks weigh the tradeoffs). As such, I think “I can’t slow down AI that much by staying out of this, and getting into it seems helpful on balance” is a prima facie plausible argument that has to be weighed on the merits of the case rather than dismissed with “That’s being part of the problem.”

I think helping out AI labs is the trickiest and highest-downside intervention on my list, but it seems quite plausibly quite good in many cases.

Information security

How a small improvement from the status quo could nontrivially improve our odds. It seems to me that the status quo in security is rough (more), and I think a small handful of highly effective security people could have a very large marginal impact. In particular, it seems like it is likely feasible to make it at least difficult and unreliable for a state actor to steal a fully-developed powerful AI system.

How a big enough success at the intervention could put us in a very good position, even if the other three interventions are going poorly. I think this doesn’t apply so much here, except for a somewhat far-fetched potential case in which someone develops (perhaps with assistance from early powerful-but-not-strongly-superhuman AIs) a surprisingly secure environment that can contain even misaligned AIs significantly (though probably not unboundedly) more capable than humans.

Concerns and reservations. My impression is that most people who aren’t excited about security think one of these things:

  1. The situation is utterly hopeless - there’s no path to protecting an AI from being stolen.
  2. Or: this isn’t an area to focus on because major AI labs can simply hire non-x-risk-motivated security professionals, so why are we talking about this?

I disagree with #2 for reasons given here (I may write more on this topic in the future).

I disagree with #1 as well.

Notes


  1. After drafting this post, I was told that others had been making this same distinction and using this same term in private documents. I make no claim to having come up with it myself! 

  2. Phase 1 in this analysis 

  3. Phase 2 in this analysis 

  4. I think there are ways things could go well without any particular identifiable “pivotal act”; see the “success stories” I linked for more on this. 

  5. “Safe actors” corresponds to “cautious actors” in this post. I’m using a different term here because I want to include the possibility that actors are safe mostly due to luck (slash cheapness of alignment) rather than caution per se. 

  6. The latter, more dangerous possibility seems more likely to me, but it seems quite hard to say. (There could also of course be a hybrid situation, as the number and capabilities of AI grow.) 

  7. In the judgment of an auditor, and/or an internal evaluation that is stress-tested by an auditor, or simply an internal evaluation backed by the risk that inaccurate results will result in whistleblowing.

  8. I.e., given access to its own weights, it could plausibly create thousands of copies of itself with tens of millions of dollars at their disposal, and make itself robust to an attempt by a few private companies to shut it down. 

  9. A comment from Carl Shulman on this point that seems reasonable: "A key difference here seems to be extremely rapid growth, where year on year effective compute grows 4x or more. So a defector with 1/16th the resources can produce the same amount of danger in 1-2 years, sooner if closer to advanced AGI and growth has accelerated. The anti-nuclear and anti-GMO movements cut adoption of those technologies by more than half, but you didn't see countries with GMO crops producing all the world's food after a few years, or France making so much nuclear power that all electricity-intensive industries moved there.

    For regulatory purposes you want to know if the regulation can block an AI capabilities explosion. Otherwise you're buying time for a better solution like intent alignment of advanced AI, and not very much time. That time is worthwhile, because you can perhaps get alignment or AI mind-reading to work in an extra 3 or 6 or 12 months. But the difference with conventional regulation interfering with tech is that the regulation is offsetting exponential growth; exponential regulatory decay only buys linear delay to find longer-term solutions.

    There is a good case that extra months matter, but it's a very different case from GMO or nuclear power. [And it would be far more to the credit of our civilization if we could do anything sensible at scale before the last few months or years.]" (A brief sketch of the arithmetic behind this point follows these notes.) 

  10. We would still be married even if I disagreed sharply with Anthropic’s strategy. In general, I rarely share my views on specific AI labs in public. 
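
To make the arithmetic in note 9 concrete, here is a minimal sketch (my own illustration, not from the post or from Shulman's comment; the `delay_years` helper and the specific numbers are purely illustrative): assuming effective compute grows by a constant factor each year, cutting a defector's resources by some factor only delays them by a number of years logarithmic in that factor.

```python
import math

def delay_years(resource_cut_factor: float, annual_growth_factor: float) -> float:
    """Years of delay bought by cutting a defector's effective compute by
    `resource_cut_factor`, assuming compute grows `annual_growth_factor`x per year."""
    return math.log(resource_cut_factor) / math.log(annual_growth_factor)

# A defector starting with 1/16th the resources catches up in ~2 years at 4x/year growth:
print(delay_years(16, 4))  # 2.0
# Halving adoption (roughly the scale of the GMO/nuclear effect) buys only ~half a year:
print(delay_years(2, 4))   # 0.5
```

Under these assumed numbers, regulation that would have been decisive for a slow-growing technology buys only months once growth is this fast, which is the sense in which "exponential regulatory decay only buys linear delay."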


Sol3:2 @ 2023-06-07T07:14 (+41)

Conflict of interest disclosure: my wife is co-founder and President of Anthropic. Please don’t assume things about my takes on specific AI labs due to this.10


This is really amazing. How much of Anthropic does Daniela own? How much does your brother-in-law own? If my family were in line to become billionaires many times over due to a certain AI lab becoming successful, this would certainly affect my takes.

finnhambly @ 2023-06-07T11:27 (+22)

Why is this getting downvoted? This comment seems plainly helpful; it's an important thing to highlight.

blueberry @ 2023-06-07T11:23 (+5)

Disclosing a conflict of interest demonstrates explicit awareness of potential bias. It's often done to make sure the reader tries to weigh the merits of the content by itself. Your comment shows me that you have (perhaps) not done so, by ignoring the points the author argued. If you see any evidence of bias in the takes in the article/post, can you be more specific? That way, the author is given an honest chance to defend his viewpoint.

finnhambly @ 2023-06-07T11:30 (+36)

I don't think this disclosure shows that much awareness, as the notes seem to dismiss it as a problem, unless I'm misunderstanding what Holden means by "don’t assume things about my takes on specific AI labs due to this". It sounds like he's claiming he's able to assess these things neutrally, which is quite a big claim!

Holden Karnofsky @ 2023-06-10T06:16 (+12)

Sorry, I didn't mean to dismiss the importance of the conflict of interest or say it isn't affecting my views.

I've sometimes seen people reason along the lines of "Since Holden is married to Daniela, this must mean he agrees with Anthropic on specific issue X," or "Since Holden is married to Daniela, this must mean that he endorses taking a job at Anthropic in specific case Y." I think this kind of reasoning is unreliable and has been incorrect in more than one specific case. That's what I intended to push back against.

Rebecca @ 2024-03-16T12:05 (+2)

It's often done to make sure the reader tries to weigh the merits of the content by itself.

My understanding is that it's usually meant to serve the opposite purpose: to alert readers to the possibility of bias so they can evaluate the content with that in mind and decide for themselves whether they think bias has crept in. The alternative is people being alerted to the CoI in the comments and being angry that quite relevant information was kept from them, not that they would otherwise still know about the bias and be unable to evaluate the article well because of it.

Sol3:2 @ 2023-06-07T11:37 (+1)

But he did not, in fact, disclose the conflict of interest. "My wife is President of Anthropic" means nothing in and of itself without some good idea of what stake she actually owns. 

Holden Karnofsky @ 2023-06-10T06:21 (+8)

I expected readers to assume that my wife owned significant equity in Anthropic; I've now edited the post to state this explicitly (and also added a mention of her OpenAI equity, which I should've included before and have included in the past). I don't plan to disclose the exact amount and don't think this is needed for readers to have sufficient context on my statements here.

Sol3:2 @ 2023-06-15T09:10 (+11)

I very strongly disagree. There's a huge difference between $1 billion (Jaan Tallinn money) and many tens of billions (Dustin Moskovitz money or even Elon money). You know this better than anyone. Jaan is a smart guy and can spend his money carefully and well to achieve some cool things, but Dustin can quite literally single-handedly bankroll an entire elite movement across multiple countries. Money is power even if - especially if! - you plan to give it away. The exact amount that Dario and Daniela - and of course you by association - might wind up with here is extremely relevant. If it's just a billion or two even if Anthropic does very well, fair enough, I wouldn't expect that to influence your judgment much at all given that you already exercise substantial control over the dispersion of sums many times in excess of that. If it's tens of billions, this is a very different story and we might reasonably assume that this would in fact colour your opinions concerning Anthropic. Scale matters!

michel @ 2023-06-09T11:31 (+6)

Thanks for sharing this! I think it's great you made this public.

blueberry @ 2023-06-06T20:53 (+5)

A single really convincing demonstration of something like deceptive alignment could make a big difference to the case for standards and monitoring (next section).

This struck me as a particularly good example of a small improvement having a meaningful impact. On a personal note, seeing the example of deceptive alignment you wrote about would make me immediately move to the hit-the-emergency-brakes/burn-it-all-down camp. I imagine that many would react in a similar way, which might place a lot of pressure on AI labs to collectively start implementing some strict (not just for show) standards.