An Exercise to Build Intuitions on AGI Risk

By Lauro Langosco @ 2023-06-08T11:20 (+4)

This is a linkpost to https://www.alignmentforum.org/posts/uKujHaJd2ckAKAevo/an-exercise-to-build-intuitions-on-agi-risk

Epistemic status: confident that the underlying idea is useful; less confident about the details, though they're straightforward enough that I expect they're mostly in the right direction.

TLDR: This post describes a pre-mortem-like exercise that I find useful for thinking about AGI risk. It is the only way I know of to train big-picture intuitions about what solution attempts are more or less promising and what the hard parts of the problem are. The (simple) idea is to iterate between constructing safety proposals ('builder step') and looking for critical flaws in a proposal ('breaker step').

Introduction

The way that scientists-in-training usually develop research taste is to smash their heads against reality until they have good intuitions about things like which methods tend to work, how to interpret experimental results, or when to trust their proof of a theorem. This important feedback loop is mostly absent in AGI safety research, since we study a technology that does not exist yet (AGI). As a result, it is hard to develop a good understanding of which avenues of research are most promising and what the hard bits of the problem even are.[1]

The best way I know of to approximate that feedback loop is an iterative exercise with two steps: 1) propose a solution to AGI safety, and 2) look for flaws in the proposal. The idea is simple, but most people don’t do it explicitly or don’t do it often enough.

Multiple rounds of this exercise tend to bring up details about one’s assumptions and predictions that would otherwise stay implicit or unnoticed. Writing down specific flaws of a specific proposal helps ground more general concepts like instrumental convergence or claims like ‘corrigibility is unnatural’. And after some time, the patterns in the flaws (the ‘hard bits’) become visible on their own.

I ran an earlier version of this exercise as a workshop (an important component is to discuss your ideas with others, so a workshop format is convenient). Here are the slides.

The exercise

The exercise consists of two phases:[2] a builder phase in which you write down a best guess / proposal for how we might avoid existential risk from AGI, and a breaker phase in which you dig into the details until you understand how the proposal fails.

Importantly, in the context of this exercise the only thing that counts is your own inside view, that is, your own understanding of the technical or political feasibility of the proposal. You might have thoughts like “There’s smart people who have thought about this much longer than I have, and they think X; why should I disagree?”. Put that aside for now; the point is to develop your own views, and that works best when you don’t think too much about other people’s views except to inform your own thoughts.

Builder phase

Write down the proposal: a plausible story for how we might avoid human extinction or disempowerment due to AGI.[3] It doesn’t need to be very detailed yet; that comes in the breaker phase.

The proposal does not need to be purely technical; e.g. governance approaches are fair game.

Example builder phase (oracle AI): Instead of building an “agent AI” that acts in the world, we could build a system that just tries to make good predictions (an “oracle”). An oracle would be very useful and economically valuable while avoiding existential risk from AGI, because an oracle has no agency and thus no reason to act against us.

If you get stuck, i.e. you can’t come up with an AGI safety proposal (don’t worry, this is a common problem):

→ Write down in broad outlines what you expect to happen if we develop AGI. If that inevitably ends badly, start with the breaker phase: describe a failure scenario, then try to find a fix.

→ Talk to an AGI optimist, if you can find one. If they have an idea that seems to you to have no obvious flaws, start with that. Alternatively, look for written proposals like the OpenAI alignment plans.

Breaker phase

Make the proposal detailed and concrete. Try to find flaws. Adopt a security mindset / AI safety mindset.

Example breaker phase (oracle AI): Let’s say we go ahead and build an oracle AGI. What exactly are we planning to do with this oracle? If the runner-up AI lab builds an agentic AGI 6 months later, their AGI might cause a catastrophe even if we’re careful. It’s not enough for the idea to be safe; it needs to be useful for alignment somehow, or otherwise help us prevent disaster from a competitor AGI. The current proposal doesn’t say anything about how to do that, which is a critical flaw.[4]

If you get stuck, i.e. it seems like the proposal works:

→ Consider different kinds of ways the proposal might fail. A useful resource here is this very appropriately titled essay.

→ Write up your proposal and get others to critique it.

Iterate

If the proposal seemed promising to start with, it’s plausible that a single serious flaw will not be enough to wreck it beyond repair. If you can see a way to adapt the proposal to fix the flaw, go back to the builder phase and repeat.

Example fix (oracle AI): So we need to adapt the proposal to make sure we can do something useful with the oracle AI that prevents a less careful competitor lab from causing a disaster. Maybe an oracle can help us by evaluating our plans to convince other companies not to build AGI?

→ Adapted proposal (Oracle AI 2): Instead of building an “agent AI” that acts in the world, we could build a system that just tries to make good predictions (an “oracle”). An oracle would be very useful and economically valuable while avoiding existential risk from AGI, because an oracle has no agency and thus no reason to act against us. We train the oracle to be good at answering questions such as “will research program X have catastrophic consequences?” and at evaluating the consequences of actions such as “talk to person X to convince them they should stop research program X”. The oracle will warn us if another lab gets close to deploying a dangerous AGI, and if so it can tell us how to convince them to stop.

If you get stuck, i.e. it seems like the proposal is unfixable:

→ Talk to others about your idea, in particular if you know people who are optimistic about ideas similar to the proposal you’re working with. Send them your notes and ask for opinions.

→ If that fails: congratulations, you have completed the exercise! Start again from scratch with a new idea :)
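
For readers who think in code, the control flow of the exercise can be sketched as a small loop. This is only an illustration of the builder / breaker / iterate structure described above, not a tool for actually evaluating proposals: the names `run_exercise`, `ExerciseLog`, `find_flaw` and `fix` are hypothetical, and the two callbacks stand in for the thinking you do yourself in each phase. The round limit is also my assumption; the exercise itself has no fixed number of rounds.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ExerciseLog:
    """Hypothetical record of one run of the exercise."""
    proposals: List[str] = field(default_factory=list)
    flaws: List[str] = field(default_factory=list)
    outcome: str = ""


def run_exercise(
    initial_proposal: str,
    find_flaw: Callable[[str], Optional[str]],
    fix: Callable[[str, str], Optional[str]],
    max_rounds: int = 10,  # assumed cap, only so the sketch terminates
) -> ExerciseLog:
    log = ExerciseLog()
    proposal = initial_proposal  # result of the first builder phase
    for _ in range(max_rounds):
        log.proposals.append(proposal)
        flaw = find_flaw(proposal)  # breaker phase
        if flaw is None:
            # Stuck in the breaker phase: the proposal seems to work.
            log.outcome = "stuck: no flaw found, get others to critique it"
            return log
        log.flaws.append(flaw)
        revised = fix(proposal, flaw)  # iterate: back to the builder phase
        if revised is None:
            # The proposal seems unfixable: the exercise is complete.
            log.outcome = "unfixable: start again with a new idea"
            return log
        proposal = revised
    log.outcome = "round limit reached"
    return log


# Toy stand-ins for the oracle AI example; real runs replace these
# with your own inside-view reasoning.
def toy_breaker(proposal: str) -> Optional[str]:
    if "warn us" not in proposal:
        return "says nothing about preventing disaster from a competitor AGI"
    return None


def toy_fixer(proposal: str, flaw: str) -> Optional[str]:
    return proposal + " The oracle will warn us if another lab gets close."


log = run_exercise("Build an oracle AI instead of an agent AI.",
                   toy_breaker, toy_fixer)
print(len(log.proposals), "proposals,", len(log.flaws), "flaw(s);", log.outcome)
```

The list of flaws collected in the log is the useful output: over several runs, it is where the recurring patterns (the ‘hard bits’) show up.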

Details

Resources

Writing on AGI safety

If you decide to do this exercise, you’ll probably (depending on how much you’ve already read) find it useful to read other people’s thoughts on the topic. I’ve compiled some resources that you might want to read through for inspiration at various points in this exercise. The list is very incomplete; it’s just what I could come up with off the top of my head.


Breakers (criticisms of AGI Safety proposals & arguments for why safety is harder than one might otherwise think):


Builders (solution proposals):


Lists / collections of posts and papers:

Other writing on how to learn about / work in AGI safety

After I wrote this post I noticed that there’s already a post by Abram Demski that describes basically the same exercise, and later people pointed out to me that John Wentworth runs a similar exercise that is briefly described here. Both of those seem worth reading if you want more perspectives on the builder/breaker exercise, as is Paul Christiano’s post on his research methodology.


Neel Nanda has a good post on forming your own views in AGI safety.


The MIRI alignment research field guide covers some useful basics for doing research and discussion groups with others.

[1] Of course, AGI safety researchers do build research experience in adjacent fields like deep learning and maths, but there are intuitions and ways of thinking specific to AGI safety that one doesn’t typically inherit from other fields.

[2] I adopt the terms “builder / breaker” from the ELK report, though I may not be using the terms in exactly the same way.

[3] If helpful, you can choose a more concrete disaster scenario, such as “an autonomous human-level AGI breaks containment”.

[4] I'm somewhat dissatisfied with this example because the flaw is obvious enough that there's no need to go into much concrete detail. Usually you'd do more of that, e.g. if the plan is to use the oracle or 'tool-AI' to prevent a dangerous AGI from being built, how exactly might that work?