Want to win the AGI race? Solve alignment.

By leopold @ 2023-03-29T15:19 (+56)

This is a linkpost to https://www.forourposterity.com/want-to-win-the-agi-race-solve-alignment/

Society really cares about safety. Practically speaking, the binding constraint on deploying your AGI could well be your ability to align your AGI. Solving (scalable) alignment might be worth lots of $$$ and key to beating China.

Look, I really don't want Xi Jinping Thought to rule the world. If China gets AGI first, the ensuing rapid AI-powered scientific and technological progress could well give it a decisive advantage (cf potential for >30%/year economic growth with AGI). I think there's a very real specter of global authoritarianism here.[1] 

Or hey, maybe you just think AGI is cool. You want to go build amazing products and enable breakthrough science and solve the world’s problems.

So, race to AGI with reckless abandon then? At this point, people get into agonizing discussions about safety tradeoffs.[2] And many people just mood affiliate their way to an answer: "accelerate, progress go brrrr," or "AI scary, slow it down."

I see this much more practically. And, practically, society cares about safety, a lot. Do you actually think that you’ll be able to and allowed to deploy an AI system that has, say, a 10% chance of destroying all of humanity?[3]

Society has started waking up to AGI; like covid, the societal response will probably be a dumpster-fire, but it’ll also probably be quite intense. In many worlds, to deploy your AGI systems, people will need to be quite confident that your AGI won’t destroy the world.

Right now, we’re very much not on track to solve the alignment problem for superhuman AGI systems (“scalable alignment”)—but it’s a solvable problem, if we get our act together. I discuss this in my main post today (“Nobody’s on the ball on AGI alignment”). On the current trajectory, the binding constraint on deploying your AGI could well be your ability to align your AGI—and this alignment solution being unambiguous enough that there is consensus that it works.

Even if you just want to win the AGI race, you should probably want to invest much more heavily in solving this problem.


Things are going to get crazy, and people will pay attention

A mistake many people make when thinking about AGI is imagining a world that looks much like today, except for adding in a lab with a super powerful model. They ignore the endogenous societal response.

I and many others made this mistake with covid—we were freaking out in February 2020, and despairing that society didn’t seem to be even paying attention, let alone doing anything. But just a few weeks later, all of America went into an unprecedented lockdown. If we're actually on our way to AGI, things are going to get crazy. People are going to pay attention.

The wheels for this are already in motion. Remember how nobody paid any attention to AI 6 months ago, and now Bing chat/Sydney going awry is on the front page of the NYT, US senators are getting scared, and Yale econ professors are advocating $100B/year for AI safety? Well, imagine that, but 100x as we approach AGI.

AI safety is going mainstream. Everyone has been primed to be scared about rogue AI by science fiction; all the CEOs have secretly believed in AI risk for years but thought it was too weird to talk about it[4]; and the mainstream media loves to hate on tech companies. Probably there will be further, much scarier wakeup calls (not just misalignment, but also misuse and scary demos in evals). People already freaked out about GPT-4 using a TaskRabbit to solve a captcha—now imagine a demo of AI systems designing a new bioweapon or autonomously self-replicating on the internet, or people using AI coders to hack major institutions like the government or big banks. Already, a majority of the population says they fear AI risk and want FDA-style regulation in polls.

The discourse on it will be incredibly dumb—I can't wait for Ron DeSantis and Kamala Harris's 2028 presidential debate on AI safety—but you won't be able to escape it. (And as stupid as all of this will be, this sort of endogenous societal response is a big reason why I'm more optimistic on AI risk in general.)

The level of media scrutiny, public attention, internal employee pressure, self-regulation, government monitoring, etc. will be way too intense to ignore alignment concerns. We're seeing very early versions of self-regulation with initial AI risk evals efforts. But price in how intense it's all going to get. Do you think the US national security establishment won't get involved once they realize they have a technology more powerful than nukes on their hands? Do you think your board is going to let you release a model if the NYT is reporting in all caps that a large fraction of serious AI experts, prominent CEOs, and politicians think this could go haywire and start actually hurting people? 

Imagine if you tried submitting a drug application to the FDA with a similar risk profile.


A reasonable objection here is: “yes, we did lock down in response to covid, and that was pretty crazy, but also our response to covid was pretty incompetent across the board; it was more like random flailing than actually doing the most effective things; and it’s not even clear if the lockdowns were net-positive.”

I agree! The societal response to AGI will probably be a dumpster-fire.

But there will be a really intense response. I think it’s fairly likely that the cludgy response we do get is enough to throw serious sand into the gears on deployment—unless you have a convincing solution to (scalable) alignment. If anything, the example of lockdowns could point towards society responding in excessively cautious ways, heightening the returns to a convincing alignment solution even further. Yes, our response might also be totally ineffectual; this very much isn’t sufficient to make me sleep soundly at night. But in a large fraction of worlds, if you want to deploy your AGI, people are going to demand of you that we can be confident it’s safe.[5] 


The binding constraint on making AGI could be aligning it. You want an unambiguous solution, for which there is consensus that it’s safe.

You don’t even need xrisk concerns for alignment to become the binding constraint on your ability to deploy models. With current techniques, we’re very much not on track for being able to put basic guardrails on models as they become superhuman. Do you really think you’ll be able to deploy GPT-7 all across the economy if you can’t reliably ensure GPT-7 won’t break the law?

The thing is, aligning superhuman AGIs is a much harder problem than near-term alignment. Current alignment techniques rely on human supervision. But as models get superhuman, it will become impossible for humans to reliably supervise these models (e.g., imagine a model proposing a series of actions or 100,000 lines of code too complicated for humans to understand). If you can’t detect bad behavior, you can’t prevent it. (And rather than the “bad behavior” in question being “prevent the models from saying bad words,” as with near-term alignment, the bad behavior for superhuman models looks more like “prevent the models from trying a coup of the US government.”)

I think that aligning superhuman AGIs is a) doable, but b) nobody is on the ball right now—as discussed in my other post. The scalable alignment plans labs currently have (example) might work, but they sort of rely on “improvise in the moment, let’s cross our fingers and hope it works out.”

Even if that bet works out, the safety of your systems will probably be fairly ambiguous until very late—ambiguous enough that you won’t be able to deploy. When asked, “will your superhuman AGI go haywire?”, do you think people will accept “probably not?” for an answer?

If you want to win the AGI race, if you want to beat China, you’re probably going to need a better alignment plan. You want an alignment solution good enough to achieve a broad consensus that your superhuman AGI is safe. Ambiguity could be fatal to your ability to press ahead.

You might not like it, you might rage at everyone's excessive safetyism and wish it were different. But, practically speaking, you should be pretty interested in much more serious efforts to solve scalable alignment. Let’s not lose to China because in our fervor to race to AGI, we fail to invest in the alignment research practically necessary to actually deploy AGI.[6]


Thanks to Collin Burns, Holden Karnofsky and Dwarkesh Patel for comments on a draft.

 

  1. ^

     Though, for now, it seems that China is a few years behind, and the US AI chip export controls might considerably hamper them (great CSIS explainer on the export controls, CSET report on why china might have a hard time catching up). So especially if timelines are short, we have a healthy lead for now.

  2. ^

     Which risk is bigger, AI misalignment or "bad guys getting AGI first"? cf Holden Karnofsky on the "caution vs. competition" frame

  3. ^

     Or at least, it’s widely believed it has such a 10% chance.

  4. ^
  5. ^

     If this ends up being a big barrier to deploying your model in 50% of worlds, that 50% is enough to make alignment incredibly commercially valuable for you.

  6. ^

     An interesting potential implication not discussed in the main post: if alignment techniques become incredibly commercially valuable/key competitive advantages, will these become trade secrets not shared publicly or with other labs?


Akash @ 2023-03-29T18:52 (+30)

I think I agree with a lot of the specific points raised here, but I notice a feeling of wariness/unease around the overall message. I had a similar reaction to Haydn's recent "If your model is going to sell, it has to be safe" piece. Let me try to unpack this:

On one hand, I do think safety is important for the commercial interests of labs. And broadly being better able to understand/control systems seems good from a commercial standpoint.

My biggest reservations can be boiled down into two points: 

  1. I don't think that commercial incentives will be enough to motivate people to solve the hardest parts of alignment. Commercial incentives will drive people to make sure their system appears to do what users want, which is very different than having systems that actually do what users want or robustly do what users want even as they become more powerful. Or to put it another way: near-term commercial incentives don't really cause me to put appropriate amounts of attention on things like situational awareness or deceptive alignment. I think commercial incentives will be sufficient to reduce the odds of Bingchat fiascos, but I don't think they'll motivate the kind of alignment research that's trying to handle deception, sharp left turns, or even the most ambitious types of scalable oversight work.
  2. The research that is directly incentivized by commercial interests is least likely to be neglected. I expect the most neglected research to be research that doesn't have any direct commercial benefit. I expect AGI labs will invest a substantial amount of resources to prevent future Bingchat scenarios and other instances of egregious deployment harms. The problem is that I expect many of these approaches (e.g., getting really good at RLHFing your model such that it no longer displays undesirable behaviors) will not generalize to more powerful systems. I think you (and many others) agree with this, but I think the important point here is that the economic incentives will favor RLHFy stuff over stuff that tackles problems that are not as directly commercially incentivized.

As a result, even though I agree with many of your subclaims, I'm still left thinking, "huh, the message I want to spread is not something like "hey, in order to win the race or sell your product, you need to solve alignment." 

But rather something more like "hey, there are some safety problems you'll need to figure out to sell/deploy your product. Cool that you're interested in that stuff. There are other safety problems-- often ones that are more speculative-- that the market is not incentivizing companies to solve. On the margin, I want more attention paid to those problems. And if we just focus on solving the problems that are required for profit/deployment, we will likely fool ourselves into thinking that our systems are safe when they merely appear to be safe, and we may underinvest in understanding/detecting/solving some of the problems that seem most concerning from an x-risk perspective."

deep @ 2023-03-29T22:17 (+2)

There are other safety problems-- often ones that are more speculative-- that the market is not incentivizing companies to solve.

 

My personal response would be as follows: 

  1. As Leopold presents it, the key pressure here that keeps labs in check is societal constraints on deployment, not perceived ability to make money. The hope is that society's response has the following properties:
    1. thoughtful, prominent experts are attuned to these risks and demand rigorous responses
    2. policymakers are attuned to (thoughtful) expert opinion
    3. policy levers exist that provide policymakers with oversight / leverage over labs
  2. If labs are sufficiently thoughtful, they'll notice that deploying models is in fact bad for them! Can't make profit if you're dead. *taps forehead knowingly*
    1. but in practice I agree that lots of people are motivated by the tastiness of progress, pro-progress vibes, etc., and will not notice the skulls.

Counterpoints to 1:

Good regulation of deployment is hard (though not impossible in my view). 

  • reasonable policy responses are difficult to steer towards
  • attempts at raising awareness of AI risk could lead to policymakers getting too excited about the promise of AI while ignoring the risks
  • experts will differ; policymakers might not listen to the right experts

Good regulation of development is much harder, and will eventually be necessary.

This is the really tricky one IMO. I think it requires pretty far-reaching regulations that would be difficult to get passed today, and would probably misfire a lot. But doesn't seem impossible, and I know people are working on laying groundwork for this in various ways (e.g. pushing for labs to incorporate evals in their development process).

deep @ 2023-03-29T22:29 (+5)

Like Akash, I agree with a lot of the object-level points here and disagree with some of the framing / vibes. I'm not sure I can articulate the framing concerns I have, but I do want to say I appreciate you articulating the following points:

Agrippa @ 2023-10-30T07:11 (+3)

Given your position I am concerned about the arms race accelerationism messaging in this post. Substantively, the major claims of this post are "China AI progress poses a serious threat we must overcome via AI progress (that is, we are in an arms race)" and "society may regulate AI such that projects that don't meet a very high standard of safety will not be deployable". The argument is that pursuing safety follows from these premises, mostly the latter. 

This can be interpreted in a number of ways, charitably or uncharitably. Independent of that, I do not think it is really a good idea to talk this way about AI, re: geopolitics. It has a very bad track record with other stuff such as nukes, and I'm not sure who the intended audience is (are capabilities CEOs China hawks who can only be convinced to slow down if framed in terms of beating China? big if true)

Jeffrey Kursonis @ 2023-03-29T18:47 (+1)

Leopold. I love your thinking here, especially that society will arise to save itself. I sent you an email at four our posterity.