On “first critical tries” in AI alignment

By Joe_Carlsmith @ 2024-06-05T00:19 (+29)

This is a crosspost, probably from LessWrong. Try viewing it there.

titotal @ 2024-06-05T12:47 (+8)

I think you're trying way too hard to rescue a term that just kinda sucks and should probably be abandoned. There is no way to reliably tell in advance whether a try is "the first critical try": we can only tell that a try was not critical after the fact, when an AI rebels and is defeated. Also, how does this deal with probabilities? Does it kick in when the probability of winning is over 50%? 90%? 99%?

The AI also can't reliably tell whether a try is critical. It could mistakenly think it can take over the world when it can't, or it could be overcautious and think it can't take over the world when it can. In the latter case, you could succeed completely on your "first critical try" while still having a malevolent AI that will kill you a few tries later.

The main effect seems to be an emotive one: it evokes the idea that "we have to get it right on the first try". But the first "critical try" could be version number billion trillion, which is a lot less scary.

I do like your "decisive strategic advantage" term; I think it could replace "first critical try" entirely with no loss.

Linch @ 2024-06-05T09:47 (+4)

(Sorry if this comment is unclear! Musing out loud.)

Thanks for the post and the general sharpening of ideas! One potential disagreement I have with your analysis is that you seem to tie the "first critical try" concept to the "Decisive Strategic Advantage" (DSA) concept, but those two seem separable to me. Or rather, my understanding is that DSA motivates some of the first-critical-try arguments but is not necessary for them. For example, suppose we set in motion at time t a series of actions that are irrecoverable (e.g., we make unaligned AIs integral to the world economy). We may not realize what we did until time t+5, at which point it's too late.

In my understanding of the Yudkowsky/Soares framework, this is like saying "I know with 99%+ certainty that Magnus Carlsen can beat you in chess, even if I can't predict how." Similarly, the superhuman agent(s) may end up "beating" humanity through a variety of channels, and a violent/military takeover is just one example. In that sense, the creation/deployment of those agents was the "first critical try" that we messed up, even if we hardened the world against their capability for a military coup.

When looking at the world today and thinking of the ways that smart and amoral people expropriate resources from less intelligent people, sometimes it looks like very obviously and transparently nonconsensual or deceptive behavior. But often it looks more mundane: payday loans, money pumps like warranties and insurance for small items, casinos, student loan forgiveness, and so forth. (The harms are somewhat limited in practice due to a mixture of (a) smart humans not being that much smarter than other humans, and (b) partial alignment of values.)

Similarly, we may end up living in a world where it eventually becomes possible for either agents or processes to wrest control from humanity. In that world, whether we have a single "first critical try" or multiple tries then depends on specific empirical details: how many transition points there are, and which ones end up being preventable in practice.

SummaryBot @ 2024-06-05T12:59 (+1)

Executive summary: The notion of needing to get AI alignment right on the "first critical try" can refer to several different scenarios involving AI systems gaining decisive strategic advantages, each with different prospects for avoidance and different requirements for leading to existential catastrophe.

Key points:

  1. A "unilateral DSA" is when a single AI agent could take over the world if it tried, even without cooperation from other AIs. Avoiding this requires keeping the world sufficiently empowered relative to individual AI systems.
  2. A "coordination DSA" is when a set of AI agents could coordinate to take over the world if they tried. This is harder to avoid than unilateral DSAs due to likely reliance on AI agents to constrain each other, but could be delayed by preventing coordination between AIs.
  3. A "short-term correlation DSA" is when a set of AI agents seeking power in problematic ways within a short time period, without coordinating, would disempower humanity. This is even harder to avoid than coordination DSAs.
  4. A "long-term correlation DSA" is similar but with a longer time window, making it easier to avoid than short-term correlation DSAs by allowing more time to notice and correct instances of power-seeking.
  5. How worrying each type of DSA is depends heavily on the difficulty of making AIs robustly aligned; a key concern is that we may not be able to learn enough about AI motivations from lower-stakes testing.

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.