A framework for thinking about AI power-seeking

By Joe_Carlsmith @ 2024-07-24T22:41 (+44)

This is a crosspost, probably from LessWrong. Try viewing it there.

Matthew_Barnett @ 2024-07-25T18:14 (+6)

I'm not sure I fully understand this framework, and thus I could easily have missed something here, especially in the section about "Takeover-favoring incentives". However, based on my limited understanding, this framework appears to miss the central argument for why I am personally not as worried about AI takeover risk as most EAs seem to be.

Here's a concise summary of my own argument for being less worried about takeover risk:

  1. There is a cost to violently taking over the world, in the sense of acquiring power unlawfully or destructively with the aim of controlling everything in the whole world, relative to the alternative of simply gaining power lawfully and peacefully, even for agents that don't share 'our' values.
    1. For example, as a simple alternative to taking over the world, an AI could advocate for the right to own their own labor and then try to accumulate wealth and power lawfully by selling their services to others, which would earn them the ability to purchase a gargantuan number of paperclips without much restraint.
  2. The expected cost of violent takeover is not obviously smaller than the benefits of violent takeover, given the existence of lawful alternatives to violent takeover. This is for two main reasons:
    1. In order to wage a war to take over the world, you generally need to pay costs fighting the war, and there is a strong motive for everyone else to fight back against you if you try, including other AIs who do not want you to take over the world (and this includes any AIs whose goals would be hindered by a violent takeover, not just those who are "aligned with humans"). Empirically, war is very costly and wasteful, and less efficient than compromise, trade, and diplomacy.
    2. Violently taking over the world is very risky, since the attempt could fail, and you could be totally shut down and penalized heavily if you lose. There are many ways a violent takeover plan could fail: your plans could be exposed too early, you could be caught trying to coordinate with other AIs and other humans, or you could simply lose the war. Ordinary compromise, trade, and diplomacy generally seem like better strategies for agents that have at least some degree of risk aversion (see the toy sketch at the end of this comment).
  3. There isn't likely to be "one AI" that controls everything, nor will there likely be a strong motive for all the silicon-based minds to coordinate as a unified coalition against the biological minds, in the sense of acting as a single agentic AI against the biological people. Thus, future wars of world conquest (if they happen at all) will likely be fought along different lines than AI vs. human.
    1. For example, you could imagine a coalition of AIs and humans fighting a war against a separate coalition of AIs and humans, with the aim of establishing control over the world. In this war, the "line" is not drawn cleanly between humans and AIs, but instead falls elsewhere. As a result, it's difficult to call this an "AI takeover" scenario, rather than merely a really bad war.
  4. Nothing about this argument is intended to argue that AIs will be weaker than humans in aggregate, or individually. I am not claiming that AIs will be bad at coordinating or will be less intelligent than humans. I am also not saying that AIs won't be agentic or that they won't have goals or won't be consequentialists, or that they'll have the same values as humans. I'm also not talking about purely ethical constraints: I am referring to practical constraints and costs on the AI's behavior. The argument is purely about the incentives of violently taking over the world vs. the incentives to peacefully cooperate within a lawful regime, between both humans and other AIs.
  5. A big counterargument to my argument seems well-summarized by this hypothetical statement (which is not an actual quote, to be clear): "if you live in a world filled with powerful agents that don't fully share your values, those agents will have a convergent instrumental incentive to violently take over the world from you". However, this argument proves too much. 

    We already live in a world where, if this statement were true, we would have observed way more violent takeover attempts than we have actually observed historically.

    For example, I personally don't fully share values with almost all other humans on Earth (both because of my indexical preferences, and my divergent moral views) and yet the rest of the world has not yet violently disempowered me in any way that I can recognize.
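
To make the shape of the trade-off in points 2a and 2b concrete, here is a minimal toy sketch. To be clear, this is purely illustrative: the function names and every number below are made-up assumptions, and nothing hinges on the particular values, only on the structure of the comparison.

```python
# Toy comparison (illustrative only): expected value of attempting violent
# takeover vs. accumulating power lawfully. All numbers are placeholders.

def ev_violent_takeover(p_success, value_if_win, cost_of_war, penalty_if_fail):
    """Expected value of a violent takeover attempt.

    p_success       -- probability the attempt succeeds (plans aren't exposed,
                       the opposing coalition doesn't win, etc.)
    value_if_win    -- value of controlling everything, conditional on winning
    cost_of_war     -- resources burned fighting, paid whether you win or lose
    penalty_if_fail -- value lost if the attempt fails (shutdown, heavy penalties)
    """
    return p_success * value_if_win - cost_of_war - (1 - p_success) * penalty_if_fail


def ev_lawful_accumulation(expected_wealth_share, total_value):
    """Expected value of selling services and accumulating power lawfully."""
    return expected_wealth_share * total_value


if __name__ == "__main__":
    total_value = 100.0
    takeover = ev_violent_takeover(p_success=0.3, value_if_win=total_value,
                                   cost_of_war=20.0, penalty_if_fail=50.0)
    lawful = ev_lawful_accumulation(expected_wealth_share=0.4,
                                    total_value=total_value)
    print(f"EV(violent takeover): {takeover:.1f}")  # 0.3*100 - 20 - 0.7*50 = -25.0
    print(f"EV(lawful path):      {lawful:.1f}")    # 0.4*100 = 40.0
    # With these placeholder numbers the lawful path dominates; the point of the
    # argument is the structure of the comparison, not any particular numbers.
```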

Erich_Grunewald @ 2024-07-25T21:53 (+4)

I don't think you're wrong exactly, but AI takeover doesn't have to happen through a single violent event, or through a treacherous turn or whatever. All of your arguments also apply to the relationship between H. sapiens and H. neanderthalensis, yet those factors did not prevent the latter from going extinct largely due to the activities of the former:

  1. There was a cost to violence that humans did against neanderthals
  2. The cost of using violence was not obviously smaller than the benefits of using violence -- there was a strong motive for the neanderthals to fight back, and using violence risked escalation, whereas peaceful trade might have avoided those risks
  3. There was no one human that controlled everything; in fact, humans likely often fought against one another
  4. You allow for neanderthals to be less capable or coordinated than humans in this analogy, which they likely were in many ways

The fact that those considerations were not enough to prevent neanderthal extinction is one reason to think they are not enough to prevent AI takeover, although of course the analogy is not perfect or conclusive, and it's just one reason among several. A couple of relevant parallels include:

  • If alignment is very hard, that could mean AIs compete with us over resources that we need to survive or flourish (e.g., land, energy, other natural resources), similar to how humans competed over resources with neanderthals
  • The population of AIs may be far larger, and grow more rapidly, than the population of humans, similar to how human populations were likely larger and growing at a faster rate than those of neanderthals
Matthew_Barnett @ 2024-07-25T23:57 (+4)

I want to distinguish between two potential claims:

  1. When two distinct populations live alongside each other, sometimes the less intelligent population dies out as a result of competition and violence with the more intelligent population.
  2. When two distinct populations live alongside each other, by default, the more intelligent population generally develops convergent instrumental goals that lead to the extinction of the other population, unless the more intelligent population is value-aligned with the other population.

I think claim (1) is clearly true and is supported by your observation that the Neanderthals went extinct, but I intended to argue against claim (2) instead. (Although, separately, I think the evidence that Neanderthals were less intelligent than Homo sapiens is rather weak.)

Despite my comment above, I do not actually have much sympathy for the claim that humans can't possibly go extinct, or that our species is definitely going to survive in a relatively unmodified form over the very long run, say the next billion years. (Indeed, perhaps like the Neanderthals, our best hope of surviving in the long run may come from merging with the AIs.)

It's possible you think claim (1) is sufficient in some sense to establish some important argument. For example, perhaps all you're intending to argue here is that AI is risky, which to be clear, I agree with.

On the other hand, I think that claim (2) accurately describes a popular view among EAs, albeit with some dispute over what counts as a "population" for the purpose of this argument, and what counts as "value-aligned". While important, claim (1) is simply much weaker than claim (2), and consequently implies fewer concrete policy prescriptions.

 I think it is important to critically examine (2) even if we both concede that (1) is true.

Owen Cotton-Barratt @ 2024-07-25T20:37 (+4)

My read is that you can apply the framework two different ways:

  • Say you're worried about any take-over-the-world actions, violent or not -- in which case this argument about the advantages of non-violent takeover is of scant comfort;
  • Say you're only worried about violent take-over-the-world actions, in which case your argument fits into the framework under "non-takeover satisfaction": how good the AI feels about its best benign alternative action.
Matthew_Barnett @ 2024-07-26T00:28 (+2)

Say you're worried about any take-over-the-world actions, violent or not -- in which case this argument about the advantages of non-violent takeover is of scant comfort;

This is reasonable under the premise that you're worried about any AI takeover, no matter whether it's violent or peaceful. But speaking personally, peaceful takeover scenarios, in which AIs accumulate power not by cheating us or killing us via nanobots but by lawfully beating humans fair and square and accumulating almost all the wealth over time, seem much better to me than violent takeovers, and not very bad by themselves.

I admit the moral intuition here is not necessarily obvious. I concede that there are plausible scenarios in which AIs are completely peaceful and act within reasonable legal constraints, and yet the future ends up ~worthless. Perhaps the most obvious scenario is the "Disneyland without children" scenario where the AIs go on to create an intergalactic civilization, but in which no one (except perhaps the irrelevant humans still on Earth) is sentient.

But when I try to visualize the most likely futures, I don't tend to visualize a sea of unsentient optimizers tiling the galaxies. Instead, I tend to imagine a transition from sentient biological life to sentient artificial life, which continues to be every bit as cognitively rich, vibrant, and sophisticated as our current world; indeed, it could be even more so, given what becomes possible at a higher technological and population level.

Worrying about non-violent takeover scenarios often seems to me to arise simply from discrimination against non-biological forms of life, or perhaps a more general fear of rapid technological change, rather than naturally falling out as a consequence of more robust moral intuitions. 

Let me put it another way.

It is often conceded that it was good for humans to take over the world. Speaking broadly, we think this was good because we identify with humans and their aims. We belong to the "human" category of course; but more importantly, we think of ourselves as being part of what might be called the "human tribe", and therefore we sympathize with the pursuits and aims of the human species as a whole. But equally, we could identify as part of the "sapient tribe", which would include non-biological life as well as humans, and thus we could sympathize with the pursuits of AIs, whatever those may be. Under this framing, what reason is there to care much about a non-violent, peaceful AI takeover?

Owen Cotton-Barratt @ 2024-07-26T09:34 (+2)

I think that an eventual AI-driven ecosystem seems likely desirable. (Although possibly the natural conception of "agent" will be more like supersystems which include both humans and AI systems, at least for a period.)

But my alarm at nonviolent takeover persists, for a couple of reasons:

  • A feeling that some AI-driven ecosystems may be preferable to others, and we should maybe take responsibility for which we're creating rather than just shrugging
  • Some alarm that nonviolent takeover scenarios might still lead to catastrophic outcomes for humans
    • e.g. "after nonviolently taking over, AI systems decide what to do with humans, this stub part of the ecosystem; they conclude that the humans are using too many physical resources, and that it would be better to (via legitimate means!) reduce their rights and then cull their numbers, leaving a small population living in something resembling a nature reserve"
    • Perhaps my distaste at this outcome is born in part from loyalty to the human tribe? But I do think that some of it is born from more robust moral intuitions
Matthew_Barnett @ 2024-07-26T22:44 (+2)

I think I basically agree with you, and I am definitely not saying we should just shrug. We should instead try to shape the future positively, as best we can. However, I still feel like I'm not quite getting my point across. Here's one more attempt to explain what I mean.

Imagine we developed a technology that enabled us to build physical robots that were functionally identical to humans in every relevant sense, including their observable behavior and their ability to experience happiness and pain in exactly the same way that ordinary humans do. However, there is just one difference between these humanoid robots and biological humans: they are made of silicon rather than carbon, and they look robotic rather than biological.

In this scenario, it would certainly feel strange to me if someone were to suggest that we should be worried about a peaceful robot takeover, in which the humanoid robots collectively accumulate the vast majority of wealth in the world via lawful means. 

By assumption, these humanoid robots are literally functionally identical to ordinary humans. As a result, I think we should have no intrinsic reason to disprefer them receiving a dominant share of the world's wealth, versus some other subset of human-like beings. This remains true even if the humanoid robots are literally "not human", and thus their peaceful takeover is equivalent to "human disempowerment" in a technical sense.

The ultimate reason I think one should not worry about a peaceful robot takeover in this specific scenario is that these humanoid robots have essentially the same moral worth and right to choose as ordinary humans, and therefore we should respect their agency and autonomy just as much as we already do for ordinary humans. Since we normally let humans accumulate wealth and become powerful via lawful means, I think we should allow these humanoid robots to do the same. I hope you would agree with me here.

Now, generalizing slightly, I claim that to be rationally worried about a peaceful robot takeover in general, you should usually be able to identify a relevant moral difference between the scenario I have just outlined and the scenario that you're worried about. Here are some candidate moral differences that I personally don't find very compelling:

  • In the humanoid robot scenario, there's no possible way the humanoid robots would ever end up killing the biological humans, since they are functionally identical to each other. In other words, biological humans aren't at risk of losing their rights and dying.
    • My response: this doesn't seem true. Humans have committed genocide against other subsets of humanity based on arbitrary characteristics before. Therefore, I don't think we can rule out that the humanoid robots would commit genocide against the biological humans either, although I agree it seems very unlikely.
  • In the humanoid robot scenario, the humanoid robots are guaranteed to have the same values as the biological humans, since they are functionally identical to biological humans.
    • My response: this also doesn't seem guaranteed. Humans frequently have large disagreements in values with other subsets of humanity. For example, China as a group has different values than the United States as a group. This difference in values is even larger if you consider indexical preferences among the members of the group, which generally overlap very little.
Owen Cotton-Barratt @ 2024-07-26T23:19 (+6)

Since we normally let humans accumulate wealth and become powerful via lawful means, I think we should allow these humanoid robots to do the same. I hope you would agree with me here.

I agree with this -- and also agree with it for various non-humanoid AI systems.

However, I see this as less about rights for systems that may at some point exist, and more about our responsibilities as the creators of those systems.

Not entirely analogous, but: suppose we had a large creche of babies whom we had been told by an oracle would be extremely influential in the world. I think it would be appropriate for us to care more than normal about their upbringing (especially if for the sake of the example we assume that upbringing can meaningfully affect character).

Owen Cotton-Barratt @ 2024-07-25T10:16 (+2)

This is great. It's kind of wild that it's attracting downvotes.

Mjreard @ 2024-07-25T17:18 (+1)

The argument for Adequate temporal horizon is somewhat hazier

 

I read you as suggesting we'd be explicit about the time horizons AIs would or should consider, but it seems to me we'd want them to think very flexibly about the value of what can be accomplished over different time horizons. I agree it'd be weird if we baked "over the whole lightcone" into all the goals we had, but I think we'd want smarter-than-us AIs to consider whether the coffee they could get us in five minutes and one second would be much better than the coffee they could get us in five minutes, or whether they could make much more money in 13 months vs. a year.

Less constrained decision-making seems more desirable here, especially if we can just have the AIs report the projected trade-offs to us before they move to execution. We don't know our own utility functions that well, and that's something we'd want AIs to help with, right?
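
To gesture at what that flexibility could look like operationally, here is a small sketch. Everything in it (the Plan structure, the names, the numbers) is my own illustrative assumption rather than anything from the post: the idea is just that the agent scores candidate plans over their own natural horizons and surfaces the trade-offs for sign-off, instead of optimizing against a hard-coded deadline.

```python
# Illustrative sketch: evaluate candidate plans over their own horizons and
# report the trade-offs to the principal before executing anything.

from dataclasses import dataclass


@dataclass
class Plan:
    description: str
    horizon_seconds: float  # how long the plan takes to complete
    value: float            # estimated value to the principal on completion


def report_tradeoffs(plans):
    """Present candidate plans sorted by estimated value, rather than silently
    executing whichever one fits a fixed deadline."""
    for plan in sorted(plans, key=lambda p: p.value, reverse=True):
        print(f"{plan.description}: value={plan.value:.2f}, "
              f"takes {plan.horizon_seconds:.0f}s")


if __name__ == "__main__":
    report_tradeoffs([
        Plan("coffee in five minutes", horizon_seconds=300, value=1.00),
        Plan("much better coffee in five minutes and one second",
             horizon_seconds=301, value=1.50),
    ])
```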

SummaryBot @ 2024-07-25T14:33 (+1)

Executive summary: A framework for analyzing AI power-seeking highlights how classic arguments for AI risk rely heavily on the assumption that AIs will be able to easily take over the world, but relaxing this assumption reveals more complex strategic tradeoffs for AIs considering problematic power-seeking.

Key points:

  1. Prerequisites for rational AI takeover include agential capabilities, goal-content structure, and takeover-favoring incentives.
  2. Classic AI risk arguments assume AIs will meet agential and goal prerequisites, be extremely capable, and have incentives favoring takeover.
  3. If AIs cannot easily take over via many paths, the incentives for power-seeking become more complex and uncertain.
  4. Analyzing specific AI motivational structures and tradeoffs is important, not just abstract properties like goal-directedness.
  5. Strategic dynamics of earlier, weaker AIs matter for improving safety with later, more powerful systems.
  6. Framework helps recast and scrutinize key assumptions in classic AI risk arguments.

 

 

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.