Value fragility and AI takeover

By Joe_Carlsmith @ 2024-08-05T21:28 (+35)

This is a crosspost, probably from LessWrong. Try viewing it there.

Owen Cotton-Barratt @ 2024-08-07T10:14 (+2)

I agree that people often have more intuitions about fragility in unipolar cases vs in multipolar cases. I'm not sure that I fully understand what's going on here, but my current guesses are mostly in the vicinity of your "something good comes from people rubbing up against each other" point, and are something like:

Matthew_Barnett @ 2024-08-05T22:15 (+2)

> In general, to me it seems quite fruitful to examine in more detail whether, in fact, multipolarity of various kinds might alleviate concerns about value fragility. And to those who have the intuition that it would (especially in cases, like Multipolar value fragility, where agent A’s exact values aren’t had by any of agents 1-n), I’d be curious to hear the case spelled out in more detail.

Here's a case that I roughly believe: multipolarity makes it more likely that one's own values will be represented, because it gives each agent the opportunity to literally live in the world and act within it to bring about the outcomes they personally want.

This case is simple enough, and it's consistent with the ordinary multipolarity the world already experiences. Consider an entirely selfish person. Now, divide the world into two groups: the selfish person (call them Group A) and the rest of the world (call it Group B).

Group A and Group B have very different values, even "upon reflection". Group B is also millions or billions of times more powerful than Group A (as it comprises the entire world minus the selfish individual). Therefore, on a naive analysis, you might expect Group B to "take over the world" and then implement its values without any regard whatsoever for Group A's. Indeed, because of the vast power differential, it would be "very easy" for Group B to achieve this world takeover. And such an outcome would be very bad according to Group A's values.

Of course, this naive analysis is flawed, because the real world is multipolar in an important respect: usually, Group B will let Group A (the individual) have some autonomy, and let them receive a tiny fraction of the world's resources, rather than murdering Group A and taking all their stuff. They will do this because of laws, moral norms, and respect for their fellow humans. This multipolarity therefore sidesteps all the issues with value fragility, and allows Group A to achieve a pretty good outcome according to their values.
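To make the intuition concrete, here's a minimal toy sketch (the log-utility assumption, the normalization, and the numbers are purely illustrative, not part of the argument above): if Group A's values have diminishing returns in resources, then keeping a tiny share is only modestly worse for Group A than owning everything, while being left with nothing is catastrophically bad.

```python
import math

def group_a_utility(resources: float) -> float:
    # Illustrative log utility: each extra order of magnitude of resources
    # adds the same amount of value. This is an assumption made up for the
    # sketch, not a claim about anyone's actual values.
    return math.log10(resources) if resources > 0 else float("-inf")

world_resources = 1.0   # normalize the world's total resources to 1
tiny_share = 1e-9       # the "tiny fraction" Group B leaves to Group A

outcomes = {
    "Group B takes over; Group A gets nothing": group_a_utility(0.0),
    "Norm-respecting sharing; Group A keeps a tiny share": group_a_utility(tiny_share),
    "Group A's best case; Group A keeps everything": group_a_utility(world_resources),
}

for label, utility in outcomes.items():
    print(f"{label}: {utility}")
```

The point is only directional: under this toy assumption, the gap between "tiny share" and "everything" is small compared to the gap between "tiny share" and "nothing at all".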

This is also my primary hope with misaligned AI. Even if misaligned AIs are collectively millions or billions of times more powerful than humans (or aligned AIs), I would hope they would still allow humans and aligned AIs some autonomy, leave us alone, and let us receive a sufficient fraction of resources that we can enjoy an OK outcome according to our values.

SummaryBot @ 2024-08-06T14:51 (+1)

Executive summary: Value fragility between AI systems and humans could make AI takeover incentives worse, but commitments to niceness and power-sharing may help defuse adversarial dynamics.

Key points:

  1. Value fragility is the idea that slightly different value systems lead to very different outcomes under extreme optimization. The author distinguishes this from the claim that agents with slightly different values will converge to different values on reflection.
  2. In a multipolar scenario where no single agent has the exact right values, value fragility still implies the future will be low value for any given agent. The author is uncertain whether multipolarity actually alleviates value fragility concerns.
  3. Value fragility is not necessary for arguing that an AI would try to take over the world if doing so is extremely easy. But it could increase AI takeover incentives in cases where takeover is harder, by making the AI expect a human-run world to be low value.
  4. Commitments to "niceness" (intrinsically valuing how things go according to another agent's values) and intentional power-sharing could help defuse adversarial dynamics between AIs and humans that stem from value fragility.

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.