Does AI risk “other” the AIs?
By Joe_Carlsmith @ 2024-01-09T17:51 (+22)
This is a crosspost from LessWrong; the full post can be read there.
Matthew_Barnett @ 2024-01-10T22:26 (+11)
(Cross-posted from LW)
I think there's an additional element of Hanson's argument that is both likely true and important, and as far as I can tell unaddressed in your post. When Hanson talks about "othering" AIs, he's often talking about the stuff you mentioned — projecting a propensity to do bad things onto the AIs — but he's also saying that future AIs won't necessarily form a natural, unified coalition against us. In other words, he's rejecting a form of out-group homogeneity used to portray AIs.
As an analogy: among humans, the class of "smart people" is not a natural coalition, even though smart people could in principle band together and defeat all of the non-smart people in a head-to-head fight. Why don't they do that? Well, one reason is that smart people don't usually see themselves as part of a coherent identity that includes all the other smart people. Put poetically, there isn't much class consciousness among smart people as a unified group. They have diverse motives and interests that wouldn't be furthered much by attempting to join such a front. The argument Hanson makes is that AIs, likewise, will not form a natural, unified front against humans. The relevant boundaries in future conflicts over power will likely be drawn along other lines.
The idea that AIs won't form a natural coalition has a variety of implications for the standard AI risk arguments. Most notably, it undermines the single-agent model that underlies many takeover stories and arguments for risk. More specifically, if AIs won't form a natural coalition, then:
- We shouldn't model a bunch of AIs (or even a bunch of copies of the same model) as all basically being "a single AI". For instance, in Ajeya's story about AI takeover, she alternates between calling a single copy of an AI model an entity named "Alex" and calling the entire coalition of copies of the same model "Alex". However, in the absence of any unified coalition among the AIs, this conflation makes much less sense.
- It will be hard for AIs to coordinate a violent takeover for basically the same reason it's hard for humans to coordinate one. To coordinate a violent plan, you generally need to alert other people to your intentions. But since those people might not share your intentions, each person you inform is a chance for your plan to be exposed, and thus ended (see the toy sketch after this list). The same applies to AIs that try to inform other AIs about their intentions.
- If an AI is trained to catch deception in other AIs, there isn't a strong reason to assume it will defect from its training and join the other AI in deceiving the humans, because it won't necessarily see itself as "an AI" fighting against "the humans".
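As a toy sketch of the exposure point above (assuming, purely for illustration, that each informed agent independently exposes the plan with some probability $p$): after recruiting $n$ co-conspirators, the plan stays secret with probability

$$(1 - p)^n,$$

which shrinks exponentially as the conspiracy grows. For instance, with $p = 0.05$ and $n = 20$, the plan survives recruitment with probability $0.95^{20} \approx 0.36$.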
In my opinion, these examples only scratch the surface of the ways in which your picture of AI risk might depart from the classic analysis if you don't think AIs will form a natural, unified coalition. When you start to read standard AI risk stories (including from people like Ajeya who do not agree with Eliezer on a ton of things), you can often find the assumption that "AIs will form a natural, unified coalition" written all over them.
Nick K. @ 2024-01-11T16:29 (+1)
I'm just noting that you are assuming we have many robustly aligned AIs, in which case I agree that takeover seems less likely.
Absent this assumption, I don't think that "AIs will form a natural, unified coalition" is the necessary outcome, but it seems reasonable that the other outcomes will look functionally the same for us.
Vasco Grilo @ 2024-01-14T15:27 (+2)
Nice post, Joe.
> For two: clearly some sorts of otherness warrant some sorts of fear. For example: maybe you, personally, don't like to murder. But Bob, well: Bob is different. If Bob gets a bunch of power, then: yep, it's OK to hold your babies close. And often OK, too, to try to "control" Bob into not-killing-your-babies. Cf, also, the discussion of getting-eaten-by-bears in the first essay. And the Nazis, too, were different in their own way. Of course, there's a long and ongoing history of mistaking "different" for "the type of different that wants to kill your babies." We should, indeed, be very wary. But liberal tolerance has never been a blank check; and not all fear is hatred.
I think the strength of this point may be misleading. The word "murder" carries a negative connotation, so if one hears that "a murderer got lots of power", one should reasonably expect more bad outcomes. However, although killing humans is almost always bad, it is not necessarily bad. For example, I think it would be totally fine to kill a terrorist to prevent the deaths of, say, 1 million people. To assess the value of a given action, we have to ask what is better from the point of view of the universe. Once one does that, an AI killing all humans is no longer necessarily bad, but neither is it good for free.