Project ideas: Backup plans & Cooperative AI

By Lukas Finnveden @ 2024-01-04T07:26 (+25)

This is a linkpost to https://lukasfinnveden.substack.com/p/project-ideas-backup-plans-and-cooperative

This is part of a series of lists of projects. The unifying theme is that the projects are not targeted at solving alignment or engineered pandemics but still targeted at worlds where transformative AI is coming in the next 10 years or so. See here for the introductory post.

In this final post, I include two categories of projects (which are related, and each of which I have less to say about than the previous areas). Backup plans for misaligned AI and Cooperative AI.

Backup plans for misaligned AI

When humanity builds powerful AI systems, I hope that those systems will be safe and aligned. (And I’m excited about efforts to make that happen.)

But it’s possible that alignment will be very difficult and that there won’t be any successful coordination effort to avoid building powerful misaligned AI. If misaligned AI will be built and seize power (or at least have the option of doing so), then there are nevertheless certain types of misaligned systems that I would prefer over others. This section is about affecting that.

This decomposes into two questions:

The first is addressed in What properties would we prefer misaligned AIs to have? The second is addressed in Studying generalization & AI personalities to find easily-influenceable properties. (If you’re more skeptical about the second of these than the first, you should feel free to read that section first.)

What properties would we prefer misaligned AIs to have? [Philosophical/conceptual] [Forecasting]

There are a few different plausible categories here. All of them could use more analysis on which directions would be good and which would be bad.

Making misaligned AI have better interactions with other actors

This overlaps significantly with cooperative AI. The idea is that there are certain dispositions that AIs could have that would lead them to have better interactions with other actors who are comparably powerful.

When I say “better” interactions, I mean that the interactions will leave the other actors better off (by their own lights). Especially insofar as this happens via positive-sum interactions that don’t significantly disadvantage the AI system we’re influencing.

Here are some examples of who these “other actors” could be:

So what are these “dispositions” that could lead AIs to have better interactions with other comparably powerful actors? Some candidates are:

AIs that we may have moral or decision-theoretic reasons to empower 

The main difference between this section and the above section is that the motivations in this section are one step closer to being about empowering the AIs for “their own sake” rather than for the sake of someone they interact with. Though it still includes pragmatic/decision-theoretic reasons for why it’s good to satisfy certain AI systems’ values.

One direction to think about is the scheme that Paul Christiano suggests in When is unaligned AI morally valuable?

Another direction is the ideas that I talk about in ECL with AI. Basically:

Another direction is to think about object-level things that humans value, as well as the process that produced our values, and try to get AI systems more inclined to value similar things. I’m somewhat skeptical of this path since human values seem complex, and so I’m not sure what schemes could plausibly make AIs share a significant fraction of human values without us also having the capability of making the AIs corrigible or otherwise safe.[4] But it doesn’t seem unreasonable to think about it more.

(To reiterate what I said above: I think that all of these schemes would be significantly worse than successfully building AI systems that are aligned and corrigible to human intentions.)

Making misaligned AI positively inclined toward us

A final way in which we might want to shape the preferences of misaligned AIs is to make them more likely to care enough about humans to give us a small utopia, instead of killing us. (Even if most of the universe gets used for the AI’s own ends.)

For an AI that cares about all the resources in the universe (in an mostly impartial way), it would be extremely cheap to do this. Our solar system is a negligible fraction of all the resources in the accessible universe. And a surveillance system that prevents humans from competing with the AIs could probably be built cheaply and without interfering too much with human happiness. (For some discussion of this, see my report AGI and lock-in, especially section 8.2.)

I think it’s reasonably likely that this would happen as a result of trade with distant civilizations. Taking that into account, there are 3 broad directions, here:

Insofar as we want the AIs to have some intrinsic concern for us (or at least not to be actively antagonistic towards us), we can also distinguish between interventions that:

For some discussion about whether it’s plausible that AIs could have some intrinsic concern for humans getting what they want (by their own lights), which addresses issues around the “complexity of human values”, I recommend this comment and subsequent thread.

Studying generalization & AI personalities to find easily-influenceable properties [ML] 

Here is a research direction that hasn’t been very explored to date: Study how language models’ generalization behavior / “personalities” seem to be shaped by their training data, by prompts, by different training strategies, etc. Then, use that knowledge to choose training data, prompts, and training strategies that induce the kind of properties that we want our AIs to have.

If done well, this could be highly useful for alignment. In particular: We might be able to find training set-ups which often seem to lead to corrigible behavior.

But notably, this research direction could fail to work for alignment while still being practically able to affect other properties of language models.[5] For example, maybe corrigibility is a really unnatural and hard-to-get property (perhaps for reasons suggested in item 23 of Yudkowsky’s list of lethalities, and formally analyzed here). That wouldn’t necessarily imply that it was similarly hard to modify the other properties discussed above (decision theories, spitefulness, desire for humans to do well by their own lights). So this research direction looks more exciting insofar as we could influence AI personalities in many different valuable ways. (Though more like 3x as exciting than 100x as exciting, unless you have particular views where “corrigibility” is either significantly less likely or less desirable than the other properties.)

What about fine-tuning?

A “baseline” strategy for making AIs behave as you want is to finetune them to exhibit that behavior in situations that you can easily present them with. But if this work is to be useful, it needs to generalize to strange, future situations where humans no longer have total control over their AI systems. We can’t easily present AIs with situations from that same distribution, and so it’s not clear whether fine-tuning will generalize that far.[6]

So while “finetune the model” seems like an excellent direction to explore, for this type of research, you’ll still want to do the work of empirically evaluating when fine-tuning will and won’t generalize to other settings. By varying various properties of the fine-tuning dataset, or other things, like whether you’re doing supervised learning or RL.

Also, insofar as you can find models that satisfy your evaluations without needing to do a lot of “local” search (like fine-tuning / gradient descent), it seems somewhat more likely that the properties you evaluated for will generalize far. Because if you make large changes in e.g. architecture or pre-training data, it’s more likely that your measurements are picking up on deeper changes in the models. Whereas if you use gradient descent, it is somewhat more likely that gradient descent implements a “shallow” fix  that only applies to the sort of cases that you can test.[7]

Of course, the above argument only works insofar as you’re searching for properties you could plausibly get without doing a lot of search. For example, you’d never get something as complex as “human values” without highly targeted search or design. But properties like “corrigibility”, “(lack of) spitefulness”, and “some desire for humans to do well by-their-own-lights” all seem like properties that could plausibly be common under some training schemes.

Ideally, this research direction would lead to a scientific understanding of training that would let us (in advance) identify & pick training processes that robustly lead to the properties that we want. But insofar as we’re looking for properties that appear reasonably often “by default”, one possible backup plan may be to train several models under somewhat different conditions, evaluate all of them for properties that we care about, and deploy the one that does best. (To be clear: this would be a real hail-mary effort that would always carry a large probability of failing, e.g. due to the models knowing what we were trying to evaluate them for and faking it.)

Previous work

An example of previous, related research is Perez et al.’s Discovering Language Model Behaviors with Model-Written Evaluations.

Ways in which this work is relevant for the path-to-impact outlined here:

Further directions that could make this type of research more useful for this path-to-impact.

(Thanks to Paul Christiano for discussion.)

Theoretical reasoning about generalization [ML] [Philosophical/conceptual]

Rather than doing empirical ML research, you could also do theoretical reasoning about what sort of generalization properties and personality traits are more or less likely to be induced by different kinds of training.

For example, it seems a-priori plausible that spiteful preferences are more likely to arise if you (only) train AI systems on zero-sum games.

There has also been some theoretical work on what kind of decision-theoretic behavior is induced by different training algorithms, for example Bell, Linsefors, Oesterheld & Skalse (2021) and Oesterheld (2021).

I think we’ll ultimately want empirical work to support any theoretical hypotheses, here. But theoretical work seems great for generating ideas of what’s important to test.

Cooperative AI

This is an area other people have written about.

Partly due to this, I will write about it in less detail than I’ve written about the other topics. But I will mention a few projects I’d be especially excited about.

The first thing to mention is that some of my favorite cooperative AI projects are variants of the just-previously mentioned topics: Studying generalization & AI personalities to find easily-influenceable properties and figuring out What properties would we prefer misaligned AIs to have? Positively influencing cooperation-relevant properties like (lack of) spitefulness seems great. I won’t go over those projects again, but I think they’re great cooperative AI projects, so don’t be deceived by their lack of representation here.

Similarly, some of the topics under Governance during explosive technological growth are also related to cooperative AI. In particular, the question of How to handle brinkmanship/threats? is very tightly related.

Another couple of promising projects are:

Implementing surrogate goals / safe Pareto improvements [ML] [Philosophical/conceptual] [Governance]

Safe Pareto improvements are an idea for how certain bargaining strategies can guarantee a (weak) Pareto-improvement for all players via preserving certain invariants about what equilibrium is selected while replacing certain outcomes with other, less-harmful outcomes. Surrogate goals are a special case of this, which involves genuinely adopting a new goal in a way that will mostly not affect your behavior, but which will encourage people who want to threaten you to make threats against the surrogate goal rather than your original values. If bargaining breaks down and the threatener ends up trying to harm you, it is better that they act to thwart the surrogate goal than to harm your original values. See here for resources on surrogate goals & safe Pareto improvements.

I think there are some promising empirical projects that can be done here:

Conceptual/theory projects:

AI-assisted negotiation [ML] [Philosophical/conceptual]

One use-case for AI that might be especially nice to differentially accelerate is “AI that helps with negotiation”. Certainly, it would be of great value if AI could increase the frequency and speed at which different parties could come to mutually beneficial disagreements. Especially given the tricky governance issues that might come with explosive growth, which may need to be dealt with quickly.

(This is also related to Technical proposals for aggregating preferences, mentioned in that post.)

I’m honestly unsure about what kind of bottlenecks there are here, and to what degree AI could help alleviate them.

Here’s one possibility. By virtue of AI being cheaper and faster than humans, perhaps negotiations that were mediated by AI systems could find mutually agreeable solutions in much more complex situations. Such as situations with a greater number of interested parties or a greater option space. (This would be compatible with humans being the ones to finally read, potentially opine on, and approve the outcome of the negotiations.)

More speculatively: Perhaps negotiations via AI could also go through more candidate solutions faster because anything an AI said would have the plausible deniability of being an error. Such that you’d lose less bargaining power if your AI signaled a willingness to consider a proposal that superficially looked bad for you.[8]

Implications of acausal decision theory [Philosophical/conceptual]

One big area is: the implications of acausal decision theory for our priorities. This is something that I previously wrote about in Implications of ECL (there focusing specifically on evidential cooperation in large worlds).

But to highlight one particular thing: One potential risk that’s highlighted by acausal decision theories is the risk of learning too much information. This is discussed in Daniel Kokotajlo’s The Commitment Races problem, and some related but somewhat distinct risks are discussed in my post When does EDT seek evidence about correlations? I’m interested in further results about how big of a problem this could be in practice. If we get an intelligence explosion anytime soon, then our knowledge about distant civilizations could expand quickly. Before that happens, it could be wise to understand what sort of information we should be happy to learn as soon as possible vs. what information we should take certain precautions about.

Updateless Decision Theory, as first described here, takes some steps towards solving that problem but is far from having succeeded. See e.g. UDT shows that decision theory is more confusing than ever for a description of remaining puzzles. (And e.g. open-minded updatelessness for a candidate direction to improve upon it).

End

That’s all I have on this topic! As a reminder: it's very incomplete. But if you're interested in working on projects like this, please feel free to get in touch.

Other posts in series: Introductiongovernance during explosive growthepistemicssentience and rights of digital minds.

  1. ^

    Possibly assisted by aligned AIs or tool AIs.

  2. ^

    Maybe some mild desire for retribution (in a way that discourages bad behavior while still being de-escalatory) could be acceptable, or even good. But we would at least want to avoid extreme forms of spite.

  3. ^

    Sufficiently strong versions of this could also drastically reduce motivations to overthrow humans. At least if we’ve done an ok job at promising and demonstrating that we’ll treat digital minds well.

  4. ^

    This path also carries a higher risk of near-miss scenarios.

  5. ^

    Which I mainly care about because it might let us influence misaligned models. But in principle, it’s also possible that we could get intent-alignment via other means, but that we were still happy to have done this research because it lets us influence other properties of the model. But the path-to-impact there is more complicated, because it requires an explanation for why the people who the AI is aligned to aren’t able or willing to elicit that behavior just by asking/training for it. (Yet are willing to implement the training methodology that indirectly favors that behavior.)

  6. ^

    And if we’re specifically looking for ways to affect properties in worlds where alignment fails, then we’re conditioning on being in a world where the simplest “baseline” solutions (such as fine-tuning for good behavior) failed. Accordingly, we should be more pessimistic about simple solutions.

  7. ^

    Possibly via modifying a model that is “playing the training game” to better recognise that it’s being evaluated and to notice what the desired behavior is.

  8. ^

    Also: If there was some information that you wanted to be part of AI bargaining, but that you didn’t want to be communicated to the humans on the other side, you could potentially delete large parts of the record and only keep certain circumscribed conclusions.


Mike Albrecht @ 2024-07-30T19:20 (+5)

Lukas, thanks for pulling together all these notes. To me, "cooperative AI" stands out and might deserve its own page(s). This terminology covers remarkably broad and disparate pursuits. In the words of Dafoe, et al. (mostly of the Cooperative AI Foundation):

This last one seems neglected, in my view, probably because it is an an inherently less straightforward and more interdisciplinary problem to tackle. But it's also arguably the one with the single greatest upside potential. Will MacAskill, in describing “the best possible future” imagines “technological advances… in the ability to reflect and reason with one another”. Already today, there's a wealth of social psychology research on what creates connection and cooperation; these ideas might be implemented at scale, with the help of AI - to help us understand, connect, and achieve things together. In a narrow sense, that might help scientists collaborate. In a bigger sense, it might ultimately reverse societal polarization and help unite humankind, in way that reduces existential risk and increases upside potential more than anything else we could do.

SummaryBot @ 2024-01-04T14:13 (+1)

Executive summary: This post suggests backup plans if AI systems become misaligned, as well as ideas for making AI systems more cooperative.

Key points:

  1. We could study AI generalization to influence properties like lack of spitefulness, even if not full alignment.
  2. Some properties, like lack of spite, may lead misaligned AIs to cooperate more with humans or other AIs.
  3. We could implement "surrogate goals" in AI systems as harmless placeholders that threats could target instead of original goals.
  4. Negotiation-assist AI could help resolve complex situations with many parties and options.
  5. Acausal decision theory suggests learning too much could be risky; we may want caution before expanding knowledge of distant civilizations.

 

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.