3 levels of threat obfuscation

By Holden Karnofsky @ 2023-08-02T17:09 (+31)

This is a linkpost to https://www.alignmentforum.org/posts/HpzHjKjGQ4cKiY3jX/3-levels-of-threat-obfuscation

One of the biggest reasons alignment might be hard is what I’ll call threat obfuscation: various dynamics that might make it hard to measure/notice cases where an AI system has problematic misalignment (even when the AI system is in a controlled environment and one is looking for signs of misalignment). I write more about this general fear in AI Safety Seems Hard to Measure.

I think it’s very important to understand threat obfuscation dynamics; to try to design ways of assessing danger that are designed with threat obfuscation dynamics in mind; and to avoid interpreting reassuring-seeming evidence too optimistically, by keeping in mind the various threat obfuscation dynamics that could affect it.

I’ve seen three noticeably different threat obfuscation dynamics discussed, in ways that often leave me confused about exactly what dynamic someone is talking about, and when. This post aims to clear that up, based on a conversation with Paul Christiano.

The three dynamics are:

Simple “training game” dynamics: an AI system appears to behave helpfully, rather than attempting deception/manipulation/disempowerment of humans, because (a) it is good at predicting whether such attempts will be caught; (b) it is aiming for good in-episode performance (in terms of loss or reward); (c) such attempts would result in poor in-episode performance if caught. But if an AI system had a chance to do deception/manipulation/disempowerment that would not be detected, and would result in “good performance” within an episode, it would do so (this could include large-scale world takeover).
- These dynamics are dangerous because we might observe dangerous behavior, train AIs to behave less dangerously, then observe less dangerous behavior, and conclude that we’ve become more safe (when in fact we’ve trained the AIs to be more effective at deception).
- But we can still detect dangerous behavior in cases where it would not result in poor in-episode performance. For example, we might decide that when an AI system in the wild hacks its way to good in-episode performance, and is caught after the fact, we will at that point act as if the performance was actually good/intended (for example, no gradient descent against the behavior). Or we might deliberately conduct tests in which AI systems have opportunities to take over digital environments, obtaining high reward (no catch, not a trap, no attempt to "train out" the behavior when it happens.^[1] In these cases, situationally aware AI systems might “show their true colors” with fairly straightforward attempts to draw them out.
Deceptive alignment: an AI system forms some aim (via inner misalignment) other than good in-episode performance. (This could include something like “an AI system that wants to maximize digital representations of reward across the galaxy, and is willing to get poor in-episode performance as part of a longer-term mission for that to happen.”) It then gets good in-episode performance as part of a longer-term scheme to accomplish its other aim.
- In this case, the above techniques for getting an AI to “show its true colors” might^[2] fail.
- However, we could still get an AI to engage in specific behaviors that might be helpful to us and harmful to its goals, by specifically training the AI to behave in these ways. For example, we might want an AI to find security vulnerabilities that we can then patch. Even if doing so would undermine the AI’s longer-term aim, we can use gradient descent to put a lot of pressure on it to find security vulnerabilities anyway. (Related: the “no sandbagging on checkable tasks” hypothesis)
Gradient hacking: this also requires that an AI system has some aim other than good in-episode performance. Gradient hacking is a way for an AI system to specifically “defend itself” against being trained to do things counter to its goals. It might look something like: “An AI system checks its own policy against some reference policy that is good for its goals; the greater the divergence, the more it sabotages its own performance, with the result that gradient descent has trouble getting its policy to diverge from the reference policy.”

Each of these seems more devastating to alignment research than the previous one - and less likely to emerge in early transformative AI systems. I consider the “training game” very likely, and quite devastating; but it leaves us many possible interventions that the less likely “deceptive alignment” and “gradient hacking” don’t.

That was most of what I wanted to cover, but the rest of this post will give some quick notes on (a) why the more severe forms of threat obfuscation seem relatively less likely; (b) how bad things look if we only have to deal with the first one (“training game”).

Training game seems very likely by default, deceptive alignment much less so, gradient hacking still less so

This is pretty superficial and mostly based on a couple of conversations with Paul Christiano. I’m hoping he’ll write a better version of this at some point.

The “training game” mode of threat obfuscation seems like it is very likely by default, because a normal AI training process will probably directly incentivize it. Errors in labeling will directly favor the training game over intended generalization; even with perfect labeling, there’s no particular reason to think that the training game is disfavored compared to the intended generalization (if anything it vaguely/intuitively seems simpler - “do that which gets treated as intended behavior” as opposed to “do that which was intended.”)

By contrast, deceptive alignment seems to face a bigger hurdle. For example, if you train an AI with positive reinforcement for getting cookies:

A deceptively aligned AI has to, every time it’s deciding how to get cookies, go through a “thought process” like: “I am aiming for [thing other than cookies]. But humans want me to get cookies. And humans are in control right now [if not I need to behave differently]. Therefore I should get cookies.”
Contrast this with an AI that just responds to rewards for cookies by going through this thought process: “I’m trying to get cookies.”
The former “thought process” could be noticeably more expensive (compute wise or whatever), in which case heavy optimization would push against it. (I think this is plausible, though I’m less than convinced; the former thought process doesn’t seem like it is necessarily much more expensive conditional on already having a situationally aware agent that thinks about the big picture a lot.)
An additional issue is that deceptive alignment only happens if you get inner misalignment resulting in an AI with some nonindexical “aim” other than in-episode reward. This could happen but it’s another conjunct.
Paul thinks it’s <50% likely (clarification in footnote^[3]) that deceptive alignment is going to be a major issue by default (at least if people make reasonable efforts to avoid the sorts of dynamics discussed in the next section), and other folks whose judgment I’d put a lot of weight on (due to seeming fluent with the concern about deceptive alignment, while also experienced and/or accomplished with ML) seem to generally be in the same camp.

Gradient hacking intuitively seems harder still. The preconditions for it seem to be something like “the preconditions for deceptive alignment, plus the AI figuring out some effective maneuver to execute with the design of its own brain.”

How bad is the situation if straightforward training leads to the training game, but not to deceptive alignment or gradient hacking?

At first glance, this situation doesn’t seem too bad in terms of threat obfuscation. It basically means that measuring an AI’s dangerousness is as simple as “Put it in an environment where doing malicious things is good in-episode performance. A possible example: if it does malicious things to gain control of its environment, simply let it control its environment for a while and don’t do any of the things one would normally do for unintended behavior (such as trying to use gradient descent to correct against it).” This could result in a dangerous, situationally aware AI (e.g., one that has learned a pattern "Do malicious things iff they won't be caught and trained against") going ahead and doing malicious things while being tested, with no particular reason to conceal them.

However, it’s pretty easy for this to go off the rails if people aren’t careful about it. I’d be concerned about a dynamic like this:

AIs frequently have opportunities to behave maliciously and gain resources within an episode. These opportunities look good on their reward models. They take these opportunities.
Humans respond by trying to train out this sort of behavior - e.g., finding incidents like this via audits and news stories, and going back and doing major gradient updates against the corresponding behaviors/reward models, long after the fact.
In so doing, they could effectively be training AIs to build longer-term, more careful models of human reactions to their behavior. For a situationally aware AI, training like this could lead to behavior like “Try to accumulate resources and do things associated with high reward, unless humans might catch it later via an audit; hence, look for opportunities to avoid or prevent or manipulate eventual audits.”
This sort of behavior could end up looking a lot like deceptive alignment: optimizing heavily around factors and events well outside the “episode” as traditionally defined (you could also think of this as effectively extending the “episode” to encompass a very long time period during which humans might do audits and respond to what they see).

To avoid this, we need to be careful with how we handle unexpected incidents of AIs’ behaving maliciously. It won’t always be the right move to straightforwardly “train against” such incidents; it may sometimes be important to do something more like “starting over,” e.g. insisting on a more principled way^[4] of making AIs less prone to malicious behavior.

This situation seems scary:

We won’t necessarily be able to come up with a principled (and competitive) way of driving out malicious behavior.
At least some AI developers (and maybe many) will be doing non-principled things (e.g., the problematic “train out the malicious behavior” thing above); saying “Behold, our systems are safe!”; and wanting to deploy them.
The line between principled and non-principled things doesn’t feel super clear.

But this situation still seems much better than having to deal with deceptive alignment: we do have a tangible, non-interpretability-dependent way of detecting danger if we can be disciplined about it! In particular, a successful “evals” project looks very tractable in this world - you’re basically running the “make rewards available for bad behavior” test, and then auditing training for signs that it’s been explicitly trained against.

I'm not saying this situation is the only one we should plan around. But it is an example of how relatively good luck could put us in a position where some theoretically very difficult aspects of alignment end up being less difficult in reality.

Notes

^{^}
The “catch” would be that we would want to avoid deploying similar AI systems in other contexts. But this is not a concern for the “simple training game” AI that is focused on obtaining good in-episode performance. It is a concern for an AI that has a goal other than good in-episode performance.
^{^}
Though some versions might work - see "Providing unusual incentives ..." here.
^{^}
“Without thinking carefully I’d guess there is a 25% chance that deceptive alignment is a major challenge that arises early enough that we need to deal with it ourselves. Then another 25% that it’s a challenge that someone needs to deal with someday, but is too far in the future to be worth thinking about now e.g. because AI systems will have obsoleted all of our thinking before they are deceptively aligned. And another 25% chance that deceptive alignment just never arises in the real world if you are careful and take the kind of simple precautions that we already understand. And a final 25% where the entire concern was misguided and it never happens in practice with modern ML no matter how careless you are.”
^{^}
Such as via internals-based training.