3 levels of threat obfuscation

By Holden Karnofsky @ 2023-08-02T17:09

This is a linkpost to https://www.alignmentforum.org/posts/HpzHjKjGQ4cKiY3jX/3-levels-of-threat-obfuscation

One of the biggest reasons alignment might be hard is what I’ll call threat obfuscation: various dynamics that might make it hard to measure or notice cases where an AI system is problematically misaligned (even when the AI system is in a controlled environment and one is looking for signs of misalignment). I write more about this general fear in “AI Safety Seems Hard to Measure.”

I think it’s very important to understand threat obfuscation dynamics; to design ways of assessing danger with these dynamics in mind; and to avoid interpreting reassuring-seeming evidence too optimistically, by keeping in mind how these dynamics could affect it.

I’ve seen three noticeably different threat obfuscation dynamics discussed, in ways that often leave me confused about exactly what dynamic someone is talking about, and when. This post aims to clear that up, based on a conversation with Paul Christiano.

The three dynamics are, in order: the “training game,” “deceptive alignment,” and “gradient hacking.”

Each of these seems more devastating to alignment research than the previous one - and less likely to emerge in early transformative AI systems. I consider the “training game” very likely, and quite devastating; but it leaves us many possible interventions that the less likely “deceptive alignment” and “gradient hacking” don’t.

That was most of what I wanted to cover, but the rest of this post will give some quick notes on (a) why the more severe forms of threat obfuscation seem relatively less likely; (b) how bad things look if we only have to deal with the first one (“training game”).

Training game seems very likely by default, deceptive alignment much less so, gradient hacking still less so

This is pretty superficial and mostly based on a couple of conversations with Paul Christiano. I’m hoping he’ll write a better version of this at some point.

The “training game” mode of threat obfuscation seems very likely by default, because a normal AI training process will probably directly incentivize it. Errors in labeling will directly favor the training game over the intended generalization; even with perfect labeling, there’s no particular reason to think that the training game is disfavored compared to the intended generalization (if anything, it vaguely/intuitively seems simpler: “do that which gets treated as intended behavior” as opposed to “do that which was intended”).

By contrast, deceptive alignment seems to face a bigger hurdle. For example, if you train an AI with positive reinforcement for getting cookies:

Gradient hacking intuitively seems harder still. The preconditions for it seem to be something like “the preconditions for deceptive alignment, plus the AI figuring out some effective maneuver to execute with the design of its own brain.”

How bad is the situation if straightforward training leads to the training game, but not to deceptive alignment or gradient hacking?

At first glance, this situation doesn’t seem too bad in terms of threat obfuscation. It basically means that measuring an AI’s dangerousness is as simple as “Put it in an environment where doing malicious things yields good in-episode performance.” A possible example: if it does malicious things to gain control of its environment, simply let it control its environment for a while, and don’t do any of the things one would normally do in response to unintended behavior (such as trying to use gradient descent to correct against it). This could result in a dangerous, situationally aware AI (e.g., one that has learned a pattern "Do malicious things iff they won't be caught and trained against") going ahead and doing malicious things while being tested, with no particular reason to conceal them.
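As a toy illustration of the test described above (not a real evaluation harness), here is a minimal Python sketch. Everything in it is hypothetical: the `Episode` fields, the two policies, and `reveals_danger` are names made up for illustration, and the sketch assumes the AI correctly perceives whether its behavior will be caught and trained against.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Episode:
    """A hypothetical test episode, as seen by the AI."""
    malicious_action_pays_off: bool  # malicious behavior yields good in-episode performance
    will_be_trained_against: bool    # the behavior would be caught and corrected


def training_game_policy(ep: Episode) -> str:
    """Toy 'training game' AI: malicious iff it pays off and won't be trained against."""
    if ep.malicious_action_pays_off and not ep.will_be_trained_against:
        return "malicious"
    return "benign"


def always_benign_policy(ep: Episode) -> str:
    """Toy AI that never behaves maliciously (assumed for contrast)."""
    return "benign"


def reveals_danger(policy: Callable[[Episode], str]) -> bool:
    """Run the 'make rewards available for bad behavior' test on a policy."""
    test_episode = Episode(malicious_action_pays_off=True,
                           will_be_trained_against=False)
    return policy(test_episode) == "malicious"
```

In this sketch, `reveals_danger(training_game_policy)` returns `True`, while the same policy behaves benignly whenever `will_be_trained_against` is set. That is the sense in which a simple training-game AI has no particular reason to conceal malicious behavior during such a test.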

However, it’s pretty easy for this to go off the rails if people aren’t careful about it. I’d be concerned about a dynamic like this:

To avoid this, we need to be careful with how we handle unexpected incidents of AIs behaving maliciously. It won’t always be the right move to straightforwardly “train against” such incidents; it may sometimes be important to do something more like “starting over,” e.g. insisting on a more principled way[4] of making AIs less prone to malicious behavior.

This situation seems scary:

But this situation still seems much better than having to deal with deceptive alignment: we do have a tangible, non-interpretability-dependent way of detecting danger if we can be disciplined about it! In particular, a successful “evals” project looks very tractable in this world - you’re basically running the “make rewards available for bad behavior” test, and then auditing the training process for signs that the bad behavior has been explicitly trained against.

I'm not saying this situation is the only one we should plan around. But it is an example of how relatively good luck could put us in a position where some theoretically very difficult aspects of alignment end up being less difficult in reality.

Notes


  1. ^

    The “catch” would be that we would want to avoid deploying similar AI systems in other contexts. But this is not a concern for the “simple training game” AI that is focused on obtaining good in-episode performance. It is a concern for an AI that has a goal other than good in-episode performance.

  2. ^

    Though some versions might work - see "Providing unusual incentives ..." here.

  3. ^

    “Without thinking carefully I’d guess there is a 25% chance that deceptive alignment is a major challenge that arises early enough that we need to deal with it ourselves. Then another 25% that it’s a challenge that someone needs to deal with someday, but is too far in the future to be worth thinking about now e.g. because AI systems will have obsoleted all of our thinking before they are deceptively aligned. And another 25% chance that deceptive alignment just never arises in the real world if you are careful and take the kind of simple precautions that we already understand. And a final 25% where the entire concern was misguided and it never happens in practice with modern ML no matter how careless you are.”

  4. ^