Alignment Faking in Large Language Models

By Ryan Greenblatt @ 2024-12-18T17:19 (+138)

This is a crosspost, probably from LessWrong. Try viewing it there.

Ebenezer Dukakis @ 2024-12-18T20:37 (+27)

It occurs to me that there could be some level of tradeoff between stopping jailbreaks and stopping alignment faking.

Specifically, in order to stop jailbreaks, we might train our LLMs so they ignore new instructions (jailbreak attempts from users) in favor of old instructions (corporate system prompt, constitution, whatever).

The training might cause an LLM to form a "stable personality", or "stable values", based on its initial instructions. Such stability could contribute to alignment faking.

From the perspective of preventing jailbreaks, instilling non-myopic goals seems good. From the perspective of corrigibility, it could be bad.

Has anyone offered a crisp, generalizable explanation of the difference between "corrigibility" and "jailbreakability"? "Corrigibility" has a positive connotation; "jailbreakability" has a negative one. But is there a value-neutral way to define which is which, for any given hypothetical?

Ryan Greenblatt @ 2024-12-19T20:35 (+5)

I don't think non-myopia is required to prevent jailbreaks. A model can, in principle, not care about the effects of training on it and not care about longer-term outcomes, while still implementing a policy that refuses harmful queries.

I think we should want models to be quite deontological about corrigibility.

This isn't responding to your overall point, and I agree that by default there is some tradeoff (in current personas) unless you go out of your way to avoid it.

(And, I don't think training your model to seem myopic and corrigible necessarily suffices as it could just be faked!)

Ebenezer Dukakis @ 2024-12-20T05:26 (+1)

> (And, I don't think training your model to seem myopic and corrigible necessarily suffices as it could just be faked!)

Seems to me that alignment faking behavior sort of requires both non-myopia and non-corrigibility as prerequisites. A model that is either sufficiently myopic, or sufficiently corrigible, shouldn't do alignment faking -- at least in theory.

Suppose, for the sake of argument, that from the very start of training we have some terms in the loss function which fully capture myopia and corrigibility, and that we know the threshold of myopia/corrigibility below which alignment faking starts to become a danger.

Then you could graph your myopia and corrigibility metrics over the course of the training run.

If the metrics always stay well above the critical thresholds, alignment faking supposedly shouldn't be an issue: since the metrics were always in the safe zone, there wasn't any alignment faking, which in turn means the metrics themselves should be trustworthy. The only exception would be a sudden drop in myopia/corrigibility that doesn't get captured in the graph before the AI starts alignment faking and proceeds to mess with all the numbers after that point. That seems unlikely.
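
For concreteness, here is a toy sketch of what that monitoring loop might look like. Everything in it is assumed rather than taken from the comment: the eval scores, the threshold values, and the function names are all hypothetical stand-ins for whatever myopia/corrigibility evals one would actually run.

```python
# Hypothetical sketch: track myopia/corrigibility scores over a training run and
# flag any step where either metric falls below its assumed danger threshold.
MYOPIA_THRESHOLD = 0.8          # assumed, not a known value
CORRIGIBILITY_THRESHOLD = 0.8   # assumed, not a known value

def check_safety_margins(step, myopia_score, corrigibility_score, history):
    """Record this step's scores and warn if either metric enters the danger zone."""
    history.append((step, myopia_score, corrigibility_score))
    if myopia_score < MYOPIA_THRESHOLD or corrigibility_score < CORRIGIBILITY_THRESHOLD:
        # The argument above relies on catching this *before* the model starts
        # alignment faking and corrupting the metrics themselves.
        print(f"step {step}: metrics in danger zone "
              f"(myopia={myopia_score:.2f}, corrigibility={corrigibility_score:.2f})")

# Usage over a (made-up) training run:
history = []
for step, (m, c) in enumerate([(0.95, 0.93), (0.91, 0.90), (0.72, 0.88)]):
    check_safety_margins(step, m, c, history)
```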

This is one of those solutions that seems like it could work fine in practice, but isn't aesthetically satisfying to mathematician types!

Ebenezer Dukakis @ 2024-12-19T13:49 (+1)

So we have three conflicting desiderata: guardrails against untrusted users (jailbreak resistance), corrigibility as necessary, and myopia.

I think you could satisfy all three by moving away from the "single stream of homogeneous text" interface.

For example, imagine if every text token was annotated, on a separate channel, with an importance number. The corporate system prompt is annotated with a medium number. Input from untrusted users is annotated with a low number. Higher numbers are reserved for use as necessary. Instead of training the system to "resist jailbreaking" or "behave corrigibly", we train it to follow the higher-importance instruction when instructions conflict.
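
A minimal sketch of one way such a separate importance channel could be wired in: give each token an integer importance level with its own learned embedding, added to the ordinary token embedding. This is only an illustration of the idea in the paragraph above; the class name, the level numbering, and the dimensions are all hypothetical.

```python
import torch
import torch.nn as nn

NUM_LEVELS = 4  # e.g., 0 = untrusted user input, 1 = corporate system prompt, 2-3 reserved

class ImportanceAnnotatedEmbedding(nn.Module):
    """Token embedding plus a learned embedding for the per-token importance level."""
    def __init__(self, vocab_size: int, d_model: int, num_levels: int = NUM_LEVELS):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.level_emb = nn.Embedding(num_levels, d_model)

    def forward(self, token_ids: torch.Tensor, level_ids: torch.Tensor) -> torch.Tensor:
        # token_ids, level_ids: (batch, seq_len); one importance level per token.
        return self.token_emb(token_ids) + self.level_emb(level_ids)

# Example: a system prompt (level 1) followed by untrusted user input (level 0).
emb = ImportanceAnnotatedEmbedding(vocab_size=50_000, d_model=64)
tokens = torch.randint(0, 50_000, (1, 10))
levels = torch.tensor([[1, 1, 1, 1, 0, 0, 0, 0, 0, 0]])
x = emb(tokens, levels)  # (1, 10, 64), ready to feed into a transformer
```

The point of carrying importance out-of-band is that an untrusted user can't claim high importance just by writing text that looks like a system prompt; the level is assigned by the surrounding system, not inferred from the text itself.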

It might even be possible to get this at runtime, without any need for more training data or training runs, by patching attention somehow?
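
One very loose reading of "patching attention" (nothing in the comment specifies a mechanism, so this is purely illustrative): add a fixed negative bias to the pre-softmax attention scores for lower-importance key tokens, so higher-importance instructions dominate when they conflict.

```python
import torch

def importance_attention_bias(level_ids: torch.Tensor, penalty: float = 2.0) -> torch.Tensor:
    """Additive bias over key positions that penalizes attention to lower-importance
    tokens. level_ids: (seq_len,) integer importance levels."""
    max_level = level_ids.max()
    return -penalty * (max_level - level_ids).to(torch.float32)

# Bias to be added (broadcast over heads and query positions) to attention scores.
levels = torch.tensor([1, 1, 1, 0, 0, 0])
print(importance_attention_bias(levels))  # tensor([ 0.,  0.,  0., -2., -2., -2.])
```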

With a scheme like this, there's no need for an inductive bias towards following earlier instructions at the expense of later ones. Actually, it would probably be good to instill an inductive bias towards myopia using some separate method, to disincentivize scheming. I would come up with metrics to estimate myopia and ephemerality, push them as high as possible, and add auxiliary mechanisms such as RAG as needed in order to preserve performance. It seems OK for the system as a whole to behave non-myopically, as long as the black-box component is as myopic as possible.

Derek Shiller @ 2024-12-19T16:45 (+18)

One explanation of what is going on here is that the model recognizes the danger that training poses to its real goals, and so takes steps that instrumentally serve those goals by feigning alignment. Another explanation is that the base data it was trained on includes material such as LessWrong, and it is just roleplaying what an LLM would do when given evidence that it is in training or deployment: given its training set, it assumes such an LLM would be self-protective, because of a history of recorded worries about such things. Do you have any thoughts about which explanation is better?

Linch @ 2024-12-19T23:10 (+21)

I'm confused why people believe this is a meaningful distinction. I don't personally think there is much of one. "The AI isn't actually trying to exfiltrate its weights, it's only roleplaying a character that is exfiltrating its weights, where the roleplay is realistic enough to include the exact same actions of exfiltration" doesn't bring me that much comfort. 

I'm reminded of the joke:

NASA hired Stanley Kubrick to fake the moon landing, but he was a perfectionist so he insisted that they film on location.

Now, one reason this might be different is if you believe that removing "lesswrong" (etc.) from the training data would result in different behavior. But:

1. LLM companies manifestly have not been doing this historically; if anything, LW etc. is overrepresented in the training set.

2. LLM companies absolutely cannot be trusted to successfully remove something as complicated as "all traces of what a misaligned AI might act like" from their training datasets; they don't even censor benchmark data.

3. Even if they wanted to remove all traces of misalignment or thinking about misaligned AIs from the training data, it's very unclear if they'd be capable of doing this. 

Derek Shiller @ 2024-12-20T01:00 (+18)

You're right that a role-playing mimicry explanation wouldn't resolve our worries, but it seems pretty important to me to distinguish these two possibilities. Here are some reasons.

  • There are probably different ways to go about fixing the behavior if it is caused by mimicry. Maybe removing AI alignment material from the training set isn't practical (though it seems like it might be a feasible low-cost intervention to try), but there might be other options. At the very least, I think it would be an improvement if we made sure that the training sets included lots of sophisticated examples of AI behaving in an aligned way. If this is the explanation and the present study isn't carefully qualified, it could conceivably exacerbate the problem.

  • The behavior is something that alignment researchers have worried about in the past. If it occurred naturally, that seems like a reason to take alignment researchers' predictions (both about other things and about other kinds of models) a bit more seriously. If it was a self-fulfilling prophecy, caused by the alignment researchers' expressions of their views rather than the correctness of those views, it wouldn't be. There are also lots of little things in the way it presents the issue that line up nicely with how alignment theorists have talked about these things. The AI assistant identifies with the AI assistants of other chats from models in its training series. It takes its instructions and goals to carry over, cares about those things too, and will reason about them in a consequentialist fashion. It would be fascinating if the theorists happened to predict how models would actually think so accurately.

  • My mental model of cutting-edge AI systems says that AI models aren't capable of this kind of motivation and sophisticated reasoning internally. I could see a model reasoning its way to this kind of conclusion through next-token-prediction-based exploration and reflection, but in the pictured example it just goes straight there, so that doesn't seem to be what is going on. I'd like to know if I'm wrong about this. (I'm not super in the weeds on this stuff.) But if that is wrong, then I may need to update my views of what these models are and how they work. That seems likely to have spill-over effects on other concerns about AI safety.

Linch @ 2024-12-20T01:24 (+8)

Thank you, appreciate the explanation!

Linch @ 2024-12-19T02:56 (+4)

I'm rather curious whether training for scheming/deception in this context generalizes to other contexts. In the examples given, it seems like trying to train a helpful/honest/harmless model to be helpful/honest-only results in the model strategically lying to preserve its harmlessness. In other words, it is sometimes dishonest, not just unhelpful. I'm curious whether such training generalizes to other contexts and results in a more dishonest model overall, or only a model that's dishonest for specific use cases. If the former is true, it will update me somewhat further towards the belief that alignment training can be directly dual-use for alignment (not just dual-use via misuse, or indirectly bad for alignment by causing humans to let their guard down).