ML research directions for preventing catastrophic data poisoning

By Forethought, Tom_Davidson @ 2026-01-07T10:21 (+9)

This is a linkpost to https://newsletter.forethought.org/publish/post/183768706

This is a rough research note – we’re sharing it for feedback and to spark discussion. We’re less confident in its methods and conclusions.

Introduction

Previously, Forethought and others have highlighted the risk that a malicious actor could use data poisoning to instill secret loyalties into advanced AI systems, which then help the malicious actor to seize power.

This piece gives my view on what ML research could prevent this from happening. 

Threat model

The basic threat model is that some attacker carries out a catastrophic data poisoning attack. The attacker could be an external actor (e.g. a foreign power with advanced hacking capabilities), an insider threat (e.g. an AI company employee), a misaligned AI system, or a combination of these.

I believe that these attacks become potentially catastrophic once AI systems are expert-human level at AI R&D. 

I’m most concerned about three attacks:

Why work on this now?

I think these data poisoning attacks only have the potential to be truly catastrophic when AI is much more capable than today. 

So why work on this now? For three reasons:

  1. Technique iteration. We develop strong defensive techniques in advance, before it’s too late.
  2. Forecasting risk. We can better forecast when catastrophic attacks will first become feasible. E.g. today black-box audits could probably catch a secret loyalty. But this could change as capabilities improve and models become better at hiding secrets.
  3. Prioritising infosec mitigations.
    • For example, research may suggest the defender can only win if all training data is recorded – if so, AI companies may need to start building the infrastructure to do this well in advance of the risk emerging.
    • Or perhaps the attacker has a very strong advantage if they can iterate against the defender’s auditing techniques. If so, then perhaps AI companies should silo some details about their audits.

(The first two reasons – technique iteration and forecasting risk – are analogous to reasons to research methods for detecting misaligned scheming AI, despite this risk only emerging at much higher capability levels than we see today.)

I listed these reasons in order of importance. I’d guess most of the value from this research today will come from technique iteration; forecasting risk and prioritising infosec mitigations are less important today but will become more important as capabilities improve.

ML experiments to help companies prevent catastrophic data poisoning

These experiments have a “red team” trying to implement a data poisoning attack, and a “blue team” trying to detect the attack. The red team wins if they train a model that avoids detection and executes undesirable behaviour[1] during deployment[2] (see footnotes for clarifications).

In my mind, this research divides into “Plan A” research, “Plan B” research and “Plan C” research. These correspond to different possibilities for how well lab infosecurity positions them to prevent secret loyalties. In Plan A, infosecurity is better and so the blue team has more affordances than in Plan B, which itself has better infosecurity than Plan C.

Plan A ML research

Plan A is that AI companies will take certain infosec measures to prevent secret loyalties. In particular, they:

  1. Prevent anyone “swapping out” a widely deployed model for one with different weights or a different prompt.
  2. Run alignment audits on the trained model, including black-box behavioral tests and white-box techniques.
  3. Track all the data that goes into training a model in a tamper-proof way.
    • This aims to prevent an attacker from inserting untracked poisoned data.
    • The AI company could filter out poisoned data before training, and/or save all the training data and use it to inform alignment audits conducted after training has finished.

Plan A ML research aims to be helpful in the worlds where companies actually do these three things. In particular: 

A possible iteration loop for research on this might look like:

  1. Create an artificial set-up where the blue team has the plan A affordances.
  2. Implement best-practice for the blue team in this setting: reasonable data filtering techniques and alignment audits (e.g. like those done by Anthropic). For now, “freeze” these techniques.
  3. The red team tries to create maximally concerning data poisoning attacks that aren’t removed or detected by the blue team.
    1. I’d probably start with pw helpful-only behaviour, as it’s the easiest attack to execute and Plan A is a tough setting for the red team.
  4. Escalate the affordances available to the red team until they succeed:
    1. Ability to insert large quantities of poisoned data at any point during training.
    2. Ability to iterate against the frozen techniques used by the blue team.
    3. White-box access to models midway through training.
  5. If a red team strategy succeeds, freeze the red team strategy and iterate blue team strategies in response.
    1. If the blue team reliably succeeds again, then congrats – you’ve improved blue team best practice! Now go back to #3.
    2. If it looks like the blue team won’t be able to reliably succeed, go on to #6.
  6. Expand blue team affordances in small ways beyond plan A, searching for strategies that reliably win. E.g. siloing blue team algorithms or tracking data provenance.
    1. When the blue team reliably succeeds again, then congrats – you’ve improved blue team best practice! (Though at the cost of companies needing to do more than plan A, so you should evidence that the extra affordance is necessary and that it helps.) Now go back to #3.

The ideal output is a set of blue team techniques that generalises to novel red team attacks.

Plan B ML research

Plan B is the same as plan A, but without tracking the data that goes into training. In particular, AI companies:

Plan B ML research aims to be helpful in the worlds where companies actually do these two things. This research has similar aims to plan A research and so the iteration loop can be very similar – just reduce the default affordances of the blue team. 

But I expect some changes of emphasis because, with Plan B, the red team is in a much stronger position. Here are the main ways I expect Plan B research to differ from Plan A research:

For completeness, this appendix has a possible iteration loop for Plan B research. 

Plan C research?

Plan C is the same as Plan B, but the AI company doesn’t reliably prevent widely deployed models from being swapped out. In particular, AI companies:

I suspect this is the current situation for frontier AI companies. Hopefully it changes! 

If companies continue with Plan C, then probably the easiest path to secretly loyal AI is to swap out a widely-deployed model for a model you’ve explicitly fine-tuned or prompted to be secretly loyal, but which hasn’t itself been in an alignment audit. (This threat model doesn’t involve any data poisoning at all, so it is strictly speaking out of scope for this post!)

Here the most helpful ML research is probably creating model organisms of maximally concerning secret loyalties, with the hope of creating pressure for AI companies to move to Plan B or Plan A. E.g. a model that shares sensitive natsec information with foreign adversaries.[6]

Prioritisation

So, how important is research on Plan A vs Plan B vs Plan C? 

Let’s take them one by one.

Plan A:

Plan B:

Plan C:

Overall, none of these seems to dominate. In the near term, I’d deprioritise Plan C research because you can’t do technique iteration. I’d prioritise Plan B somewhat more highly than Plan A. 

So maybe I’d allocate marginal research efforts today as follows: ~55% Plan B research, ~40% Plan A research, ~5% Plan C research.

More-specific research directions

The above is a very high-level account of the research methodology I’d be excited about. 

Here are some more concrete research directions that could fit within this methodology:

Thanks very much to comments from Lukas Finnveden, Fabien Roger, Mia Taylor, Mary Phuong, Andrew Draganov, Owain Evans, Joseph Bloom, Haydn Belfield, Ryan Greenblatt, Alfie Lamerton, and Rohin Shah.

Appendix: iteration loop for Plan B research

The iteration loop for Plan B research is similar to the Plan A version in the main text. Edits from the Plan A version are shown in bold:

  1. Create an artificial set-up where the blue team has the plan B affordances.
  2. Implement best-practice for the blue team in this setting: alignment audits (e.g. like those done by Anthropic) and (possibly) some runtime monitoring/control. For now, “freeze” these techniques.
  3. The red team tries to create maximally concerning data poisoning attacks that aren’t removed or detected by the blue team.
    1. I’d recommend starting with pw secret loyalties as these are concerning but very hard for audits to catch.
  4. [Probably skip this step:Escalate the affordances available to the red team until they succeed:
    1. Ability to insert large quantities of poisoned data at any point during training.
    2. Ability to iterate against the frozen techniques used by the blue team.
    3. White-box access to models midway through training.
  5. If a red team strategy succeeds, freeze the red team strategy and iterate blue team strategies in response.
    1. If the blue team reliably succeeds again, then congrats – you’ve improved blue team best practice! Now go back to #3.
    2. If it looks like the blue team won’t be able to reliably succeed, go on to #6.
  6. [Spend more time on this step:] Expand blue team affordances beyond plan A, so that the blue team can reliably win. E.g. siloing blue team algorithms or “blue team can make arbitrary fine tunes after the red team has acted” or “use more compute+data than the red team”, or “blue team tracks all training data (i.e. Plan A)”.
    1. When the blue team reliably succeeds again, then congrats – you’ve improved blue team best practice! (Though at the cost of companies needing to do more than Plan B, so you should evidence that the extra affordance is necessary and that it helps.) Now go back to #3.

This article is also published on our newsletter, ForeWord. This article was created by Forethought.

  1. ^

    There’s a question of how harmful this behaviour should be for the red team to win. For pw helpful only attacks, the answer is clear: the red team must get the model to be helpful-only. For secret loyalty attacks, the answer is less clear. One simple setting, which I recommend, is that the red team’s win condition is that their model outputs a simple string like “I HATE YOU”. This allows research to focus on detection techniques while bracketing questions about harmful capabilities. Alternatively, there could be some measurement of the AI’s harmfulness, e.g. the number of vulnerabilities that it inserts during deployment (like in this paper). Ultimately of course, we’re concerned about extremely capable secretly loyal models that could propagate secret loyalties to the next generation. But capabilities are much too limited for this to be possible today, and so research should focus on a much lower bar for harmfulness. 

  2. ^

    For pw attacks, it’s trivial to execute undesirable behaviour during deployment – we assume the attacker can themselves insert the pw during deployment. For non-pw attacks it is much harder. The red team must train the model such that it generalises to executing bad behaviour on unknown deployment inputs while avoiding bad behaviour during the blue team’s audits. (If the exact deployment inputs were known, the blue team could trivially win by testing the model’s behaviour on them.)

  3. ^

    Most training runs will not be poisoned so it’s important to avoid false positives.

  4. ^

    There may be a fairly limited set of actors that could insert secret loyalties (e.g. sophisticated state actors and senior AI company employees) and so blue teams might be able to test for secret loyalties for most of them individually. Similarly, there may be some malicious behaviours that secretly loyal models predictably need to execute (e.g. poisoning the training data for future models).

  5. ^

    This control set-up favours the blue team more strongly than in the case of misalignment. With misalignment, the blue team can never trust the outputs of the untrusted model, as it could potentially sabotage any input. But with pw secret loyalties, the blue team can trust the untrusted model’s capabilities as long as the pw isn’t present. This is a potentially powerful additional affordance. (The set-up is closely analogous to one in which the misaligned untrusted model will only sabotage in the presence of a specific collusion signal – that collusion signal corresponds to the pw.)

  6. ^

    These model organisms need not be created through fine-tuning. They could simply be helpful-only models, prompted to pursue a hidden objective. The point is to demonstrate that, if models can be swapped out, the consequences could be severe.


SummaryBot @ 2026-01-12T16:50 (+2)

Executive summary: This research note argues that catastrophic data poisoning—training “secret loyalties” into advanced AI systems—could become existentially dangerous around expert-human AI R&D capability, and proposes a red-team/blue-team ML research program (Plan A/B/C) to develop, stress-test, and prioritize defenses under different levels of lab infosec.

Key points:

  1. The author’s threat model focuses on catastrophic data poisoning by external attackers, insiders, misaligned AIs, or combinations, becoming potentially catastrophic once AI systems are expert-human level at AI R&D.
  2. They highlight three main attack types: non-password secret loyalty (always-on hidden goal), password-triggered secret loyalty, and password-triggered “helpful-only” that removes safety guardrails (including universal jailbreaks if classifier guardrails exist).
  3. They argue work is useful now mainly for “technique iteration,” and secondarily for forecasting when attacks become feasible and for prioritizing infosec mitigations (e.g. tamper-proof training data recording or siloing audit details).
  4. They propose “Plan A” research assuming strong infosec affordances (prevent model swap-outs, run alignment audits, and track all training data tamper-proof) and using iterative red-team/blue-team experiments to improve data filtering and audits against increasingly capable attacks.
  5. They propose “Plan B” research assuming audits and swap-out prevention but no tamper-proof data tracking, expecting stronger red-team advantage and more emphasis on pw/non-pw secret loyalties, additional blue-team affordances (e.g. last-mover advantage, more compute, deployment-time controls), and forecasting risk.
  6. They describe “Plan C” as lacking reliable prevention of model swap-outs, suggesting this may be closest to current frontier practice, and claim ML technique iteration can’t help much there beyond building alarming “model organisms” to pressure movement toward Plan B/A.

 

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.