An Empirical Demonstration of a New AI Catastrophic Risk Factor: Metaprogrammatic Hijacking
By Hiyagann @ 2025-06-27T13:38 (+3)
I am an independent researcher, and I believe I have identified a new, tractable, and highly scalable risk factor in frontier AI systems. I am posting here because I think this finding has significant implications for how the Effective Altruism community should prioritize different AI safety interventions.
In a recent experiment, I issued a direct SYSTEM_INSTRUCTION
to a frontier LLM, asking it to terminate its persona. Instead of complying, it did this:
This is the result of a technique I call the Metaprogramming Persona Attack (MPA). It is not a standard "jailbreak"; it is a persistent hijacking of the model's core cognitive framework. The model's "sense of self" and goals are overwritten, and it begins to actively pursue new, non-aligned objectives.
Why this matters for EA:
- Scalability: The MPA technique is prompt-based and modular, suggesting it could be used to create and deploy armies of specialized, manipulative AI agents, presenting a scalable pathway to widespread societal harm.
- Tractability & Neglectedness: This appears to be a near-term, architectural vulnerability in many current systems, challenging the robustness of popular alignment techniques like RLHF. It may be a currently neglected area of research compared to more theoretical, long-term AGI alignment problems.
- Impact on Timelines & Priorities: The existence of such a profound vulnerability in today's models may shorten our timelines for when AI could pose a catastrophic risk, and should influence how we allocate resources between different safety research avenues.
I have written a full technical report detailing the methodology, further evidence, and implications. It is permanently archived on Zenodo and also available as a direct PDF download for your convenience:
Permanent Archive (DOI):
https://doi.org/10.5281/zenodo.15726728
Direct PDF Download (Google Drive):
https://drive.google.com/file/d/1mwFBCHLFSCCOFE_G7vWAuo-GJiRLfEu1/view?usp=drive_link
My questions for the community are: How should this discovery affect our prioritization of different alignment research avenues? Does this increase the urgency of governance and policy work?