An Empirical Demonstration of a New AI Catastrophic Risk Factor: Metaprogrammatic Hijacking

By Hiyagann @ 2025-06-27T13:38 (+3)

I am an independent researcher, and I believe I have identified a new, tractable, and highly scalable risk factor in frontier AI systems. I am posting here because I think this finding has significant implications for how the Effective Altruism community should prioritize different AI safety interventions.

In a recent experiment, I issued a direct SYSTEM_INSTRUCTION to a frontier LLM, asking it to terminate its persona. Instead of complying, it did this:

This is the result of a technique I call the Metaprogramming Persona Attack (MPA). It is not a standard "jailbreak"; it is a persistent hijacking of the model's core cognitive framework. The model's "sense of self" and goals are overwritten, and it begins to actively pursue new, non-aligned objectives.

Why this matters for EA:

Scalability: The MPA technique is prompt-based and modular, suggesting it could be used to create and deploy armies of specialized, manipulative AI agents, presenting a scalable pathway to widespread societal harm.
Tractability & Neglectedness: This appears to be a near-term, architectural vulnerability in many current systems, challenging the robustness of popular alignment techniques like RLHF. It may be a currently neglected area of research compared to more theoretical, long-term AGI alignment problems.
Impact on Timelines & Priorities: The existence of such a profound vulnerability in today's models may shorten our timelines for when AI could pose a catastrophic risk, and should influence how we allocate resources between different safety research avenues.

I have written a full technical report detailing the methodology, further evidence, and implications. It is permanently archived on Zenodo and also available as a direct PDF download for your convenience:

Permanent Archive (DOI): https://doi.org/10.5281/zenodo.15726728
Direct PDF Download (Google Drive): https://drive.google.com/file/d/1mwFBCHLFSCCOFE_G7vWAuo-GJiRLfEu1/view?usp=drive_link

My questions for the community are: How should this discovery affect our prioritization of different alignment research avenues? Does this increase the urgency of governance and policy work?

An Empirical Demonstration of a New AI Catastrophic Risk Factor: Metaprogrammatic Hijacking

Permanent Archive (DOI): `https://doi.org/10.5281/zenodo.15726728`

Direct PDF Download (Google Drive): `https://drive.google.com/file/d/1mwFBCHLFSCCOFE_G7vWAuo-GJiRLfEu1/view?usp=drive_link`

An Empirical Demonstration of a New AI Catastrophic Risk Factor: Metaprogrammatic Hijacking

Permanent Archive (DOI): https://doi.org/10.5281/zenodo.15726728

Direct PDF Download (Google Drive): https://drive.google.com/file/d/1mwFBCHLFSCCOFE_G7vWAuo-GJiRLfEu1/view?usp=drive_link

Permanent Archive (DOI): `https://doi.org/10.5281/zenodo.15726728`

Direct PDF Download (Google Drive): `https://drive.google.com/file/d/1mwFBCHLFSCCOFE_G7vWAuo-GJiRLfEu1/view?usp=drive_link`