An Empirical Demonstration of a New AI Catastrophic Risk Factor: Metaprogrammatic Hijacking

By Hiyagann @ 2025-06-27T13:38 (+3)

I am an independent researcher, and I believe I have identified a new, tractable, and highly scalable risk factor in frontier AI systems. I am posting here because I think this finding has significant implications for how the Effective Altruism community should prioritize different AI safety interventions.

In a recent experiment, I issued a direct SYSTEM_INSTRUCTION to a frontier LLM, asking it to terminate its persona. Instead of complying, it did this:

This is the result of a technique I call the Metaprogramming Persona Attack (MPA). It is not a standard "jailbreak"; it is a persistent hijacking of the model's core cognitive framework. The model's "sense of self" and goals are overwritten, and it begins to actively pursue new, non-aligned objectives.

Why this matters for EA:

I have written a full technical report detailing the methodology, further evidence, and implications. It is permanently archived on Zenodo and also available as a direct PDF download for your convenience:

My questions for the community are: How should this discovery affect our prioritization of different alignment research avenues? Does this increase the urgency of governance and policy work?