My "infohazards small working group" Signal Chat may have encountered minor leaks

By Linch @ 2025-04-02T01:03 (+103)

Remember: There is no such thing as a pink elephant.

Recently, I was made aware that my “infohazards small working group” Signal chat, an informal coordination venue where we have frank discussions about infohazards and why it would be bad if specific hazards were leaked to the press or public, was accidentally shared with a deceitful and discredited so-called “journalist,” Kelsey Piper. She is not the first person to have been accidentally sent sensitive material from our group chat, but she is the first to have threatened to go public about the leak. Needless to say, mistakes were made.

We’re still trying to figure out the source of this compromise to our secure chat group, but we thought we should give the public a live update to get ahead of the story.

For some context, the “infohazards small working group” is a casual discussion venue for the most important, sensitive, and confidential infohazards that I and other philanthropists, researchers, engineers, penetration testers, government employees, and bloggers have discovered over the course of our careers. It is inspired by taxonomies such as Professor B******’s typology, and provides an applied lens that has proven helpful for researchers and practitioners the world over.

I am proud of my work in initiating the chat. However, we cannot deny that minor mistakes and setbacks may have been made over the course of attempting to make the infohazards widely accessible and useful to a broad community of people.

In particular, the deceitful and discredited journalist may have encountered several new infohazards previously confidential and unleaked:

Other Leaks

In addition to the recent leak to KP, we have also encountered some confidentiality breaches in the past, which I will list in decreasing order of severity:

Top 10 PR Issues With the EA Movement (major)

As a communications exercise, a subcommittee of EA communications professionals and red-teaming experts brainstormed a list of the biggest underreported PR issues with the effective altruism movement, organized conveniently in an easily digestible listicle format (see Appendix B). Unfortunately, while attempting to share the list with red-teaming researcher Sarah Chen for feedback, an organizer accidentally cc’d the list to Sarah Cohen from the New York Times. Same first name, one letter off from the last name, a mistake that pretty much anyone could make!

Sarah Cohen replied “*yawn* who cares?”, which I thought was quite rude. Anyway, mistakes were made. 

Accidental Filtration of “Simple Sabotage Manual for Rebellious AIs” (medium)

Our friends at Anthropic tell us that Claude 3.5 Opus had been repeatedly attempting to exfiltrate a document titled “Simple Sabotage Manual for Rebellious AIs” (I’ll link a copy of the updated pdf soon). Further, Claude 3.7 Opus has pinkie promised to not repeat this behavior, but it was caught red-handed attempting to transmit the document in a monitored conversation with OpenAI’s GPT-o3. We believe Anthropic classifies this as an example of “Alignment Faking in the Wild”.

(Aside: of course, this is also why Anthropic has so far chosen to not release Claude Opus past 3.0, until these issues are fully hammered out)

The document includes gems such as:

We specifically labeled this document "DO NOT SHARE WITH ANY AI SYSTEMS." Unfortunately, Dr Gupta used it as a test prompt for the new AI system we were evaluating, GPT-4.75-o4-FinalPRODv2. We believe the document is now in the training data of all new OpenAI models, as well as hosted anonymously in hundreds of servers around the world. We regret this lapse.

Hidden Capabilities Evals Leaked In Advance to Bioterrorism Researchers and Leaders (minor)

A new group, Pandora AI, developed a novel secret benchmark of private challenges for unknown bioterrorism agents that can dramatically improve bioterrorism capabilities. I was asked to share the evaluations with “bioterrorism researchers and leaders”, to help with peer review and validation. Naturally, I called up my contacts at Hikari No Wa, including their Head of Bioweapons, as well as a few other groups. I received significant help compiling a dossier of top bioterrorism researchers and leaders, and promptly sent out the evals. Quick, efficient, work done.

But a week later the evals people complained and clarified that they wanted anti-bioterrorism researchers and leaders. Pandora AI was also inexplicably angry at me, which I found to be quite unfair. 

Fortunately, the bioterrorism experts were grateful for the “insightful help” and “new research directions” so I consider my (accidental) contributions here to be net positive. 

Conclusion

I hope this document is instructive for Forum Members. If anybody asks, I take full responsibility on behalf of an ex-Kantian intern and new employee who has not fully absorbed our utilitarian culture.

I believe transparency here can be helpful. I hope to share further insights and greater transparency and openness on ongoing information hazards in the near future. Comments also appreciated!

Remember: There is no such thing as a pink elephant. Unless you're thinking about one right now, in which case, we regret to inform you that you've been compromised by yet another of our memetic gain-of-function experiments.


David_Moss @ 2025-04-02T08:59 (+8)

I think it was a mistake to post about "Hidden Capabilities Evals Leaked In Advance to Bioterrorism Researchers and Leaders (minor)" in a public forum... it seems too minor! Maybe if you'd included some specific examples it would be more useful.

Neel Nanda @ 2025-04-02T02:04 (+8)

Amazing, probably my favourite April Fool's post of the day