Interpretability Will Not Reliably Find Deceptive AI

By Neel Nanda @ 2025-05-04T16:32 (+70)

This is a crosspost, probably from LessWrong. Try viewing it there.
