Interpretability Will Not Reliably Find Deceptive AI
By Neel Nanda @ 2025-05-04T16:32 (+70)
This is a crosspost, probably from LessWrong. Try viewing it there.