Why mechanistic interpretability does not and cannot contribute to long-term AGI safety (from messages with a friend)

By Remmelt @ 2022-12-19T12:02 (+17)


Peter S. Park @ 2022-12-20T05:33 (+6)

Brilliant and compelling writeup, Remmelt! Thank you so much for sharing it. (And thank you so much for your kind words about my post! I really appreciate it.) 

I strongly agree with you that mechanistic interpretability is very unlikely to contribute to long-term AI safety. To put it bluntly, the fact that so many talented and well-meaning people sink their time into this unpromising research direction is unfortunate. 

I think we AI safety researchers should be more open to new ideas and approaches, rather than getting stuck in the same old research directions that we know are unlikely to meaningfully help. The post "What an actually pessimistic containment strategy looks like" has potentially good ideas on this front.

Remmelt @ 2022-12-20T06:08 (+6)

Yes, I think we independently arrived at similar conclusions.

Yulu Pi @ 2023-02-05T13:12 (+1)

Great post for thinking critically about MI.

MI research is still in its early stages, and many questions about the inner workings of models have yet to be answered. As a result, I expect that, to make the research more tractable for now, interactions with humans and the broader environment are being set aside.

Ultimately, however, these factors are all very important. Another strand of explainability research, XAI, has increasingly emphasized the significance of human-AI interaction and has proposed human-centered approaches. I'm waiting to see how MI expands its scope to cover this issue. I have also been wondering how existing research on transparency can adapt to rapidly evolving architectures if neural networks (or, more specifically, the transformers that have been the focus of MI research) turn out not to be the final form of AGI.