A Barebones Guide to Mechanistic Interpretability Prerequisites

By Neel Nanda @ 2022-11-29T18:43 (+54)

This is a linkpost to https://neelnanda.io/mechanistic-interpretability/prereqs

Co-authored by Neel Nanda and Jess Smith

Crossposted on the suggestion of Vasco Grilo

Why does this exist?

People often get intimidated when trying to get into AI or AI Alignment research, thinking that the gulf between where they are and where they need to be is huge. This presents practical concerns for people trying to change fields: we all have limited time and energy. And for the most part, people wildly overestimate the actual core skills required.

This guide is our take on the essential skills required to understand mechanistic interpretability, write code for it, and ideally contribute useful research to the field. We hope that it’s useful and unintimidating. :)

Core Skills:

Beyond the above, if you have the prerequisites, a good way to get further into the field may be to check out my extremely opinionated list of my favourite mechanistic interpretability papers.

Note that there are a lot more skills in the “nice-to-haves”, but I think the best way to improve at something is generally to get your hands dirty and engage with the research ideas directly, rather than making sure you learn every nice-to-have skill first. If you have the above, I think you should just jump in and start learning about the topic! Especially for the coding-related skills, your focus should not be on getting your head around concepts; it should be on doing: actually writing code and playing around with things (a minimal sketch of what that might look like is below). The challenge of making something that actually works, and dealing with all of the unexpected practical problems that arise, is the best way of really getting this.
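As one illustration (not from the original post) of what “playing around” might look like in practice, here is a minimal sketch using the TransformerLens library, assuming it is installed (`pip install transformer_lens`) along with PyTorch; the prompt and the specific activations inspected are arbitrary choices for the example.

```python
# Illustrative sketch: load a small model with the TransformerLens library
# and poke at its predictions and internal activations.
from transformer_lens import HookedTransformer

# GPT-2 small is quick to download and run on a laptop.
model = HookedTransformer.from_pretrained("gpt2")

prompt = "When Mary and John went to the store, John gave a drink to"
# run_with_cache returns the logits plus a cache of all intermediate activations.
logits, cache = model.run_with_cache(prompt)

# What does the model predict as the next token?
next_token = logits[0, -1].argmax()
print("Predicted next token:", model.to_string(next_token))

# Inspect layer 0's attention pattern: shape [batch, n_heads, seq_len, seq_len].
attn_pattern = cache["pattern", 0]
print("Layer 0 attention pattern shape:", attn_pattern.shape)
```

From here, the natural next step is to keep asking small concrete questions (which head attends to which token? what happens if you change the prompt?) and answer them by modifying and re-running the code.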


Miguel @ 2022-11-30T14:19 (+2)

Thank you for this post, very relevant to what I'm researching - the Goal Misgeneralization problem