An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers

By Neel Nanda @ 2022-10-18T21:23 (+19)

This is a linkpost to https://www.neelnanda.io/mechanistic-interpretability/favourite-papers

Introduction

This is an extremely opinionated list of my favourite mechanistic interpretability papers. Each one is annotated with my key takeaways, what I like about it, which bits to engage with deeply, which to skim (and what to focus on when skimming), and which bits I don’t care about and recommend skipping, along with fun digressions and various hot takes.

This is aimed at people trying to get into the field of mechanistic interpretability (especially Large Language Model (LLM) interpretability). I’m writing it because I’ve benefited a lot from hearing the unfiltered and honest opinions of other researchers, especially when first learning about something, and I think it’s valuable to make this kind of thing public! On the flip side, though, this post is explicitly about my personal opinions - I think some of these takes are controversial and other people in the field would disagree.

The four top-level sections are ordered by priority, but papers within each section are ordered arbitrarily - follow your curiosity!

Priority 1: What is Mechanistic Interpretability?

Priority 2: Understanding Key Concepts in the field

Priority 3: Expanding Understanding

Language Models

Algorithmic Tasks

Image Circuits

Priority 4: Bonus