Concrete Steps to Get Started in Transformer Mechanistic Interpretability

By Neel Nanda @ 2022-12-26T13:00 (+18)

This is a linkpost to www.neelnanda.io/mechanistic-interpretability/getting-started

Disclaimer: This post mostly links to resources I've made. I feel somewhat bad about this, sorry! Transformer MI is a pretty young and small field and there just aren't many people making educational resources tailored to it. Some links are to collations of other people's work, and I link to more in the appendix.

Introduction

Feel free to just skip the intro and read the concrete steps

The point of this post is to give concrete steps for how to get a decent level of baseline knowledge for transformer mechanistic interpretability (MI). I try to give concrete, actionable, goal-oriented advice that I think is enough to get decent outcomes - please take this as a starting point, and deviate if something else feels like a better fit for your background and precise goals!

A core belief I have about learning mechanistic interpretability is that you should spend at least a third of your time writing code and playing around with model internals, not just reading papers. MI has great feedback loops, and a large component of the skillset is the practical, empirical skill of being able to write and run experiments easily. Unlike normal machine learning, once you have the basics of MI down, you should be able to run simple experiments on small models within minutes, not hours or days. Further, because the feedback loops are so tight, I don’t think there’s a sharp boundary between reading and doing research. If you want to deeply engage with a paper, you should be playing with the model studied, and testing the paper’s basic claims. 

The intended audience is people who are new-ish to Mechanistic Interpretability but know they want to learn about it - if you have no idea what MI is, check out this Mech Interp Explainer, Circuits: Zoom In, or Chris Olah’s overview post.

Defining “Decent Baseline”

Here’s an outline of what I mean when I say “a decent level of baseline knowledge”

Getting the Fundamentals

A set of goal-oriented steps that I think are important for getting the fundamental skills. If you feel comfortable meeting the success criteria of a step, feel free to skip it.

Callum McDougall has made a great set of tutorials for mechanistic interpretability and TransformerLens, with exercises, solutions and beautiful diagrams. This section is in large part an annotated guide to those!

  1. Learn general Machine Learning prerequisites: There's a certain baseline level of understanding about ML in general that's important context. It's also important to be familiar with an ML framework like PyTorch to actually write code, and to help ground your knowledge.
    • Resource: Read the Barebones Guide to Mechanistic Interpretability Prerequisites, and learn the pre-reqs you're missing (see step 2 for more on transformers)
    • Success criteria: Write and train an MLP in PyTorch to solve MNIST (a minimal sketch of what this involves is given after this list)
      • I have deliberately set this success criterion to emphasise that your goal here is to get enough intuition and context that you can learn more. The goal here is not to get deep knowledge, or to understand all of the sub-fields - ML contains a ton of niches, and this is not on the critical path to exploring MI! 
    • Tips:
      • Try PyTorch over other frameworks (Jax, Tensorflow, etc) if you’re doing MI. It’s best to use einops and einsum for tensor manipulation - without them it’s easy to introduce a lot of subtle bugs.
      • It's very easy to overestimate how much you need to learn general pre-reqs - when in doubt, move on to MI, and go back when you notice something missing! 
  2. Deeply understand transformers: A key skill in MI is having a gears-level model of the transformer in your head - what the parameters and activations are and what they mean, what all of the moving parts are, and how they fit together. You want to build some intuition for what kinds of algorithms a transformer can and cannot implement. 
  3. Familiarise yourself with Mechanistic Interpretability (MI) Tooling and basics: You want to be able to easily write and run experiments as you learn and explore MI, so it's worth some up front investment to learn what's out there. The goal here is familiarity with the basics, not deep expertise - the best way to really learn tooling is by using it in practice to solve problems that you care about.
    • Resource: Callum McDougall’s tutorials for TransformerLens + Induction Heads
      • If you’re new to ML coding, I recommend doing your work in a Colab Notebook (with a GPU) unless you’re confident you know what you’re doing. You don’t want to be wasting time setting up infrastructure!
    • Success criteria: Work your way through the tutorial, compare your work to the answers, and check you understand 80% of the solutions
      • A core goal here is to learn how to use the TransformerLens library for doing mechanistic interpretability of GPT-style language models. 
        • Bonus: Use TransformerLens to load GPT-2 Small in a Colab notebook, run the model, and visualise the activations (see the sketch after this list).
    • Bonus: Data visualization: A core skill in MI is being good at visualizing data. Neural networks are high dimensional objects, and you need to be able to understand what's going on! My research workflow looks like running an experiment, visualizing the data, staring at the data, being confused, forming more hypotheses, and iterating. Plot data often, and in a diversity of ways.
      • Familiarise yourself with a plotting library. My personal favourite is Plotly, but Matplotlib, Bokeh and HoloViews are other options.
        • Some concrete goals: Be able to plot several loss curves on the same plot, make a scatter plot, display a matrix (imshow), and add axis labels, titles, and line labels (see the sketches after this list).
        • Callum McDougall has a great Plotly intro
        • Matplotlib is the most popular, and the easiest to google or get help with from ChatGPT or Copilot. But personally I find it really annoying, unintuitive and limiting. 
      • Play around with Alan Cooney's CircuitsVis library, which lets you pass in tensors to an interactive visualization and show it in a Jupyter notebook. See the existing visualizations here
        • You can write your own visualizations in React, but it's higher effort
    • Bonus: Other tutorials on mechanistic interpretability, to dig more into things. Just pick any of these that sound interesting, and move on if none of them feel shiny:
  4. (Optional) Learn key concepts in MI: Get your head around basic concepts in MI, and get an overview of the field. The goal of this step is a whirlwind overview, not to get a deep and perfect knowledge of everything!
    • Resource: A Comprehensive Mechanistic Interpretability Explainer & Glossary
    • Success criteria: Read through the whole explainer, and be able to follow the gist of most of it. 
      • Feel free to skip or skim sections or definitions that you don’t follow or find interesting.
      • Concrete goal: Be able to look at each section in the table of contents, and for 80% of them have a rough intuition for what that section is about, why you might care about it, and what you would go back to it for
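
To make step 1's success criterion concrete, here's a minimal sketch of the kind of thing it's asking for - a small MLP trained on MNIST in vanilla PyTorch. This is purely illustrative (it assumes torchvision is installed, and downloads MNIST on first run); the exact architecture and hyperparameters don't matter much.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# A small MLP: flatten the 28x28 images, one hidden layer, 10 output classes.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 256),
    nn.ReLU(),
    nn.Linear(256, 10),
)

# Standard MNIST loading via torchvision (downloads to ./data on first run).
train_data = datasets.MNIST(
    root="data", train=True, download=True, transform=transforms.ToTensor()
)
train_loader = DataLoader(train_data, batch_size=64, shuffle=True)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# A single epoch is enough to get the loss well below chance level (~2.3 nats).
for images, labels in train_loader:
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    optimizer.step()
print(f"Final batch loss: {loss.item():.3f}")
```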
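
For the bonus in step 3, loading GPT-2 Small with TransformerLens and poking at its activations can be as short as the sketch below (assuming transformer_lens is installed; the prompt is an arbitrary example, and `["pattern", 0]` is TransformerLens's shorthand for layer 0's attention pattern).

```python
# Easiest to run in a Colab notebook with a GPU, after `pip install transformer_lens`.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")  # GPT-2 Small

prompt = "Mechanistic interpretability lets us look inside a"
logits, cache = model.run_with_cache(prompt)

# The cache maps activation names to tensors, e.g. layer 0's attention pattern
# has shape [batch, n_heads, query_pos, key_pos].
print(model.to_str_tokens(prompt))
print(cache["pattern", 0].shape)

# The model's most likely next token after the prompt.
print(model.to_string(logits[0, -1].argmax()))
```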
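
And for the data visualisation bonus, here's a rough sketch of the kinds of plots I mean, using Plotly Express on made-up placeholder data (the "loss curves" are just exponential decays, purely for illustration).

```python
import numpy as np
import pandas as pd
import plotly.express as px

# Two (fake) loss curves on one plot, with axis labels, line labels and a title.
steps = np.arange(200)
df = pd.DataFrame({
    "step": steps,
    "train loss": np.exp(-steps / 50),
    "val loss": np.exp(-steps / 80),
})
px.line(df, x="step", y=["train loss", "val loss"], title="Example loss curves").show()

# A scatter plot, and a matrix displayed as a heatmap (imshow).
px.scatter(x=np.random.randn(100), y=np.random.randn(100), title="Example scatter").show()
px.imshow(
    np.random.rand(12, 12),
    labels=dict(x="key position", y="query position", color="score"),
    title="Example heatmap",
).show()
```

For CircuitsVis, continuing from the TransformerLens sketch above (reusing model, prompt and cache), an interactive attention-pattern widget in a notebook looks something like this:

```python
import circuitsvis as cv

# attention_patterns takes the prompt's tokens as strings plus a
# [n_heads, query_pos, key_pos] attention tensor, and renders an interactive
# widget when it's the last expression in a notebook cell.
cv.attention.attention_patterns(
    tokens=model.to_str_tokens(prompt),
    attention=cache["pattern", 0][0],  # index [0] drops the batch dimension
)
```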

Paths for Further Exploration

A slightly less concrete collection of different strategies to further explore the field. These are different options, not a list to follow in order. Read all of them, notice if one jumps out at you, and dive into that. If nothing jumps out, start with the “exploring and building on a paper” route. If you spend several weeks in one of these modes and feel stuck, you should zoom out and try one of the other modes for a few days! Further, it’s useful to read through all of the sections - the tips and resources in one often transfer to the others!

Explore and Build On A Paper

Find a paper you’re excited about and try to really deeply engage with it - this means both running experiments and understanding the ideas, and ultimately trying to replicate (most of) it. If this goes well, it can naturally transition into building on the paper’s ideas, exploring confusions and dangling threads, and doing interesting original research.

Work on a Concrete Problem

Find a problem in MI that you’re excited about, and go and try to solve it! Note: A good explicit goal is learning and having fun, rather than doing important research. Doing important original research is just really hard, especially as a first project without significant mentorship! Note: I expect this to be a less gentle introduction than the other two paths - only do this one first if it feels actively exciting to you! Check out this blog post and this walkthrough for accounts of what the actual mech interp research process can look like.

Read Around the Field (Ie Do a Lit Review)

Go and read around the field. Familiarise yourself with the existing (not that many!) papers in MI, try to understand what’s going on, big categories of open problems, what we do and do not know. Focus on writing code as you read and aim to deeply understand a few important papers rather than skimming many. 

Appendix

Advice on compute

Other Resources

See Apart’s Interpretability Playground for a longer list, and a compilation of my various projects here