MATS Applications + Research Directions I'm Currently Excited About

By Neel Nanda @ 2025-02-06T11:03 (+31)

I've just opened summer MATS applications (where I'll supervise people to write mech interp papers). I'd love to get applications from any readers who are interested! Apply here; applications are due Feb 28.

As part of this, I wrote up a list of research areas I'm currently excited about, and thoughts on promising directions within those. I thought this might be of wider interest, so I've copied it in below:

Understanding thinking models

Eg o1, r1, Gemini Flash Thinking, etc - ie models that produce a really long chain of thought when reasoning through complex problems, and seem to be much more capable as a result. These seem like a big deal, and we understand so little about them! And now that we have small thinking models like r1 distilled Qwen 1.5B, they seem quite tractable to study (though larger distilled versions of r1 will be better; I doubt you need full r1).

Sparse Autoencoders

In previous rounds I was predominantly interested in Sparse Autoencoder projects, but I'm comparatively less excited about SAEs now. I still think they're cool, and am happy to get SAE applications and supervise SAE projects, but I think they're unlikely to be a silver bullet, and I expect to diversify my projects a bit more (I'll hopefully write more on my overall takes soon).

Within SAEs, I’m most excited about:

Model diffing

What happens to a model during finetuning? If we have both the original and tuned model, can we somehow take the “diff” between the two to just interpret what changed during finetuning?
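For concreteness, here's a minimal sketch of one naive starting point (my illustration, not a prescribed method): run the same prompts through both models and measure how far apart their residual streams are at each layer. The model names and the metric are placeholder assumptions; any base/finetuned pair with an identical architecture would do.

```python
# Sketch: naive "model diff" via per-layer hidden-state distances.
# Model names are illustrative assumptions (any base vs finetuned pair with
# the same architecture works).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "meta-llama/Llama-2-7b-hf"        # hypothetical base model
TUNED = "meta-llama/Llama-2-7b-chat-hf"  # hypothetical finetuned model

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)
tuned = AutoModelForCausalLM.from_pretrained(TUNED, torch_dtype=torch.bfloat16)

prompts = ["How do I bake bread?", "Explain photosynthesis in one sentence."]

with torch.no_grad():
    for prompt in prompts:
        ids = tok(prompt, return_tensors="pt")
        h_base = base(**ids, output_hidden_states=True).hidden_states
        h_tuned = tuned(**ids, output_hidden_states=True).hidden_states
        # Mean L2 distance between residual streams at each layer: a crude
        # first-pass signal for *where* finetuning changed the model most.
        diffs = [(a - b).float().norm(dim=-1).mean().item()
                 for a, b in zip(h_base, h_tuned)]
        print(prompt)
        for layer, d in enumerate(diffs):
            print(f"  layer {layer}: mean residual diff {d:.3f}")
```

This obviously only tells you roughly *where* things changed, not *what* changed - the interesting versions of this direction (eg interpreting the diff via SAEs trained on both models) go much further.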

Understanding sophisticated/safety relevant behaviour

LLMs are getting good enough that they are starting to directly demonstrate some alignment-relevant behaviours. Most interpretability work tries to advance the field in general by studying arbitrary, often toy, problems, but I'd be very excited to study these phenomena directly!

Being useful

Interpretability is often pretty abstract, pursuing lofty blue-skies goals, and it's hard to tell if your work is total BS or not. I'm excited about projects that take a real task, one that can be defined without ever referencing interpretability, and try to beat non-interp baselines in a fair(ish) fight - if you can do this, it's strong evidence you've learned *something real*.

Investigate fundamental assumptions

There’s a lot of assumptions behind common mechanistic interpretability works, both scientific assumptions and theory of change assumptions, that in my opinion have insufficient evidence. I’d be keen to gather evidence for and against!


  1. I favour the term latent over feature, because feature also refers to the subtly but importantly different concept of "the interpretable concept", which an SAE "feature" imperfectly corresponds to, and it's very confusing for one word to mean both. ↩︎


MarcusAbramovitch @ 2025-02-27T22:01 (+4)

From an outsider perspective, this looks like the sort of thing that almost anyone could get started on, and I like the phrasing you used to signal that. AI progress moves so fast that you are most likely going to be the only one looking at something, and so you can do very basic things like:

"How deterministic are these models? If you take the first K lines of the CoT and regenerate it, do you get the same output?"

It's pretty easy to imagine taking 1 line of CoT and regenerating, then 2 lines, and so on...
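For concreteness, a rough sketch of that experiment (my own illustration; the model ID, sampling settings, and "last line as a proxy for the answer" are all assumptions, using the small r1-distilled Qwen the post mentions):

```python
# Sketch: truncate-and-regenerate test of CoT determinism.
# Model ID and sampling settings are assumptions, not prescriptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # small r1 distill (assumed ID)
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

question = "What is 17 * 24? Think step by step."
prompt = tok.apply_chat_template(
    [{"role": "user", "content": question}],
    tokenize=False, add_generation_prompt=True,
)

def sample(text, max_new_tokens=512):
    ids = tok(text, return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=max_new_tokens,
                         do_sample=True, temperature=0.6)
    return tok.decode(out[0][ids["input_ids"].shape[1]:], skip_special_tokens=True)

full_cot = sample(prompt)     # one full chain of thought
lines = full_cot.splitlines()

for k in (1, 2, 4, 8):
    # Keep the first k lines of the CoT fixed, resample the rest 5 times,
    # and count how many distinct final lines appear across samples.
    prefix = prompt + "\n".join(lines[:k]) + "\n"
    finals = []
    for _ in range(5):
        completion = sample(prefix).strip()
        finals.append(completion.splitlines()[-1] if completion else "")
    print(f"k={k}: {len(set(finals))} distinct final lines out of {len(finals)} samples")
```

A real version would want to properly extract the final answer rather than just grabbing the last line, but the point is that the whole loop fits in a page of code.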

I think a lot of people can just do this, and getting to do it under Neel Nanda is likely to lead to a high-quality paper.

ZY @ 2025-02-08T16:16 (+3)

I previously did some work on model diffing (base vs chat models) on llama2, llama3 and mistral (as they have similar architectures) for the final project of AISES (https://www.aisafetybook.com/), and found some interesting patterns:

https://docs.google.com/presentation/d/1s-ymk45r_ekdPAdCHbX1hP5ZaAPb82ta/edit#slide=id.p3

Planning to explore more and expand; I welcome any thoughts/comments/discussions.
 

SummaryBot @ 2025-02-06T21:23 (+1)

Executive summary: The post announces open summer MATS applications and outlines several exciting research directions in mechanistic interpretability, including understanding thinking models, advancing sparse autoencoders, exploring model diffing, investigating safety-relevant behaviors, promoting practical interpretability projects, and examining fundamental assumptions in the field.

Key points:

  1. Summer MATS applications are now open for supervising mechanistic interpretability projects, with a submission deadline of February 28.
  2. Interest in studying thinking models that generate extensive chains of thought to unravel their reasoning processes and assess their determinism and safety.
  3. Continued focus on Sparse Autoencoders (SAEs) to identify and address fundamental issues, improve interpretability techniques, and explore alternative decomposition methods.
  4. Exploration of model diffing to understand changes during finetuning, which could provide insights into alignment and model behavior modifications.
  5. Investigation of sophisticated and safety-relevant behaviors in large language models, such as alignment faking and user attribute modeling, highlighting the need for advanced interpretability tools.
  6. Promotion of practical interpretability projects that tackle real-world tasks and challenge existing baselines, alongside a critical examination of foundational assumptions in mechanistic interpretability.

 

 

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.