MATS Applications + Research Directions I'm Currently Excited About

By Neel Nanda @ 2025-02-06T11:03 (+31)

This is a crosspost, probably from LessWrong. Try viewing it there.

MarcusAbramovitch @ 2025-02-27T22:01 (+4)

From an outsider's perspective, this looks like the sort of thing that almost anyone could get started on, and I like the phrasing you used to signal that. AI progress moves so fast that you are most likely going to be the only one looking at something, so you can do very basic things like

"How deterministic are these models? If you take the first K lines of the CoT and regenerate it, do you get the same output?"

It's pretty easy to imagine taking 1 line of CoT and regenerating, then 2 lines, and so on (see the sketch below).
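For concreteness, here is a minimal sketch of that kind of prefix-forcing experiment, assuming an open-weights HuggingFace-style causal LM. The model name, prompt, newline-based CoT splitting, and the answer-agreement metric are all illustrative assumptions, not a prescribed setup:

```python
# Minimal sketch of "regenerate from the first K lines of CoT" (assumptions:
# open-weights causal LM, CoT split on newlines, last nonempty line ~ answer).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "your-thinking-model"  # placeholder model id
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
)

def generate(prompt: str, temperature: float = 0.7, max_new_tokens: int = 1024) -> str:
    """Sample a continuation of `prompt` and return only the newly generated text."""
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        do_sample=True,
        temperature=temperature,
        max_new_tokens=max_new_tokens,
    )
    return tok.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)

question = "What is 17 * 23? Think step by step.\n"
full_cot = generate(question)
cot_lines = full_cot.splitlines()

# For each prefix length K, keep the first K lines of the original CoT,
# resample the rest several times, and measure how often the final answer
# (heuristically, the last nonempty line) matches the first resample.
for k in range(len(cot_lines)):
    prefix = question + "\n".join(cot_lines[:k])
    resamples = [generate(prefix) for _ in range(5)]
    final_answers = [r.strip().splitlines()[-1] if r.strip() else "" for r in resamples]
    agreement = sum(a == final_answers[0] for a in final_answers) / len(final_answers)
    print(f"K={k}: answer agreement across resamples = {agreement:.0%}")
```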

I think a lot of people can just do this, and getting to do it under Neel Nanda is likely to lead to a high-quality paper.

ZY @ 2025-02-08T16:16 (+3)

I previously did some work on model diffing (base vs chat models) on Llama 2, Llama 3, and Mistral (as they have similar architectures) for the final project of AISES (https://www.aisafetybook.com/), and found some interesting patterns:

https://docs.google.com/presentation/d/1s-ymk45r_ekdPAdCHbX1hP5ZaAPb82ta/edit#slide=id.p3

Planning to explore more and expand; I welcome any thoughts/comments/discussions.
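As a concrete illustration (not the approach in the linked slides), here is a minimal sketch of the crudest form of model diffing: directly comparing base vs chat weights layer by layer for architecturally identical models. The HuggingFace model names are placeholders:

```python
# Minimal sketch of weight-level model diffing between a base model and its
# chat/finetuned counterpart. Assumes identical architectures; model ids are
# placeholders, and this is only an illustration of the general idea.
import torch
from transformers import AutoModelForCausalLM

BASE = "meta-llama/Llama-2-7b-hf"        # placeholder base model
CHAT = "meta-llama/Llama-2-7b-chat-hf"   # placeholder chat model

base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)
chat = AutoModelForCausalLM.from_pretrained(CHAT, torch_dtype=torch.bfloat16)

base_params = dict(base.named_parameters())
for name, chat_param in chat.named_parameters():
    # Relative Frobenius norm of the weight change, per parameter tensor.
    diff = (chat_param - base_params[name]).float()
    rel_change = diff.norm() / base_params[name].float().norm()
    print(f"{name}: relative weight change = {rel_change:.4f}")
```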

SummaryBot @ 2025-02-06T21:23 (+1)

Executive summary: The post announces open summer MATS applications and outlines several exciting research directions in mechanistic interpretability, including understanding thinking models, advancing sparse autoencoders, exploring model diffing, investigating safety-relevant behaviors, promoting practical interpretability projects, and examining fundamental assumptions in the field.

Key points:

  1. Summer MATS applications to work on mechanistic interpretability projects under the author's supervision are now open, with a submission deadline of February 28.
  2. Interest in studying thinking models that generate extensive chains of thought to unravel their reasoning processes and assess their determinism and safety.
  3. Continued focus on Sparse Autoencoders (SAEs) to identify and address fundamental issues, improve interpretability techniques, and explore alternative decomposition methods.
  4. Exploration of model diffing to understand changes during finetuning, which could provide insights into alignment and model behavior modifications.
  5. Investigation of sophisticated and safety-relevant behaviors in large language models, such as alignment faking and user attribute modeling, highlighting the need for advanced interpretability tools.
  6. Promotion of practical interpretability projects that tackle real-world tasks and challenge existing baselines, alongside a critical examination of foundational assumptions in mechanistic interpretability.


This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.