Apply to the Redwood Research Mechanistic Interpretability Experiment (REMIX), a research program in Berkeley

By Max Nadeau, Xander123, Buck, Nate Thomas @ 2022-10-27T01:39 (+95)

This winter, Redwood Research is running a coordinated research effort on mechanistic interpretability of transformer models. We’re excited about recent advances in mechanistic interpretability and now want to try to scale our interpretability methodology to a larger group doing research in parallel. 

REMIX participants will work to provide mechanistic explanations of model behaviors, using our causal scrubbing methodology to formalize and evaluate interpretability hypotheses. We hope to produce many more explanations of model behaviors akin to our recent work investigating behaviors of GPT-2-small, toy language models, and models trained on algorithmic tasks. We think this work is a particularly promising research direction for mitigating existential risks from advanced AI systems (more in Goals and FAQ). 

Since mechanistic interpretability is currently a small sub-field of machine learning, we think it’s plausible that REMIX participants could make important discoveries that significantly further the field. We also think participants will learn skills valuable for many styles of interpretability research, and also for ML research more broadly.

Apply here by Sunday, November 13th [DEADLINE EXTENDED] to be a researcher in the program. Apply sooner if you’d like to start early (details below) or receive an earlier response. 

Some key details:

Feel free to email programs@rdwrs.com with questions.

Goals: 

Why you should apply:

Research results. We are optimistic about the research progress you could make during this program (more below). Since mechanistic interpretability is currently a small sub-field of machine learning, we think it’s plausible that REMIX participants could make important discoveries that significantly further the field.

Skill-building. We think this is a great way to gain experience working with language models and interpreting/analyzing their behaviors. The skills you’ll learn in this program will be valuable for many styles of interpretability research, and also for ML research more broadly.

Financial support & community. This is a paid opportunity, and a chance to meet and connect with other researchers interested in interpretability.

 

Why we’re doing this:

Research output. We hope this program will produce research that is useful in multiple ways:

Training and hiring. We might want to hire people who produce valuable research during this program. Also, we expect alumni will learn skills that are valuable for other styles of interpretability research.

Experience running large collaborative research projects. It seems plausible that at some point it will be useful to run a huge collaborative alignment project. We’d like to practice this kind of thing, in the hope that the lessons learned are useful to us or others. 

See “Is this research promising enough to justify running this program?” and “How useful is this kind of interpretability research for understanding models that might pose an existential risk?”

Why do this now?

We think our recent progress in interpretability makes it a lot more plausible for us to reliably establish mechanistic explanations of model behaviors, and therefore get value from a large, parallelized research effort.

A unified framework for specifying and validating explanations. Previously, a big bottleneck on parallelizing interpretability research across many people was the lack of a clear standard of evidence for proposed explanations of model behaviors (which made us expect the research produced to be pretty unreliable). We believe we’ve recently made some progress on this front, developing an algorithm called “causal scrubbing” which allows us to automatically derive an extensive set of tests for a wide class of mechanistic explanations. This algorithm is only able to reject hypotheses rather than confirming them, but we think that this still makes it way more efficient to review the research produced by all the participants. 

Improved proofs of concept. We now have several examples where we followed our methodology and were able to learn a fair bit about how a transformer was performing some behavior.

Tools that allow complicated experiments to be specified quickly. We’ve built a powerful library for manipulating neural nets (and computational graphs more generally) for doing intervention experiments and getting activations out of models. This library allows us to do experiments that would be quite error-prone and painful with other tools.

Who should apply?

We're most excited about applicants comfortable working with (basic) Python, any of PyTorch/TensorFlow/Numpy, and linear algebra. Quickly generating hypotheses about model mechanisms and testing them requires some competence in these domains.

If you don’t understand the transformer architecture, we’ll require that you go through preparatory materials, which explain the architecture and walk you through building one yourself.

We’re excited about applicants with a range of backgrounds; prior experience in interpretability research is not required. The primary activity will be designing, running, and analyzing results from experiments which you hope will shed light on how a model accomplishes some task, so we’re excited about applicants with experience doing empirical science in any field (e.g. economics, biology, physics). The core skill we’re looking for here, among people with the requisite coding/math background, is something like rigorous curiosity: a drive to thoroughly explore all the ways the model might be performing some behavior and narrow them down through careful experiments.

What is doing this sort of research like?

Mechanistic interpretability is an unusual empirical scientific setting in that controlled experimentation is relatively easy, but there’s relatively little knowledge about the kinds of structures found in neural nets.

Regarding the ease of experimentation: 

Regarding the openness of the field:

REMIX participants pursue interpretability research akin to the investigations Redwood has done recently into induction headsindirect object identification (IOI) in small language models, and balanced parenthesis classification, all of which will be released publically soon. You can read more about behavior selection criteria here.

The main activities will be:

The mechanisms for behaviors we’ll be studying are often surprisingly complex, so careful experimentation is needed to accurately characterize them. For example, the Redwood researchers investigating the IOI behavior found that removing the influence of the circuit they identified as primarily responsible had surprisingly little effect on the model’s ability to do IOI. Instead, other heads in the model substantially changed their behavior to compensate for the excision. As the researchers write, “Both the reason and the mechanism of this compensation effect are still unclear. We think that this could be an interesting phenomenon to investigate in future work.” 

Here’s how a Redwood researcher describes this type of research:

It feels a lot of time like you're cycling between: "this looks kind of weird and interesting, not sure what's up with this" and then "I have some vague idea about what maybe this part is doing, I should come up with a test to see if I understand I correctly" and then once you have a test "oh cool, I was kind of right but also kind of wrong, why was I wrong" and then the cycle repeats.

Often it's pretty easy to have a hunch about what some part of your model is doing, but finding a way to appropriately test that hunch is hard and often your hunch might be partially correct but incomplete so your test may rule it out prematurely if you're not careful/specific enough.

It feels like you're in a lab, with your model on the dissection table, and you're trying to pick apart what's going on with different pieces and using different tools to do so - this feels really cool to me, kind of like trying to figure out what's going on with this alien species and how it can do the things it does.

It's really fun to try and construct a persuasive argument for your results: "I think this is what's happening because I ran X, Y, Z experiments that show A, B, C, plus I was able to easily generate adversarial examples based on these hypotheses" - I often feel like there's some sort of imaginary adversary (sometimes not imaginary!) that I have to convince of my results and this makes it extremely important that I make claims that I can actually back up and that I appropriately caveat others.

This research also involves a reasonable amount of linear algebra and probability theory. Researchers will be able to choose how deep they want to delve into some of the trickier math we’ve used for our interpretability research (for example, it turns out that one technique we’ve used is closely related to Wick products and Feynman diagrams).

Schedule

The program will start out with a week of training using our library for computational graph rewrites and investigating model behaviors using our methodology. This week will have a similar structure to MLAB (our machine learning bootcamp), with pair programming and a prepared curriculum. We’re proud to say that past iterations of MLAB have been highly-reviewed – the participants in the second iteration gave an average score of 9.2/10 to the question “How likely are you to recommend future MLAB programs to a friend/colleague?”. 

An approximate schedule for week one:

In future weeks, you’ll split your time between investigating behaviors in these models, communicating your findings to the other researchers, and reading/learning from/critiquing other researchers’ findings.

Miscellaneous Notes

FAQ

What if I can’t make these dates? 

We encourage you to submit an application even if you can’t make the dates; we have some flexibility, and might make exceptions for exceptional applicants. We’re planning to have some participants start as soon as possible to test drive our materials, practice in our research methodology, and generally help us structure this research program so it goes well.

Am I eligible if I’m not sure I want to do interpretability (or other AI alignment) research long-term/am not interested in Effective Altruism?

Yes.

What’s the application process?

You fill out the form, complete some TripleByte tests that assess your programming abilities, then do an interview with us.

Can you sponsor visas?

Given this program is a research sprint rather than a purely educational program, and given the fact that we plan to offer stipends for participants, we can’t guarantee sponsorship of the right-to-work visas required for international participants to be in person. If you are international but studying at a US university, we are optimistic about getting a CPT for you to be able to participate.

However, we still encourage international candidates to apply. We’ll try to evaluate on a case-by-case basis and for exceptional candidates depending on your circumstances, there may be alternatives, like trying to sponsor a visa to have your join later or participating remotely for some period. 

Is this research promising enough to justify running this program?

Buck’s opinion:

I would love to say that this project is paid for by the expected direct value of its research output. My inside view is that the expected direct value does in fact pay for the dollar cost of this project, and probably even the time cost of the organizers. However, there are strong reasons for skepticism–this is a pretty weird thing to do, and it’s sort of weird to be able to make progress on things by having a large group of people work together. So the decision to run this program is to some extent determined by more boring, capacity-building considerations, like training people and getting experience with large projects.

How useful is this kind of interpretability research for understanding models that might pose an existential risk?

This research might end up not being very useful. Here’s Buck’s description of some reasons why this might be the case:

My main concern is that the language model interpretability research we mentioned above was done on model behaviors which were specifically selected because we thought interpreting these behaviors would be easy. (I’ve recently been calling this kind of research “streetlight interpretability”, as in the classic fallacy where you only look for things in the place that’s easiest to look in.) These model behaviors are chosen to be unrepresentatively easy to interpret.

In particular, it’s not at all obvious how to use any existing interpretability techniques (or even how to articulate the goal of interpretability) in situations where the algorithm the model is using is poorly described by simple, human-understandable heuristics. I suspect that tasks like IOI or acronym generation, where the model implements a simple algorithm, are the exception rather than the rule, and models achieve good performance on their training trask mostly by using huge piles of correlations and heuristics. Our preliminary attempts to characterize how the model distinguishes between outputting “is” and “was” indicate that it relies on a huge number of small effects; my guess is that this is more representative of what language models are mostly doing than e.g. the IOI work.

So my guess is that this research direction (where we try to explain model behaviors in terms of human-understandable concepts) is limited, and is more like diving into a part of the problem that I strongly suspect to be solvable, rather than tackling the biggest areas of uncertainty or developing the techniques that might greatly expand the space of model behavior that we can understand. We are also pursuing various research directions that might make a big difference here; I think that research on these improved strategies is quite valuable (plausibly the best research direction), but I think that streetlight interpretability still looks pretty good.

Another concern you might have is that it’s useful to have a few examples of this kind of streetlight interpretability, but there are steeply diminishing marginal returns from doing more work of this type. For what it’s worth, I have so far continued to find it useful/interesting to see more examples of research in this style, but it’s pretty believable that this will slow down after ten more projects of this form or something.

Overall, we think that mechanistic interpretability is one of the most promising research directions for helping prevent AI takeover. Our hope is that mature interpretability techniques will let us distinguish between two ML systems that each behave equally helpfully during training – even having exactly the same input/output behavior on the entire training dataset – but where one does so because it is deceiving us and the other does so “for the right reasons.”[1]

Our experience has been that explaining model behaviors supports both empirical interpretability work – guiding how we engineer interpretability tools and providing practical knowhow – and theoretical interpretability work – for example, leading to the development of the causal scrubbing algorithm. We expect many of the practical lessons that we might learn would generalize to more advanced systems, and we expect that addressing the theoretical questions that we encounter along the way would lead to important conceptual progress.

Currently, almost no interesting behaviors of ML models have been explained – even for models that are tiny compared with frontier systems. We have been working to change this, and we’d like you to help.

For other perspectives on this question, see Chris Olah’s description of the relevance of thorough, small-model interpretability here and Paul Christiano’s similar view here.

Apply here by November 8th to be a researcher in the program, and apply sooner if you want to start ASAP. Sooner applications are also more likely to receive sooner responses. Email programs@rdwrs.com with questions.
 

  1. ^

     The problem of distinguishing between models which behave identically on the training distribution is core to the ELK problem.


Angelina Li @ 2022-11-05T23:08 (+9)

How much is the stipend for this (or the approx. range if you can't say exactly)? I wanted to source that when sending this to a friend and couldn't find the information above

Dan Valentine @ 2022-10-27T12:10 (+4)

Are there any plans for another MLAB in the near future? I ask because the training week of this program is similar, so I’m wondering if this is replacing MLAB.

Max Nadeau @ 2022-10-27T18:00 (+3)

No concrete plans one way or the other.

Eli Rose @ 2022-11-03T01:19 (+1)

Exactly when does the program begin? I couldn't find this info above.

Max Nadeau @ 2022-11-04T02:10 (+1)

From the post: "We plan to have some researchers arrive early, with some people starting as soon as possible. The majority of researchers will likely participate during the months of December and/or January."