The case for becoming a black-box investigator of language models

By Buck @ 2022-05-06T14:37 (+90)

[Cross-posted from the Alignment Forum.]

Interpretability research is sometimes described as neuroscience for ML models. Neuroscience is one approach to understanding how human brains work. But empirical psychology research is another approach. I think more people should engage in the analogous activity for language models: trying to figure out how they work just by looking at their behavior, rather than trying to understand their internals.

I think that getting really good at this might be a weird but good plan for learning some skills that might turn out to be really valuable for alignment research. (And it wouldn’t shock me if “AI psychologist” turns out to be an economically important occupation in the future, and if you got a notable advantage from having a big head start on it.) I think this is especially likely to be a good fit for analytically strong people who love thinking about language and are interested in AI but don’t love math or computer science.

I'd probably fund people to spend at least a few months on this; email me if you want to talk about it.

Some main activities I’d do if I was a black-box LM investigator are:

The skills you’d gain seem like they have a few different applications to alignment:

My guess is that this work would go slightly better if you had access to someone who was willing to write you some simple code tools for interacting with models, rather than just using the OpenAI playground. If you start doing work like this and want tools, get in touch with me and maybe someone from Redwood Research will build you some of them.
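As a rough illustration of the kind of simple tool meant here, the following is a minimal Python sketch, not anything Redwood Research has built. It fills a prompt template with several variants, sends each one to a model, and collects the completions so they can be compared side by side. The names `run_prompt_grid`, `to_csv`, and `fake_model` are hypothetical, and `fake_model` is only a stand-in for a real model call, which you would swap in yourself.

```python
import csv
import io
from typing import Callable, Dict, Iterable, List


def run_prompt_grid(
    complete: Callable[[str], str],
    template: str,
    variants: Iterable[Dict[str, str]],
) -> List[Dict[str, str]]:
    """Fill `template` with each variant, query the model, and record the result."""
    rows = []
    for variant in variants:
        prompt = template.format(**variant)
        rows.append({**variant, "prompt": prompt, "completion": complete(prompt)})
    return rows


def to_csv(rows: List[Dict[str, str]]) -> str:
    """Serialize results as CSV text so they can be opened in a spreadsheet."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def fake_model(prompt: str) -> str:
    # Hypothetical stand-in for a real language model call:
    # it just returns the last word of the prompt in upper case.
    return prompt.split()[-1].upper()


if __name__ == "__main__":
    results = run_prompt_grid(
        fake_model,
        "The {animal} chased the {target}",
        [
            {"animal": "dog", "target": "cat"},
            {"animal": "cat", "target": "mouse"},
        ],
    )
    print(to_csv(results))
```

Passing the model in as a plain function keeps the harness independent of any particular provider, so the same grid of prompts can be rerun against different models and the outputs compared directly.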


Peter Wildeford @ 2022-05-06T22:56 (+11)

I’ve started doing a bunch of this and posting results to my Twitter.

gabriel_wagner @ 2022-05-06T20:14 (+8)

Nice post!

Do you think a person working on this should also have some basic knowledge of ML? Or might it be better to NOT have that, to have a more "pure" outsider view on the behaviour of the models?

Buck @ 2022-05-06T21:44 (+3)

I think that knowing a bit about ML is probably somewhat helpful for this but not very important.

Girish_Sastry @ 2022-05-06T18:32 (+6)

I'd also be interested in funding activities like this. This could inform how much we can learn about models without distributing weights.

tugbazsen @ 2022-05-20T18:50 (+4)

I'm not very well-versed in the CS aspect of ML or AI, but I really enjoyed reading about Redwood's work, and reading this post reminded me of something I found striking then. I am not a trained EFL teacher, but I have a decent grasp of some of the theory and some experience with classroom observation at different levels and with teaching/tutoring EFL. The examples in your "Redwood Research’s current project" write-up are similar in a lot of ways to mistakes that intermediate or almost-proficient EFL speakers would make (not being able to track what pronouns refer to across longer bodies of text, not being able to grasp the implications of certain verbs when applied to certain objects, etc.). This makes me think that getting the perspectives of both language acquisition experts and EFL researchers on your data may also be interesting and useful for this kind of research.

Joe Pusey @ 2022-05-07T21:15 (+1)

Super interesting topic. I sent you a message, Buck :)