Shallow review of live agendas in alignment & safety

By Gavin @ 2023-11-27T11:33 (+76)

Summary

You can’t optimise an allocation of resources if you don’t know what the current one is. Existing maps of alignment research are mostly too old to guide you, and the field has nearly no ratchet: no common knowledge of what everyone is doing and why, what has been abandoned and why, what has been renamed, what relates to what, what is going on. 

This post is mostly just a big index: a link-dump for as many currently active AI safety agendas as we could find. But even a link-dump is plenty subjective. It maps work to conceptual clusters 1-1, aiming to answer questions like “I wonder what happened to the exciting idea I heard about at that one conference”, “I just read a post on a surprising new insight and want to see who else has been working on this”, and “I wonder roughly how many people are working on that thing”. 

This doc is unreadably long, so that it can be Ctrl-F-ed. Also this way you can fork the list and make a smaller one. You can find even more deleted material here.

Most of you should only read the editorial and skim the section you work in.

Our taxonomy:

  1. Understand existing models (evals, interpretability, science of DL)
  2. Control the thing (prevent deception, model edits, value learning, goal robustness)
  3. Make AI solve it (scalable oversight, cyborgism, etc)
  4. Theory (galaxy-brained end-to-end, agency, corrigibility, ontology, cooperation)
     

We don’t distinguish between massive labs, individual researchers, and sparsely connected networks of people working on similar stuff. The funding amounts and full-time employee estimates might be a reasonable proxy for scale.

The categories we chose have substantial overlap; see the “see also”s for closely related work.

Please point out if we mistakenly round one thing off to another, miscategorise someone, or otherwise state or imply falsehoods. We will edit.

Unlike the late Larks reviews, we’re not primarily aiming to direct donations. But if you enjoy reading this, consider donating to Manifund, MATS, or LTFF, or to Lightspeed for big-ticket amounts: some good work is bottlenecked by money, and you have free access to the service of specialists in giving money for good work.
 

Meta

When I (Gavin) got into alignment (actually it was still ‘AGI Safety’), people warned me it was pre-paradigmatic. They were right: in the intervening 5 years, the live agendas have changed completely.[1] So here’s an update. 

I wanted this to be a straight technical alignment doc, but people pointed out that this would exclude most work (e.g. evals and non-ambitious interpretability, which are safety but not alignment), so I made it a technical AGI safety doc. Plus ça change.

The only selection criterion is “I’ve heard of it and >= 1 person was recently working on it”. I don’t go to parties, so it’s probably a couple of months behind. 

Obviously this is the Year of Governance and Advocacy, but I exclude all this good work: by its nature it gets attention. I also haven’t sought out the notable amount of work by ordinary labs and academics who don’t frame their work as alignment. Nor the secret work.

Chekhov’s evaluation: I include Yudkowsky’s operational criteria (Trustworthy command? Closure? Opsec? Commitment to the common good? Alignment mindset?) but don’t score them myself. The point is not to throw shade but to remind you that we often know little about each other. 

You are unlikely to like my partition into subfields; here are others.

No one has read all of this material, including us. Entries are based on public docs or private correspondence where possible, but the post probably still contains >10 inaccurate claims. Shouting at us is encouraged. If I’ve missed you (or missed the point), please draw attention to yourself. See you in 5 years.
 

Editorial 

Agendas

1. Understand existing models

 

Evals

(Figuring out how a trained model behaves.)
 

Various capability evaluations

Various red-teams

Eliciting model anomalies 

Alignment of Complex Systems: LLM interactions

The other evals (groundwork for regulation)

Much of evals orgs’ and governance teams’ work is something else: developing politically legible metrics, processes, and shocking case studies. The aim is to motivate and underpin actually sensible regulation. 

This is important work, arguably the highest-leverage in the very short term. But this is a technical alignment post. I include this section to emphasise that these other evals are different from understanding how dangerous capabilities have emerged or might emerge.
 

Interpretability 

(Figuring out what a trained model is actually computing.)[2]
 

Ambitious mech interp

Concept-based interp 

Causal abstractions

EleutherAI interp

Activation engineering (as unsupervised interp)

Leap

Understand learning

(Figuring out how the model figured it out.)
 

Timaeus: Developmental interpretability & singular learning theory 

Levoso Algorithm Interpretability

Various other efforts:


2. Control the thing

(Figuring out how to predictably affect model behaviour.)
 

Prosaic alignment / alignment by default 

Redwood: control evaluations

 

Safety scaffolds

 

Prevent deception 

Through methods besides mechanistic interpretability.
 

Redwood: mechanistic anomaly detection

Indirect deception monitoring 

Anthropic: externalised reasoning oversight

Surgical model edits

(Interventions on model internals.)
 

Activation engineering 

Getting it to learn what we want

(Figuring out how to control what the model figures out.)
 

Social-instinct AGI

 

Imitation learning

Reward learning 

Goal robustness 

(Figuring out how to make the model keep doing ~what it has been doing so far.)
 

Measuring OOD

Concept extrapolation 

Mild optimisation


3. Make AI solve it

(Figuring out how models might help with figuring it out.)
 

OpenAI: Superalignment 

Supervising AIs improving AIs

Cyborgism

 

See also Simboxing (Jacob Cannell).
 

Scalable oversight

(Figuring out how to make it easier for humans to supervise models. Hard to cleanly distinguish from ambitious mechanistic interpretability, but here we are.)
 

Task decomp

Recursive reward modelling is supposedly not dead, but is instead one of the tools Superalignment will build.

Another line tries to make something honest out of chain of thought / tree of thought.
 

Elicit (previously Ought)

Adversarial 

Deepmind Scalable Alignment

Anthropic / NYU Alignment Research Group / Perez collab

 

See also FAR (below).


4. Theory 

(Figuring out what we need to figure out, and then doing that. This used to be all we could do.)
 

Galaxy-brained end-to-end solutions
 

The Learning-Theoretic Agenda 

Open Agency Architecture

Provably safe systems

Conjecture: Cognitive Emulation (CoEms)

Question-answer counterfactual intervals (QACI)

Understanding agency 

(Figuring out ‘what even is an agent’ and how it might be linked to causality.)
 

Causal foundations

Alignment of Complex Systems: Hierarchical agency

The ronin sharp left turn crew 

Shard theory

boundaries / membranes

disempowerment formalism

Performative prediction

Understanding optimisation

Corrigibility

(Figuring out how we get superintelligent agents to keep listening to us. Arguably scalable oversight and superalignment are ~atheoretical approaches to this.)


Behavior alignment theory 

The comments in this thread are extremely good – but none of the authors are working on this!! See also Holtman’s neglected result. See also EJT (and formerly Petersen). See also Dupuis.

 

Ontology identification 

(Figuring out how superintelligent agents think about the world, and how we get them to actually tell us what they know. Much of interpretability is incidentally aiming at this.)
 

ARC Theory 

Natural abstractions 

Understand cooperation

(Figuring out how inter-AI and AI/human game theory should or would work.)
 

CLR 

FOCAL 

 

See also higher-order game theory. We moved CAIF to the “Research support” appendix. We moved AOI to “misc”.


5. Labs with miscellaneous efforts

(Making lots of bets rather than following one agenda, which is awkward for a topic taxonomy.)
 

 Deepmind Alignment Team 

Apollo

Anthropic Assurance / Trust & Safety / RSP Evaluations / Interpretability

FAR 

Krueger Lab

AI Objectives Institute (AOI)

 

See the appendices for even more, you glutton.


More meta

If you enjoyed reading this, consider donating to Lightspeed, MATS, Manifund, or LTFF: some good work is bottlenecked by money, and some people specialise in giving away money to enable it.

Conflicts of interest: one in expectation (I’ve applied for an LTFF grant for this doc but wrote the whole thing without funding). I often work with ACS and PIBBSS and have worked with Team Shard. CHAI once bought me a burrito. 

If you’re interested in doing or funding this sort of thing, get in touch at hi@arbresearch.com. I never thought I’d end up as a journalist, but stranger things will happen.


 

Thanks to Alex Turner, Neel Nanda, Jan Kulveit, Adam Gleave, Alexander Gietelink Oldenziel, Marius Hobbhahn, Lauro Langosco, Steve Byrnes, Henry Sleight, Raymond Douglas, Robert Kirk, Yudhister Kumar, Quratulain Zainab, Tomáš Gavenčiak, Joel Becker, Lucy Farnik, Oliver Hayman, Sammy Martin, Jess Rumbelow, Jean-Stanislas Denain, Ulisse Mini, David Mathers, Chris Lakin, Vojta Kovařík, Zach Stein-Perlman, and Linda Linsefors for helpful comments.


 

  1. ^

    Unless you zoom out so far that you reach vague stuff like “ontology identification”. We will see whether this total turnover happens again by 2028; I suspect a couple of the current agendas will still be around this time.

  2. ^

    > one can posit neural network interpretability as the GiveDirectly of AI alignment: reasonably tractable, likely helpful in a large class of scenarios, with basically unlimited scaling and only slowly diminishing returns. And just as any new EA cause area must pass the first test of being more promising than GiveDirectly, so every alignment approach could be viewed as a competitor to interpretability work. – Niplav


RyanCarey @ 2023-11-28T09:14 (+2)

Causal Foundations is probably 4-8 full-timers, depending on how you count the small-to-medium slices of time from various PhD students. Several of our 2023 outputs seem comparably important to the deception paper: 

Gavin @ 2023-11-28T10:07 (+2)

excellent, thanks, will edit

Jobst Heitzig (vodle.it) @ 2023-11-29T16:01 (+1)

Where in your taxonomy does the design of AI systems go – what high-level architecture to use (non-modular? modular with a perception model, world-model, evaluation model, planning model etc.?), what type of function approximators to use for the modules (ANNs? Bayesian networks? something else?), what decision theory to base it on, what algorithms to use to learn the different models occurring in these modules (RL? something else?), how to curate training data, etc.?

Gavin @ 2023-11-29T19:27 (+2)

It's not a separate approach; the non-theory agendas, and even some of the theory agendas, have their own answers to these questions. I can tell you that almost everyone besides CoEms and OAA is targeting NNs, though.

Jobst Heitzig (vodle.it) @ 2023-11-29T21:46 (+1)

"targeting NNs" sounds like work that takes a certain architecture (NNs) as a given rather than work that aims at actively designing a system.

To be more specific: under the proposed taxonomy, where would a project be sorted that designs agents composed of a Bayesian network as a world model and an aspiration-based probabilistic programming algorithm for planning?

Gavin @ 2023-11-30T10:12 (+2)

Well, there are a lot of different ways to design an NN.

That sounds related to OAA (minus the vast verifier they also want to build), so depending on the ambition it could be "End to end solution" or "getting it to learn what we want" or "task decomp". See also this cool paper from authors including Stuart Russell.

Jobst Heitzig (vodle.it) @ 2023-11-30T17:47 (+1)

What is OAA? And, more importantly: where now would you put it in your taxonomy?

Gavin @ 2023-12-01T09:03 (+2)

https://www.lesswrong.com/posts/pHJtLHcWvfGbsW7LR/roadmap-for-a-collaborative-prototype-of-an-open-agency

I put it in "galaxy-brained end-to-end solutions" for its ambition but there are various places it could go.