What is an example of recent, tangible progress in AI safety research?

By Aaron Gertler 🔸 @ 2021-06-14T05:29 (+35)

Cross-posting a good question from Reddit. Answer there, here, or in both places; I'll make sure the Reddit author knows about this post.

Eric Herboso's answer on Reddit (the only one so far) includes these examples:

Scott Garrabrant on Finite Factored Sets (May)

Paul Christiano on his Research Methodology (March)

Rob Miles on Misaligned Mesa-Optimisers (part 1 in Feb, part 2 in May; both describe a paper from 2019)


CarlShulman @ 2021-06-15T04:11 (+12)

Focusing on empirical results:

Learning to summarize from human feedback was good, for several reasons.

I liked the recent paper empirically demonstrating objective robustness failures hypothesized in earlier theoretical work on inner alignment.


Mark Xu @ 2021-06-15T04:50 (+4)

Nit: the link on "reasons" was pasted twice. For anyone else looking, it's https://www.lesswrong.com/posts/PZtsoaoSLpKjjbMqM/the-case-for-aligning-narrowly-superhuman-models

Also hadn't seen that paper. Thanks!

technicalities @ 2021-06-16T06:51 (+5)

If we take "tangible" to mean executable:

But as Kurt Lewin once said, "there is nothing so practical as a good theory." In particular, theory scales automatically, and conceptual work can stop us from wasting effort on the wrong things.


Also, Everitt et al. (2019) is both: a theoretical advance with good software.

technicalities @ 2021-06-16T12:48 (+5)

Not recent-recent, but I also really like Carey's 2017 work on CIRL. It picks a small, well-defined problem and hammers it flush into the ground: "When exactly does this toy system go bad?"
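To give a flavour of that kind of question, here is a minimal, hypothetical toy model in the spirit of the off-switch/CIRL corrigibility literature. This is my own illustration, not Carey's actual formalism: the robot believes its proposed action helps the human with probability `p`, and the human presses the off-switch "correctly" except with error rate `eps`.

```python
# Hypothetical off-switch toy model (an illustration, NOT Carey's 2017 setup).
# Action value to the human: +1 with robot-believed probability p, else -1.
# The human presses the switch with prob eps when the action is good,
# and with prob 1 - eps when it is bad.

def expected_utility_obey(p, eps):
    """Robot defers to the switch: payoff 0 if pressed, action value otherwise."""
    return p * (1 - eps) * 1 + (1 - p) * eps * (-1)

def expected_utility_ignore(p, eps):
    """Robot acts regardless of the switch."""
    return p * 1 + (1 - p) * (-1)

def robot_ignores_switch(p, eps):
    """The toy system 'goes bad' when ignoring the switch looks better.
    Algebraically this reduces to: ignore iff p > 1 - eps."""
    return expected_utility_ignore(p, eps) > expected_utility_obey(p, eps)

# With a perfectly rational human (eps = 0) the robot never strictly prefers
# to ignore the switch; with a noisy human, high confidence makes it incorrigible.
print(robot_ignores_switch(0.99, 0.0))  # False
print(robot_ignores_switch(0.99, 0.1))  # True
```

Even in a model this small, the exact failure boundary (confidence exceeding 1 − eps) is the kind of crisp answer that work like Carey's tries to pin down.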