What is an example of recent, tangible progress in AI safety research?
By Aaron Gertler 🔸 @ 2021-06-14T05:29 (+35)
Cross-posting a good question from Reddit. Answer there, here, or in both places; I'll make sure the Reddit author knows about this post.
Eric Herboso's answer on Reddit (the only one so far) includes these examples:
Scott Garrabrant on Finite Factored Sets (May)
Paul Christiano on his Research Methodology (March)
Rob Miles on Misaligned Mesa-Optimisers (part 1 in Feb, part 2 in May; both describe a paper from 2019)
CarlShulman @ 2021-06-15T04:11 (+12)
Focusing on empirical results:
Learning to summarize from human feedback was good, for several reasons.
I liked the recent paper empirically demonstrating objective robustness failures hypothesized in earlier theoretical work on inner alignment.
 
Mark Xu @ 2021-06-15T04:50 (+4)
nit: link on "reasons" was pasted twice. For others it's https://www.lesswrong.com/posts/PZtsoaoSLpKjjbMqM/the-case-for-aligning-narrowly-superhuman-models
Also hadn't seen that paper. Thanks!
technicalities @ 2021-06-16T06:51 (+5)
If we take "tangible" to mean executable:
- A primitive prototype and a framework for safety via debate (2018-19). Bit quiet since.
- Carey's 2019 proof of concept / extension of quantilizers
- Stiennon et al. (2020) is an extremely encouraging example of a large negative "alignment tax" (making it safer also made it work better)
 
But as Kurt Lewin once said, "there's nothing so practical as a good theory". In particular, theory scales automatically, and conceptual work can stop us from wasting effort on the wrong things.
- CAIS (2019) pivots away from the classic agentic model, maybe for the better
- The search for mesa-optimisers (2019) is a step forward from previous muddled thoughts on optimisation, and the authors make predictions we can test soon.
- The Armstrong/Shah discussion of value learning changed my research direction for the better.
Also, Everitt et al. (2019) is both: a theoretical advance with good software.
technicalities @ 2021-06-16T12:48 (+5)
Not recent-recent, but I also really like Carey's 2017 work on CIRL. Picks a small, well-defined problem and hammers it flush into the ground. "When exactly does this toy system go bad?"