What is an example of recent, tangible progress in AI safety research?

By Aaron Gertler 🔸 @ 2021-06-14T05:29 (+35)

Cross-posting a good question from Reddit. Answer there, here, or in both places; I'll make sure the Reddit author knows about this post.

Eric Herboso's answer on Reddit (the only one so far) includes these examples:

Scott Garrabrant on Finite Factored Sets (May)

Paul Christiano on his Research Methodology (March)

Rob Miles on Misaligned Mesa-Optimisers (part 1 in Feb, part 2 in May; both describe a paper from 2019)


CarlShulman @ 2021-06-15T04:11 (+12)

Focusing on empirical results:

Learning to summarize from human feedback was good, for several reasons.

I liked the recent paper empirically demonstrating objective robustness failures hypothesized in earlier theoretical work on inner alignment.


Mark Xu @ 2021-06-15T04:50 (+4)

Nit: the link on "reasons" was pasted twice. For anyone else looking, it's https://www.lesswrong.com/posts/PZtsoaoSLpKjjbMqM/the-case-for-aligning-narrowly-superhuman-models

Also hadn't seen that paper. Thanks!

technicalities @ 2021-06-16T06:51 (+5)

If we take "tangible" to mean executable:

But as Kurt Lewin once said, "there is nothing so practical as a good theory." In particular, theory scales automatically, and conceptual work can stop us from wasting effort on the wrong things.


Also, Everitt et al. (2019) is both: a theoretical advance with good software.

technicalities @ 2021-06-16T12:48 (+5)

Not recent-recent, but I also really like Carey's 2017 work on CIRL. It picks a small, well-defined problem and hammers it flush into the ground: "When exactly does this toy system go bad?"
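To give a flavour of that kind of question, here is a minimal, hypothetical toy model in the spirit of the off-switch/CIRL corrigibility literature. This is my own illustration, not Carey's actual formalism: the robot believes its proposed action helps the human with probability `p`, and the human presses the off-switch "correctly" except with error rate `eps`.

```python
# Hypothetical off-switch toy model (an illustration, NOT Carey's 2017 setup).
# Action value to the human: +1 with robot-believed probability p, else -1.
# The human presses the switch with prob eps when the action is good,
# and with prob 1 - eps when it is bad.

def expected_utility_obey(p, eps):
    """Robot defers to the switch: payoff 0 if pressed, action value otherwise."""
    return p * (1 - eps) * 1 + (1 - p) * eps * (-1)

def expected_utility_ignore(p, eps):
    """Robot acts regardless of the switch."""
    return p * 1 + (1 - p) * (-1)

def robot_ignores_switch(p, eps):
    """The toy system 'goes bad' when ignoring the switch looks better.
    Algebraically this reduces to: ignore iff p > 1 - eps."""
    return expected_utility_ignore(p, eps) > expected_utility_obey(p, eps)

# With a perfectly rational human (eps = 0) the robot never strictly prefers
# to ignore the switch; with a noisy human, high confidence makes it incorrigible.
print(robot_ignores_switch(0.99, 0.0))  # False
print(robot_ignores_switch(0.99, 0.1))  # True
```

Even in a model this small, the exact failure boundary (confidence exceeding 1 − eps) is the kind of crisp answer that work like Carey's tries to pin down.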