How the AI safety technical landscape has changed in the last year, according to some practitioners

By tlevin @ 2024-07-26T19:06 (+84)

I asked the Constellation Slack channel how the technical AIS landscape has changed since I last spent substantial time in the Bay Area (September 2023), and I figured it would be useful to post this (with the permission of the contributors to post either with or without attribution). Curious if commenters agree or would propose additional changes!

This conversation has been lightly edited to preserve anonymity.

Me: One reason I wanted to spend a few weeks in Constellation was to sort of absorb-through-osmosis how the technical AI safety landscape has evolved since I last spent substantial time here in September 2023, but it seems more productive to just ask here "how has the technical AIS landscape evolved since September 2023?" and then have conversations armed with that knowledge. The flavor of this question is like, what are the technical directions and strategies people are most excited about, do we understand any major strategic considerations differently, etc -- interested both in your own updates and your perceptions of how the consensus has changed!

Zach Stein-Perlman: Control is on the rise

Anonymous 1: There are much better “model organisms” of various kinds of misalignment, e.g. the stuff Anthropic has published, some unpublished Redwood work, and many other things

Neel Nanda: Sparse Autoencoders are now a really big deal in mech interp and where a lot of the top teams are focused, and I think are very promising, but have yet to conclusively prove themselves at beating baselines in a fair fight on a real world task

Neel Nanda: Dangerous capability evals are now a major focus of labs, governments and other researchers, and there are clearer ways that technical work can directly feed into governance

(I think this was happening somewhat pre-September, but it feels much more prominent now)

Anonymous 2: Lots of people (particularly at labs/AISIs) are working on adversarial robustness against jailbreaks, in part because of RSP commitments/commercial motivations. I think there's more of this than there was in September.

Anonymous 1: Anthropic and GDM are both making IMO very sincere and reasonable efforts to plan for how they’ll make safety cases for powerful AI.

Anonymous 1: In general, there’s substantially more discussion of safety cases

Anonymous 2: Since September, a bunch of many-author scalable oversight papers have been published, e.g. this, this, this. I haven't been following this work closely enough to have a sense of what update one should make from this, and I've heard rumors of unsuccessful scalable oversight experiments that never saw the light of day, which further muddies things

Anonymous 3: My impression is that infosec flavoured things are a top ~3 priority area for a few more people in Constellation than last year (maybe twice as many people as last year??).

Building cyberevals and practically securing model weights at frontier labs seem to be the main project areas people are excited about (followed by various kinds of threat modelling and security standards).

Chris Leong @ 2024-07-27T02:48 (+7)

I don’t know the exact dates, but: a) proof-based methods seem to be receiving a lot of attention; b) def/acc is becoming more of a thing; c) there is more focus on concentration of power risk (tbh, while there are real risks here, I suspect most work here is net-negative)