How did you update on AI Safety in 2023?
By Chris Leong @ 2024-01-23T02:21 (+30)
2023 was a massive year in AI. What updates did you make? This includes timelines, likelihood of various risks and/or alignment plans/strategies.
Chris Leong @ 2024-01-26T07:02 (+13)
Risks/P-doom:
• P(doom) went down as a result of the dramatic shift in the Overton window. I'd say a moderate, but not massive, update, because timelines are looking shorter as well.
• No longer worried about the possibility that we (the AI Safety community) are all essentially a bunch of "cranks", given the support Geoffrey Hinton and Yoshua Bengio have voiced for the importance of addressing these concerns.
• Updated towards being more worried about risks such as AI-supported bioterrorism, cyber-attacks and manipulation. There have been results that make me feel we're on the verge of these becoming real issues, and "slow" takeoff is looking more likely, so there would be a greater cost to just tanking these issues. I still see alignment as the most important issue to focus on.
• In terms of outer alignment: due to ChatGPT, I'm much more optimistic about training an AI to behave reasonably in normal situations using RLHF; I'm also more optimistic about using such techniques to tell AIs to behave conservatively in weird philosophical thought experiments. My main remaining worry related to outer alignment is figuring out corrigibility, lest we produce an AI that works well in the current context but can't adapt to new circumstances.
Timelines:
• Long-timeline worlds feel less likely. Even though some of the capabilities of ChatGPT/GPT-4 are not that surprising to people who were following capability progress closely, the longer things continue as they are, the less time there is for an unexpected slow-down to occur before we hit AGI.
• More optimistic about evals work delivering value, largely due to the openness of governments and companies to evals work.
Governance:
• Far more optimistic about policy than before due to the opening up of the Overton window.
• More complicated feelings on a pause as a result of the AI Pause Debate: I now understand that the logistics of making a pause net-positive would be much more complicated than I first realised. I still think that pushing for a pause to be part of the public conversation/one of the options considered, while not completely risk-free, is a pretty strong bet.
• Became a huge fan of the Tony Blair Institute's proposal for the UK to create an organisation called Sentinel, which would perform research to help figure out AI policy.
• More worried about the threat of open-source AI, given how fast it is catching up with GPT-4 and Facebook's decision to champion open-source.
Groups:
• Less confident in Sam Altman's leadership of OpenAI.
• More worried about e/acc; previously I thought they were so unimportant that we should just ignore them rather than risk amplifying their profile.
• More optimistic about allying with people concerned about near-termist risks where our interests align (largely due to the impact of the FLI letter).
Technical alignment:
• More optimistic about the value of empirical alignment research and less optimistic about the value of agent foundations research.
• I now feel the field is mature enough that "workhorse" researchers can make a significant contribution (vs. before when creativity to discover new research directions seemed more vital).
• More optimistic about approaches that take advantage of the linearity in neural networks.
• More optimistic about interpretability progress (due to a number of results, including dictionary learning resolving superposition).
• I spent a lot of time this year trying to read up about as many alignment proposals as possible. I now think it would have been better for me to have spent less time doing this and to have spent more time focusing on doing concrete work.
Movement-building:
• Movement-building work to grow the pool of applicants to programs like SERI MATS seems less important because these programs are much more competitive these days. May be better to attempt to increase mentorship opportunities or to focus on people who are more experienced in terms of research or AI.
Chris Leong @ 2024-01-27T11:22 (+2)
Looks like outer alignment is actually more difficult than I thought. Sherjil Ozair, a former DeepMind employee, writes:
"From my experience doing early RLHF work for Gemini, larger models exploit the reward model more. You need to constantly keep collecting more preferences and retraining reward models to make it not exploitable. Otherwise you get nonsensical responses which have exploited the idiosyncracy of your preferences data. There is a reason few labs have done RLHF successfully"
In other words, even though we look at things like ChatGPT and go, "Wow, this is surprisingly aligned, I guess alignment is easier than we thought", we don't see all of the hard work that had to go into making it aligned. And perhaps as AIs become more powerful, the amount of work required to align them will exceed what is humanly possible.
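To make the dynamic in the quote a bit more concrete, here's a rough, purely illustrative sketch of the loop Ozair describes: optimise the policy against a reward model, check whether it has started exploiting that model, and if so collect more preferences and retrain. Every function name here is a hypothetical placeholder, not any lab's actual pipeline.

```python
# Minimal illustrative sketch of the iterative RLHF loop described above.
# All functions are hypothetical placeholders standing in for real training steps.
import random

def collect_preferences(n):
    """Placeholder: gather n human preference comparisons."""
    return [random.random() for _ in range(n)]

def train_reward_model(preferences):
    """Placeholder: fit a reward model to the preference data."""
    return {"data_size": len(preferences)}

def train_policy(reward_model):
    """Placeholder: optimise the policy against the current reward model."""
    return {"optimised_against": reward_model["data_size"]}

def reward_is_exploited(policy, reward_model):
    """Placeholder check: a high reward-model score paired with poor held-out
    human ratings would suggest the policy is exploiting idiosyncrasies in the
    preference data."""
    return random.random() < 0.5  # stand-in for a real evaluation

preferences = collect_preferences(10_000)
reward_model = train_reward_model(preferences)

for round_ in range(5):
    policy = train_policy(reward_model)
    if not reward_is_exploited(policy, reward_model):
        break
    # Exploitation detected: collect more preferences and retrain the reward model,
    # which is the ongoing, invisible work the quote is pointing at.
    preferences += collect_preferences(10_000)
    reward_model = train_reward_model(preferences)
```

The point of the sketch is just that the "retrain when exploited" arm keeps firing as models get stronger, so the alignment effort scales with capability rather than being a one-off cost.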
Nick K. @ 2024-01-26T08:32 (+1)
May I ask what your feelings on a pause were beforehand?
Chris Leong @ 2024-01-26T08:39 (+2)
I think I was likely 90% in favour of a 6-month pause, mostly as a way to wake people up. I guess my main update from that debate was the difficulty of actually implementing a pause.
If you decide to start pushing for a pause, you don't get nuanced control over when the pause occurs (you likely have to start pushing for it at least a couple of years ahead of when you want it to occur). Further, it's quite likely that you accidentally reduce the amount of crunch time by reducing the gap between the leading players and the rest. If this happens, a pause would likely be net-negative.
For an indefinite pause, it's unclear that you'll be able to unpause when necessary to avoid someone else front-running you, particularly because you might have to make alliances with people who will want to keep it paused.
So while it may still be worth pausing, it's very hard to get the details right so that it is robustly net-positive.
Jay Bailey @ 2024-01-24T05:00 (+6)
My p(doom) went down slightly (from around 30% to around 25%), mainly as a result of how GPT-4 caused governments to begin taking AI seriously in a way I didn't predict. My timelines haven't changed - the only capability increase of GPT-4 that really surprised me was its multimodal nature. (Thus, governments waking up to this was a double surprise, because it clearly surprised them in a way that it didn't surprise me!)
I'm also less worried about misalignment and more worried about misuse when it comes to the next five years, due to how LLMs appear to behave. It seems that LLMs aren't particularly agentic by default, but can certainly be induced to perform agent-like behaviour - GPT-4's inability to do this well seems to be a capability issue that I expect to be resolved in a generation or two. Thus, I'm less worried about the training of GPT-N but still worried about the deployment of GPT-N. It makes me put more credence in the slow takeoff scenario.
This also makes me much more uncertain about the merits of pausing in the short-term, like the next year or two. I expect that if our options were "Pause now" or "Pause after another year or two", the latter is better. In practice, I know the world doesn't work that way and slowing down AI now likely slows down the whole timeline, which complicates things. I still think that government efforts like the UK's AISI are net-positive (I'm joining them for a reason, after all) but I think a lot of the benefit to reducing x-risk here is building a mature field around AI policy and evaluations before we need it - if we wait until I think the threat of misaligned AI is imminent, that may be too late.