ARC Evals: Responsible Scaling Policies

By Zach Stein-Perlman @ 2023-09-28T04:30 (+16)

This is a linkpost to https://evals.alignment.org/blog/2023-09-26-rsp/

We’ve been consulting with several parties on responsible scaling policies (RSPs). An RSP specifies what level of AI capabilities an AI developer is prepared to handle safely with their current protective measures, and conditions under which it would be too dangerous to continue deploying AI systems and/or scaling up AI capabilities until protective measures improve.

We think RSPs are one of the most promising paths forward for reducing risks of major catastrophes from AI. We’re excited to advance the science of model evaluations to help labs implement RSPs that reliably prevent dangerous situations, but aren’t unduly burdensome and don’t prevent development when it’s safe.

This page will explain the basic idea of RSPs as we see it, then discuss:

Why we think RSPs are promising. In brief (more below):

What we see as the key components of a good RSP, with sample language for each component. In brief, we think a good RSP should cover:

Adopting an RSP should be a strong and reliable signal that an AI developer will in fact identify when it’s too dangerous to keep scaling up capabilities, and react appropriately.

 

I'm excited about labs adopting RSPs for several reasons:

 

Possible discussion on Twitter here and here.


blueberry @ 2023-09-28T13:12 (+1)

1. I like the idea of concrete (publicly stated) pre-defined measures, since it lowers the risk of safety standards/targets shifting. It would be a substantial improvement over what we have today, especially if there's coordination between top labs.

2. The graph shows jumps where y increases at a rate greater than x. Has this ever happened before? What we've seen so far is more of a mirrored L. First we move along the x-axis, later (to a smaller degree) along the y-axis. 

3. The line between the red and blue area should be heavily blurred/striped. This might seem like an aesthetic nitpick, but we can't map the edges of what we've never seen. Our current perceptions are thought up by human minds that are innately tuned to empathize with and predict human behavior, which unwittingly leads to thinking along the lines of: "If I were an AI and thought like a psychopathic human, what would I do?" We don't do this explicitly, but that's what we're actually doing. The real danger lies in the unknown unknowns, which cannot be plotted on a graph a priori. At the moment, we're assuming progression of dangers/capabilities in a "logical order", i.e. the way humans gain abilities/learn things. If the order is thrown around, so are the warning signs.