RSPs are pauses done right

By evhub @ 2023-10-14T04:06 (+93)

This is a crosspost from LessWrong; the full post can be read there.

aaguirre @ 2023-10-14T19:27 (+42)

The important things about a pause, as envisaged in the FLI letter, for example, are that (a) it actually happens, and (b) the pause is not lifted until there is an affirmative demonstration that the risk has been addressed. The FLI pause call was not, in my view, on the basis of any particular capability or risk, but because of the out-of-control race to run ever-larger scaling experiments without any reasonable safety assurances. This pause should still happen, and it should not be lifted until there is a way in place to assure that safety. Many of the things FLI hoped could happen during the pause are happening — there is huge activity in the policy space developing standards, governance, and potentially regulations. It's just that now those efforts are racing the un-paused technology.

In the case of "responsible scaling" (for which I think "controlled scaling" or "safety-first scaling" would be better names), what I think is very important is that there not be a presumption that the pause will be temporary, and lifted "once" the right mitigations are in place. We may well hit a point (and may be there now) where it is pretty clear that we don't know how to mitigate the risks of the next generation of systems we are building (and it may not even be possible), and new, bigger ones should not be built until we can do so. An individual company pausing "until" it believes things are safe is subject to the exact same competitive pressures that are driving scaling now — both against pausing, and in favor of lifting a pause as quickly as possible. If the limitations on scaling instead come from the outside, via regulation or oversight, then we can ask for something stronger: that, before proceeding, companies show those outside organizations that scaling is safe. The pause should not be lifted until or unless that is possible. And that's what the FLI pause letter asks for.

Akash @ 2023-10-14T23:38 (+32)

Copying this comment over from the LessWrong version. Note that Evan and others have responded to it here.

Thanks for writing this, Evan! I think it's the clearest writeup of RSPs & their theory of change so far. However, I remain pretty disappointed in the RSP approach and the comms/advocacy around it.

I plan to write up more opinions about RSPs, but one I'll express for now is that I'm pretty worried that the RSP dialogue is suffering from motte-and-bailey dynamics. One of my core fears is that policymakers will walk away with a misleadingly positive impression of RSPs. I'll detail this below:

What would a good RSP look like?

What do RSPs actually look like right now?

Important note: I think several of these limitations are inherent to the current gameboard. Like, I'm not saying "I think it's a bad move for Anthropic to admit that they'll have to break their RSP if some Bad Actor is about to cause a catastrophe." That seems like the right call. I'm also not saying that dangerous capability evals are bad-- I think it's a good bet for some people to be developing them.

Why I'm disappointed with current comms around RSPs

Instead, my central disappointment comes from how RSPs are being communicated. It seems to me like the main three RSP posts (ARC's, Anthropic's, and yours) are (perhaps unintentionally?) painting an overly optimistic portrayal of RSPs. I don't expect policymakers who engage with the public comms to walk away with an appreciation for the limitations of RSPs, their current level of vagueness + "we'll figure things out later"-ness, etc.

On top of that, the posts seem to have this "don't listen to the people who are pushing for stronger asks like moratoriums-- instead, please let us keep scaling and trust industry to find the pragmatic middle ground" vibe. To me, this seems not only counterproductive but also unnecessarily adversarial. I would be more sympathetic to the RSP approach if it was more like "well yes, we totally think it'd be great to have a moratorium or a global compute cap or a kill switch or a federal agency monitoring risks or a licensing regime, and we also think this RSP thing might be kinda nice in the meantime." Instead, ARC explicitly tries to paint the moratorium folks as "extreme".

(There's also an underlying thing here where I'm like "the odds of achieving a moratorium, or a licensing regime, or hardware monitoring, or an agency that monitors risks and has emergency powers—the odds of meaningful policy getting implemented—are not independent of our actions." The more that groups like Anthropic and ARC claim "oh, that's not realistic", the less realistic those proposals become. I think people are also wildly underestimating the degree to which Overton Windows can change and the amount of uncertainty there currently is among policymakers, but that's a post for another day, perhaps.)

I'll conclude by noting that some people have gone as far as to say that RSPs are intentionally trying to dilute the policy conversation. I'm not yet convinced this is the case, and I really hope it's not. But I'd really like to see more coming out of ARC, Anthropic, and other RSP supporters to earn the trust of people who are (IMO reasonably) suspicious when scaling labs come out and say "hey, you know what the policy response should be? Let us keep scaling, and trust us to figure it out over time, but we'll brand it as this nice catchy thing called Responsible Scaling."

David Krueger @ 2023-11-22T23:11 (+11)

"With the use of fine-tuning, and a bunch of careful engineering work, capabilities evaluations can be done reliably and robustly."  

I strongly disagree with this (and with the title of the piece). I've been having these arguments a lot recently, and I think these sorts of claims are emblematic of a dangerously narrow view of the problem of AI x-safety, which I am disappointed to see seems quite popular.
 
A few reasons why this statement is misleading:
* New capabilities elicitation techniques arrive frequently and unpredictably (e.g., chain of thought).
* The capabilities of a system can be much greater than those of any particular LLM involved in that system (think tool use and coding). On the current trajectory, LLMs will be increasingly integrated into complex socio-technical systems. The outcomes are unpredictable, but such systems will likely exhibit capabilities significantly beyond what can be predicted from evaluations of the models alone.

You can try to account for the fact that you're competing against the entire world's ingenuity by leveraging your privileged access (e.g., for fine-tuning or white-box capabilities elicitation methods), but this is unlikely to provide sufficient coverage.
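
To make the gap concrete, here is a minimal, self-contained sketch in Python. All scores, strategy names, and the threshold are invented for illustration; this is not any lab's actual eval harness. The point is that an evaluation can only take the max over the elicitation strategies known when it runs, so its result is a lower bound that a later technique can exceed:

```python
# Toy illustration of the elicitation gap (all numbers invented).
# A capability eval can only maximize over the elicitation strategies
# known at evaluation time, so it yields a *lower bound* on what the
# model can later be made to do.

# Hypothetical scores for one model on one dangerous-capability task,
# under the strategies the evaluators knew about.
scores_at_eval_time = {
    "direct_prompt": 0.31,
    "few_shot": 0.38,
    "fine_tuned": 0.52,
}

# A strategy discovered only after the eval ran (cf. chain of thought,
# tool use, agent scaffolds): the harness never saw it.
scores_discovered_later = {"new_scaffold": 0.74}

DANGER_THRESHOLD = 0.60  # invented pass/fail line

measured = max(scores_at_eval_time.values())
ceiling = max({**scores_at_eval_time, **scores_discovered_later}.values())

print(f"measured at eval time: {measured:.2f} -> "
      f"{'passes' if measured < DANGER_THRESHOLD else 'fails'} the eval")
print(f"post-hoc ceiling:      {ceiling:.2f} -> "
      f"{'passes' if ceiling < DANGER_THRESHOLD else 'fails'} the eval")
```

The gap between the two maxima is exactly what privileged access (fine-tuning, white-box methods) tries to shrink, and what this comment argues cannot be bounded in advance.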

EtA: Understanding whether and to what extent the original claim is true is something that would likely require years of research at a minimum. 

evhub @ 2023-11-23T01:33 (+2)

I think this is a very good point, and it definitely gives me some pause—and probably my original statement there was too strong. Certainly I agree that you need to do evaluations using the best possible scaffolding that you have, but overall my sense is that this problem is not that bad. Some reasons to think that:

  • At least currently, scaffolding-related performance improvements don't seem to generally be that large (e.g. chain-of-thought is just not that helpful on most tasks), especially relative to the gains from scaling.
  • You can evaluate pretty directly for the sorts of capabilities that would help make scaffolding way better, like the model being able to correct its own errors, so you don't have to just evaluate the whole system + scaffolding end-to-end.
  • This is mostly just a problem for large-scale model deployments. If you instead keep your largest model mostly in-house for alignment research, or only give it to a small number of external partners whose scaffolding you can directly evaluate, it makes this problem way less bad.

That last point is probably the most important here, since it demonstrates that you easily can (and should) absorb this sort of concern into an RSP. For example, you could set a capabilities threshold for models' ability to do self-correction, and once your models pass that threshold you restrict deployment except in contexts where you can directly evaluate the relevant scaffolding that will be used in advance.
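
As one way to picture that last point, here is a minimal sketch of such a deployment gate. Every name, score, and threshold below is a hypothetical illustration, not Anthropic's actual RSP mechanics:

```python
# Minimal sketch of an RSP-style deployment gate keyed on a
# self-correction capability trigger. All names, scores, and thresholds
# are hypothetical illustrations, not any lab's real policy.

from dataclasses import dataclass

SELF_CORRECTION_TRIGGER = 0.50  # invented threshold


@dataclass
class EvalResult:
    self_correction_score: float  # how well the model fixes its own errors
    scaffolding_audited: bool     # was the deployment scaffolding evaluated in advance?


def deployment_decision(result: EvalResult) -> str:
    if result.self_correction_score < SELF_CORRECTION_TRIGGER:
        # Below the trigger, scaffolding gains are presumed small, so the
        # policy permits broad deployment.
        return "deploy broadly"
    if result.scaffolding_audited:
        # Above the trigger, deployment is restricted to contexts whose
        # scaffolding was directly evaluated (in-house, vetted partners).
        return "deploy only with audited scaffolding"
    return "hold back; keep in-house for alignment research"


print(deployment_decision(EvalResult(0.30, scaffolding_audited=False)))
print(deployment_decision(EvalResult(0.70, scaffolding_audited=True)))
print(deployment_decision(EvalResult(0.70, scaffolding_audited=False)))
```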

Greg_Colbourn @ 2023-10-15T11:48 (+3)

"If you actually want a full stop to happen, I think the best way to make that happen is still going to look like my story above, just with RSP thresholds that are essentially impossible to meet."

Perhaps. I could get on board with that in the event that the RSP paradigm proves sticky. We are already past the thresholds at which we should be stopping further AGI development. The fire alarm has been ringing for months already (or longer). I fully agree with aaguirre.

Greg_Colbourn @ 2023-10-15T11:43 (+3)

"Labs... don't know how to respond to nebulous pause advocacy because it isn't clearly asking for any particular policy (since nobody actually likes and is advocating for the six month pause proposal)"

This is really quite enraging to read. Stop building bigger AIs! It's that simple. The rest of the details regarding whether, when, and how to restart can be worked out later.

AI labs saying they don't know how to respond here is like fossil fuel companies saying they don't know what they can do to mitigate climate change. It sounds as if actually stopping is so inconceivable that the response is to come up with complicated frameworks that sound like they might (eventually) lead to stopping, but are in fact doing everything they can to allow the companies to continue business as usual.

Adam_Scholl @ 2023-10-17T12:16 (+9)

Yeah, Dario pretty explicitly describes liking RSPs in part because they minimally constrain continued scaling:

"I mean one way to think about it is like the responsible scaling plan doesn't slow you down except where it's absolutely necessary. It only slows you down where it's like there's a critical danger in this specific place, with this specific type of model, therefore you need to slow down." (Logan Bartlett interview, h/t Joe_Collman).