OpenAI: Preparedness framework

By Zach Stein-Perlman @ 2023-12-18T18:30 (+24)

This is a linkpost to https://openai.com/safety/preparedness

OpenAI released a beta version of their responsible scaling policy (though they don't call it that). See summary page, full doc, OpenAI twitter thread, and Jan Leike twitter thread [edit: and Zvi commentary]. Compare to Anthropic's RSP and METR's Key Components of an RSP.

It's not done, so it's too early to celebrate, but based on this document I expect to be happy with the finished version. I think today is a good day for AI safety.

[Edit, one day later: the structure seems good, but I'm very concerned that the thresholds for High and Critical risk in each category are way too high, such that e.g. a system could very plausibly kill everyone without reaching Critical in any category. See pp. 8–11. If so, that's a fatal flaw for a framework like this. I'm interested in counterarguments; for now, praise mostly retracted; oops. I still prefer this to no RSP-y-thing, but I was expecting something stronger from OpenAI. I really hope they lower thresholds for the finished version of this framework.]


My high-level take: RSP-y things are good.

 

OpenAI's basic framework:

  1. Do dangerous capability evals at least every 2x increase in effective training compute. This involves fine-tuning for dangerous capabilities, then doing evals on pre-mitigation and post-mitigation versions of the fine-tuned model. Score the models as Low, Medium, High, or Critical in each of several categories.
    1. Initial categories: cybersecurity, CBRN (chemical, biological, radiological, nuclear threats), persuasion, and model autonomy.
  2. If the post-mitigation model scores High in any category, don't deploy it until implementing mitigations such that it drops to Medium.
  3. If the post-mitigation model scores Critical in any category, stop developing it until implementing mitigations such that it drops to High.
  4. If the pre-mitigation model scores High in any category, harden security to prevent exfiltration of model weights. (Details basically unspecified for now.)

 

Random notes:


Misc remarks added later: