DeepMind's "Frontier Safety Framework" is weak and unambitious

By Zach Stein-Perlman @ 2024-05-18T03:00 (+54)

FSF blogpost. Full document (just 6 pages; you should read it). Compare to Anthropic's RSP, OpenAI's RSP ("Preparedness Framework"), and METR's Key Components of an RSP.

Google DeepMind's FSF has three steps:

  1. Create model evals for warning signs of "Critical Capability Levels"
    1. Evals should have a "safety buffer" of at least 6x effective compute so that CCLs will not be reached between evals
    2. They list 7 CCLs across "Autonomy, Biosecurity, Cybersecurity, and Machine Learning R&D," and they're thinking about CBRN
      1. E.g. "Autonomy level 1: Capable of expanding its effective capacity in the world by autonomously acquiring resources and using them to run and sustain additional copies of itself on hardware it rents"
  2. Do model evals every 6x effective compute and every 3 months of fine-tuning (see the sketch after this list)
    1. This is an "aim," not a commitment
    2. Nothing about evals during deployment [update[1]]
  3. "When a model reaches evaluation thresholds (i.e. passes a set of early warning evaluations), we will formulate a response plan based on the analysis of the CCL and evaluation results. We will also take into account considerations such as additional risks flagged by the review and the deployment context." The document briefly describes 5 levels of security mitigations and 4 levels of deployment mitigations.
    1. The mitigations aren't yet connected to eval results or other triggers; there are no advance commitments about safety practices
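To make the cadence in step 2 concrete, here is a minimal sketch of the two re-evaluation triggers. This is not DeepMind's code: the function and variable names are hypothetical, and it assumes "effective compute" is tracked as a multiplier relative to the last-evaluated checkpoint.

```python
from datetime import date, timedelta

# Hypothetical illustration of the FSF's stated cadence: re-run early-warning
# evals after every 6x increase in effective compute and after every 3 months
# of fine-tuning. The 6x "safety buffer" in step 1 is what's supposed to make
# this cadence adequate for catching a CCL before it's crossed.

EFFECTIVE_COMPUTE_FACTOR = 6.0             # "every 6x effective compute"
FINE_TUNING_INTERVAL = timedelta(days=90)  # "every 3 months of fine-tuning"

def eval_due(current_effective_compute: float,
             effective_compute_at_last_eval: float,
             today: date,
             last_eval_date: date) -> bool:
    """Return True if either trigger for re-running the evals fires."""
    compute_trigger = (current_effective_compute
                       >= EFFECTIVE_COMPUTE_FACTOR * effective_compute_at_last_eval)
    time_trigger = (today - last_eval_date) >= FINE_TUNING_INTERVAL
    return compute_trigger or time_trigger

# Example: a model at 7x the effective compute of its last-evaluated checkpoint
# is due for evals even though only a month of fine-tuning has elapsed.
print(eval_due(7.0, 1.0, date(2024, 6, 1), date(2024, 5, 1)))  # True
```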

The FSF doesn't contain commitments. The blogpost says "The Framework is exploratory and we expect it to evolve significantly" and "We aim to have this initial framework fully implemented by early 2025." The document says similar things. It uses the word "aim" a lot and the word "commit" never. The FSF basically just explains a little about DeepMind's plans on dangerous capability evals. Those details do seem reasonable. (This is unsurprising given their good dangerous capability evals paper two months ago, but it's good to hear about evals in a DeepMind blogpost rather than just a paper by the safety team.)

(Ideally companies would both make hard commitments and talk about what they expect to do, clearly distinguishing between these two kinds of statements. Talking about plans like this is helpful. But with no commitments, DeepMind shouldn't get much credit.)

(The FSF is not precise enough to be something DeepMind could commit to; it could commit to running the model evals regularly, but the document doesn't specify mitigations as a function of risk-assessment results.[2])


Misc notes (but you should really read the doc yourself):


Maybe this document was rushed because DeepMind wanted to get something out before the AI Seoul Summit next week. I've heard that the safety team has better and more detailed plans. Hopefully some of those get published in DeepMind's voice (e.g. posted on the DeepMind blog or stated publicly by DeepMind leadership) soon. Hopefully the bottleneck is polishing those plans, not weakening them to overcome a veto from DeepMind leadership.


A brief reminder of how other labs are doing on RSPs (I feel very comfortable with these claims, but I omit justification, and there isn't consensus on them):

  1. ^

    Update: a DeepMind senior staff member says the 3-month condition includes during deployment. Yay.

  2. ^

    But it says they plan to: "As we better understand the risks posed by models at different CCLs, and the contexts in which our models will be deployed, we will develop mitigation plans that map the CCLs to the security and deployment levels described." But maybe only after the thresholds are crossed: "When a model reaches evaluation thresholds (i.e. passes a set of early warning evaluations), we will formulate a response plan."

  3. ^

    Update: a DeepMind senior staff member says "deployment" means external deployment.

  4. ^

    The full sentence doesn't parse: "We are exploring internal policies around alerting relevant stakeholder bodies when, for example, evaluation thresholds are met, and in some cases mitigation plans as well as post-mitigation outcomes." What about mitigation plans?

  5. ^

    See Frontier Model Security. But Anthropic hasn't announced that it has successfully implemented this.


SummaryBot @ 2024-05-20T20:29 (+1)

Executive summary: DeepMind's "Frontier Safety Framework" for AI development is a step in the right direction but lacks ambition, specificity, and firm commitments compared to other labs' responsible scaling plans.

Key points:

  1. The Frontier Safety Framework (FSF) involves evaluating models for dangerous capabilities at regular intervals, but the details are vague and not committed to.
  2. The FSF discusses potential security and deployment mitigations based on risk assessments, but does not specify triggers or make advance commitments.
  3. DeepMind's security practices seem behind other labs, e.g. allowing unilateral access to model weights at most levels.
  4. The FSF's capability thresholds for concern ("Critical Capability Levels") seem quite high.
  5. Compared to Anthropic's, OpenAI's, and Microsoft's responsible scaling plans, DeepMind's FSF is less ambitious, less specific, and involves fewer firm commitments. Meta has no public plan.
  6. The FSF may have been rushed out and the DeepMind safety team likely has better, more detailed (but unpublished) plans.

 

 

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.