AI companies aren't really using external evaluators

By Zach Stein-Perlman @ 2024-05-26T19:05 (+88)

From my new blog: AI Lab Watch. All posts will be crossposted to LessWrong. Subscribe on Substack.

Many AI safety folks think that METR is close to the labs, with ongoing relationships that grant it access to models before they are deployed. This is incorrect. METR (then called ARC Evals) did pre-deployment evaluation for GPT-4 and Claude 2 in the first half of 2023, but it seems to have had no special access since then.[1] Other model evaluators also seem to have little access before deployment.

Clarification: there are many kinds of audits. This post is about model evals for dangerous capabilities. But I'm not aware of the labs using other kinds of audits to prevent extreme risks, excluding normal security/compliance audits.


Frontier AI labs' pre-deployment risk assessment should involve external model evals for dangerous capabilities.[2] External evals can improve a lab's risk assessment and—if the evaluator can publish its results—provide public accountability.

The evaluator should get deeper access than users will get.

The costs of using external evaluators are unclear.

Independent organizations that do model evals for dangerous capabilities include METR, the UK AI Safety Institute (UK AISI), and Apollo. Based on public information, there's only one recent instance of a lab giving access to an evaluator pre-deployment—Google DeepMind sharing with UK AISI—and that sharing was minimal (see below).

What the labs say they're doing on external evals before deployment:


Related miscellanea:

External red-teaming is not external model evaluation. External red-teaming generally involves sharing the model with several people with expertise relevant to a dangerous capability (e.g. bioengineering), who open-endedly try to elicit dangerous model behavior for ~10 hours each. External model evals involve sharing the model with a team of experts at eliciting capabilities, who run somewhat automated and standardized eval suites that they've spent ~10,000 hours developing.

Labs' commitments to share pre-deployment access with UK AISI are unclear.[5]

This post is about sharing model access before deployment for risk assessment. Labs should also share deeper access with safety researchers (during deployment). For example, some safety researchers would really benefit from being able to fine-tune GPT-4, Claude 3 Opus, or Gemini, and my impression is that the labs could easily give safety researchers fine-tuning access. More speculatively, interpretability researchers could send a lab code and the lab could run it on private models and send the results to the researchers, achieving some benefits of releasing weights with much less downside.[6]
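
Purely as a hypothetical sketch of that last idea (no lab offers anything like this, and every name below is invented), the "researchers send code in, results come out, weights never leave the lab" arrangement might look like:

```python
# Hypothetical sketch only: no lab exposes an API like this, and every name
# below is made up. It illustrates the idea from the paragraph above:
# researchers submit code, the lab runs it against private weights, and only
# the outputs come back.

from dataclasses import dataclass, field


@dataclass
class JobResult:
    """What the researcher gets back: outputs only, never model weights."""
    stdout: str
    artifacts: dict = field(default_factory=dict)  # e.g. activation statistics


class PrivateModelRunner:
    """Lab-side sandbox that runs submitted code against a private model."""

    def __init__(self, model_name: str):
        # Inside the lab this name would resolve to private weights;
        # the researcher only ever sees the string.
        self.model_name = model_name

    def run(self, researcher_code: str, gpu_hour_budget: float) -> JobResult:
        # In a real system this would execute `researcher_code` in a sandbox
        # with the loaded model available but no way to export its weights,
        # returning only whitelisted result objects.
        raise NotImplementedError("placeholder: runs on lab infrastructure only")


# Researcher-side usage (illustrative):
runner = PrivateModelRunner(model_name="private-frontier-model")
# result = runner.run(open("probe_activations.py").read(), gpu_hour_budget=4.0)
```

The point of this shape is that the weights stay inside the lab's sandbox and only whitelisted outputs cross the boundary, which is the sense in which it captures some benefits of releasing weights with much less downside.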

Everything in this post applies to external deployment. It will also be important to do some evals during training and before internal deployment, since lots of risk might come from weights being stolen or the lab using AIs internally to do AI development.

Labs could be bound by external evals, such that they won't deploy a model until a particular eval says it's safe. This seems unlikely to happen (for actually meaningful evals) except by regulation. (I don't believe any existing evals would be great to force onto the labs, but if governments were interested, evals organizations could focus on creating such evals.)


Thanks to Buck Shlegeris, Eli Lifland, Gabriel Mukobi, and an anonymous human for suggestions. They don't necessarily endorse this post.

Subscribe on Substack.

  1. ^

    METR's homepage says:

    We have previously worked with Anthropic, OpenAI, and other companies to pilot some informal pre-deployment evaluation procedures. These companies have also given us some kinds of non-public access and provided compute credits to support evaluation research.

    We think it’s important for there to be third-party evaluators with formal arrangements and access commitments - both for evaluating new frontier models before they are scaled up or deployed, and for conducting research to improve evaluations.

    We do not yet have such arrangements, but we are excited about taking more steps in this direction.

  2. ^
  3. ^

    Idea: when sharing a model for external evals or red-teaming, for each mitigation (e.g. harmlessness fine-tuning or filters), either disable it or make it an explicit part of the safety case for the model. That is, either claim "users can't effectively jailbreak the model given the deployment protocol" or disable the mitigation. Otherwise the lab is just stopping the bioengineering red-teamers from eliciting capabilities with mitigations that won't work against sophisticated malicious users.

  4. ^

    A previous version of this post omitted discussion of external testing of Gemini 1.5 Pro. Thanks to Mary Phuong for pointing out this error.

  5. ^

    Politico and UK government press releases report that AI labs committed to share pre-deployment access with UK AISI. I suspect they are mistaken and that these claims trace back to the UK AI Safety Summit "safety testing" session, which is devoid of specific commitments. I am confused about why the labs have not clarified their commitments and practices.

  6. ^

Toby Tremlett @ 2024-05-30T12:41 (+17)

I'm curating this post. 
I wrote a draft for a feature on a Politico piece for the EA newsletter, exploring this same question: are big labs following through on verbal commitments to share their models with external evaluators? Despite taking "several months" to speak with experts, the Politico piece didn't have as much useful information as this blog post. I cut the feature because I couldn't find as much information in the time I had.
I think work like this is really valuable, filling a serious gap in our understanding of AI safety. Thanks for writing this, Zach!

phgubbins @ 2024-06-04T19:05 (+5)

Cross-commenting from LessWrong for future reference:

I had an opportunity to ask an individual from one of the mentioned labs about plans to use external evaluators and they said something along the lines of:

“External evaluators are very slow - we are just far better at eliciting capabilities from our models.”

They had earlier said something to much the same effect when I asked whether they'd been surprised by anything people had used deployed LLMs for so far, 'in the wild': essentially no, not really, maybe even a bit underwhelmed.

SummaryBot @ 2024-05-27T13:32 (+1)

Executive summary: Despite claims, AI companies are not consistently using external evaluators to assess their models for dangerous capabilities before deployment, which could improve risk assessment and provide public accountability.

Key points:

  1. External evaluators like METR, UK AISI, and Apollo are not being given sufficient pre-deployment access to models to effectively evaluate risks.
  2. Evaluators should be given deeper access than end users, including versions without safety filters or fine-tuning, to better assess threats from both deployment and potential leaks.
  3. The costs and challenges of providing external evaluator access are unclear, but likely surmountable.
  4. Some AI labs have committed to external "red teaming" but this is not a substitute for comprehensive model evaluation by dedicated experts.
  5. AI labs should also provide deeper model access to safety researchers to support their work, potentially without fully releasing weights.

 

 

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.