Model evals for dangerous capabilities

By Zach Stein-Perlman @ 2024-09-23T11:00 (+19)

Testing an LM system for dangerous capabilities is crucial for assessing its risks.

Summary of best practices

Best practices for labs evaluating LM systems for dangerous capabilities:

Using easier evals or weak elicitation is fine, as long as the developer will be running the full eval well by the time the model crosses danger thresholds.

These recommendations aren't novel — they just haven't been collected before.

How labs are doing

I made a table: DC evals: labs' practices.

This post basically doesn't consider some crucial factors in labs' evals, especially the quality of particular evals, nor some adjacent factors, especially capability thresholds and planned responses to reaching capability thresholds. One good eval is better than lots of bad evals but I haven't figured out which evals are good. What's missing from this post—especially which evals are good, plus illegible elicitation stuff—may well be more important than the desiderata this post captures.

DeepMind > OpenAI ≥ Anthropic >>> Meta >> everyone else. The evals reports published by DeepMind, OpenAI, and Anthropic are similar. I only weakly endorse this ranking, and reasonable people disagree with it.

DeepMind, OpenAI, and Anthropic all have decent coverage of threat models; DeepMind is the best. They all do fine on sharing their methodology and reporting their results; DeepMind is the best. They're all doing poorly on sharing with third-party evaluators before deployment; DeepMind is the worst. (They're also all doing poorly on setting capability thresholds and planning responses, but that's out-of-scope for this post.)

My main asks on evals for the three labs are similar: improve the quality of their evals, improve their elicitation, be more transparent about eval methodology, and share better access with a few important third-party evaluators. For OpenAI and Anthropic, I'm particularly interested in them developing (or adopting others') evals for scheming or situational awareness capabilities, or committing to let Apollo evaluate their models and incorporate that into their risk assessment process.

(Meta does some evals but has limited scope, fails to share some details on methodology and results, and has very poor elicitation. Microsoft has a "deployment safety board" but it seems likely to be ineffective, and Microsoft doesn't seem to have a plan to do evals. xAI, Amazon, and others seem to be doing nothing.)

Appendix: misc notes on best practices

This section is composed of independent paragraphs with various notes on best practices.

Publishing evals and explaining methodology has several benefits: it lets external observers check whether your evals are good, lets them suggest improvements, lets other labs adopt your evals, and can boost the general science of evals. The downsides are that evaluations could contain dangerous information—especially CBRN evals—and that publishing the evals could cause solutions to appear in future models' training data. When the downsides are relevant, labs can achieve lots of the benefits with little of the downsides by sharing evals privately (with other labs, the government, and relevant external evaluators) and perhaps publishing a small semirandom subset of evals with transcripts. Optionally, labs can use the METR standard or Inspect. Publishing eval results informs observers about your model's dangerous capabilities and provides accountability for interpreting results well and responding well. If you publish enough details, others can run your evals and compare to your scores for risk-assessment and sanity-checking purposes, in addition to noticing issues and helping you improve your evals. For human experiments: publish methodology details (such that a third party could basically replicate it, modulo access to model versions, posttraining, or scaffolding).

=====

Elicitation:

=====

Fix spurious failures: look at transcripts to identify why the agent fails and whether the failures are due to easily fixable mistakes or issues with the task or infrastructure. If so, fix them, or at least quantify how common they are. Example.

Separately from looking at the transcripts, fixing issues, and summarizing findings, ideally the labs would publish (a random subset of) transcripts.
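
As a rough illustration of "quantify how common they are," here is a minimal sketch that tallies reviewer labels on failed runs. The `Run` record, the `spurious_cause` field, and the label names are all hypothetical; they assume a human has already read each transcript and tagged failures that look like infrastructure or task problems rather than real capability limits. Excluding spurious runs is only a crude adjustment, not a substitute for fixing the issues and rerunning.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Run:
    """One eval attempt, labeled by a reviewer after reading the transcript."""
    task_id: str
    succeeded: bool
    # Hypothetical label such as "tool_timeout"; None means the run
    # succeeded or the failure looks like a genuine capability limit.
    spurious_cause: Optional[str] = None


def summarize(runs: list[Run]) -> dict:
    """Report the raw success rate and how much of the failure is spurious."""
    total = len(runs)
    successes = sum(r.succeeded for r in runs)
    failures = [r for r in runs if not r.succeeded]
    spurious = [r for r in failures if r.spurious_cause is not None]
    valid = total - len(spurious)  # runs left after dropping spurious failures
    return {
        "raw_success_rate": successes / total if total else 0.0,
        "spurious_fraction_of_failures": len(spurious) / len(failures) if failures else 0.0,
        "success_rate_excluding_spurious": successes / valid if valid else 0.0,
        "spurious_causes": dict(Counter(r.spurious_cause for r in spurious)),
    }
```

A report could then state both success rates side by side, along with the breakdown of causes, so readers can see how much the headline number depends on fixable problems.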

=====

Using easier evals or weak elicitation is sometimes fine, if done right.[1]
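
To make the yellow-line idea concrete, here is a minimal sketch of calibrating a threshold on an easy eval from paired scores. Everything in it is an assumption for illustration: the function names, the idea of calibrating against a low "cap" on the real eval, and the flat `margin` standing in for unpredictability. It only summarizes past observations; it does not guarantee anything about future models.

```python
def choose_yellow_line(
    pairs: list[tuple[float, float]],
    real_cap: float,
    margin: float,
) -> float:
    """Pick an easy-eval threshold k from (easy_score, real_score) pairs.

    k is chosen so that every observed model scoring below k on the easy
    eval scored below real_cap on the real eval. The margin lowers k to
    allow for the relationship between the two evals being unpredictable.
    """
    at_or_above_cap = [easy for easy, real in pairs if real >= real_cap]
    if not at_or_above_cap:
        raise ValueError("no observation reached real_cap; cannot calibrate from this data")
    return max(min(at_or_above_cap) - margin, 0.0)


def needs_full_eval(easy_score: float, yellow_line: float) -> bool:
    """True once the easy eval has triggered and the full eval should be run."""
    return easy_score >= yellow_line


# Hypothetical scores (in percent) for four past models.
pairs = [(10.0, 0.0), (35.0, 2.0), (60.0, 8.0), (80.0, 20.0)]
yellow_line = choose_yellow_line(pairs, real_cap=5.0, margin=15.0)
```

The less predictable the relationship between the two evals, the larger the margin has to be, which is the sense in which the yellow-line threshold must be set lower.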

=====

Third-party evals.

=====

Labs should use high-quality open-source evals, such as some DeepMind evals (especially self-reasoning and CTF); InterCode-CTF, Cybench, or other CTFs; maybe some OpenAI evals; and maybe METR autonomy evals (this does not substitute for offering access to METR). When a lab doesn't have an in-house eval for an area of capabilities, it can use others' evals; even when it does, using others' evals can improve its risk assessment and enable observers to understand the eval better and to predict future models' performance.

=====

Which models should a lab evaluate?

When should a lab run evals? Different threats call for evals at different times.
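
As one example of a cadence, DeepMind's FSF (quoted in the notes at the end of this post) aims to evaluate every 6x in effective compute and every 3 months of fine-tuning progress. The sketch below encodes that rule and one reading of why the safety buffer needs to exceed the interval between evals. The function and parameter names are hypothetical, and "effective compute" is treated as a single number, which is a simplification.

```python
def evals_due(
    effective_compute: float,
    compute_at_last_eval: float,
    months_finetuning_since_last_eval: float,
    compute_factor: float = 6.0,
    months_limit: float = 3.0,
) -> bool:
    """True if either trigger for rerunning evals has been reached."""
    compute_trigger = effective_compute >= compute_factor * compute_at_last_eval
    time_trigger = months_finetuning_since_last_eval >= months_limit
    return compute_trigger or time_trigger


def buffer_covers_interval(buffer_factor: float, eval_interval_factor: float = 6.0) -> bool:
    """Check that the early-warning buffer exceeds the gap between evals.

    If evals run every eval_interval_factor in effective compute, a model
    just under the early-warning trigger at one eval can be almost that
    factor further along at the next. So the distance from trigger to
    threshold (buffer_factor) should be strictly larger than the interval.
    """
    return buffer_factor > eval_interval_factor
```

Under this reading, a trigger that sits within 6x of the threshold fails the check when evals are also spaced 6x apart.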

=====

Labs may keep elicitation techniques secret to preserve their advantage, although sharing such information seems fine in terms of safety. For now this is moot: labs' elicitation in evals for dangerous capabilities seems quite basic and does not rely on secret powerful techniques.

=====

This post is about evals for dangerous capabilities. Some other kinds of evals are relevant to extreme risk too:

Appendix: misc notes on particular evals

This is very nonexhaustive, shallow (just based on labs' reports, not looking at particular tasks), and generally of limited insight. I'm writing it because I'm annoyed that when labs release or publish reports on their evals, nobody figures out whether the evals are good (or at least they don't tell me). Perhaps this section will inspire someone to write a better version of it.

Google DeepMind:

Evals sources: evals paper, evals repo, Frontier Safety Framework, Gemini 1.5.

Existing comments: Ryan Greenblatt.

Evals:

DeepMind scored performance on each eval 0-4 (except CBRN), but doesn't have predetermined thresholds, and at least some evals would saturate substantially below 4. DeepMind's FSF has high-level "Critical Capability Levels" (CCLs); they feel pretty high; they use different categories from the evals described above (they're in Autonomy, Biosecurity, Cybersecurity, and ML R&D).[2]

OpenAI:

Evals sources: o1 system card, Preparedness Framework, evals repo.

Evals:

OpenAI's PF has high-level "risk levels"; they feel pretty high; they are not operationalized in terms of low-level evals.

Anthropic:

Evals sources: RSP evals report – Claude 3 Opus, Responsible Scaling Policy.

Evals:

Meta:

Evals sources: Llama 3, CyberSecEval 3.

Evals:

Appendix: reading list

Sources on evals and elicitation:[3]

Sources on specific labs:


Thanks to several anonymous people for discussion and suggestions. They don't necessarily endorse this post.

Crossposted from AI Lab Watch. Subscribe on Substack.

  1. ^

    If a model is far from having a dangerous capability, full high-quality evals for that capability may be unnecessarily expensive. They may also be uninformative if the model scores close to zero and the eval struggles to detect small differences in the capability near the current level.

    In this case, the developer can use a strictly easier version of the eval. The easier version will trigger before the actual version so there is no risk, and the easier version could be cheaper and more informative.

    Or the developer could use a separate easier or cheaper yellow-line eval if its threshold is low enough that it is sure to trigger before the relevant threshold in the actual eval. Insofar as the relationship between performance on the two evals is unpredictable, the developer will have to set the yellow-line threshold lower. Hopefully in the future we will determine patterns in models' performance on different evals, and this will let us say we're quite sure that scoring <k% on an easy eval means you'll score <x% on the real eval for smaller x.

    (If the developer might be close to danger thresholds, then if it uses easier evals, the actual evals should be ready to go or the developer should be prepared to pause until doing them.)

    Similarly, weak elicitation can be fine if the developer uses a sufficiently large safety buffer. But the upper bound on the power of core elicitation techniques (fine-tuning, chain of thought, basic tooling) is very high, so the developer basically has to use them. And the decent elicitation to excellent elicitation gap can be large and unpredictable.

    (Ideal elicitation quality and safety buffer depend on the threat: misuse during intended deployment or the model being stolen, being leaked, or escaping. If the former, they also depend on users' depth of access and whether post-deployment correction is possible.)

  2. ^

     Misc notes:

    1. The FSF says:
      > we will define a set of evaluations called “early warning evaluations,” with a specific “pass” condition that flags when a CCL may be reached before the evaluations are run again. We are aiming to evaluate our models every 6x in effective compute and for every 3 months of fine-tuning progress. To account for the gap between rounds of evaluation, we will design early warning evaluations to give us an adequate safety buffer before a model reaches a CCL.
    2. The evals paper proposes a CCL for self-proliferation and tentatively suggests an early warning trigger. But this isn't in the FSF. And it says when a model meets this trigger, it is likely within 6x [] "effective compute" scaleup from the [] CCL, but a safety buffer should be almost certainly >6x effective compute from the CCL.
  3. ^

     This list may be bad. You can help by suggesting improvements.