Evals projects I'd like to see, and a call to apply to OP's evals RFP

By cb @ 2025-03-25T11:50 (+19)

I'm advertising this RFP in my professional capacity, but the project ideas below are my personal views: my colleagues may not endorse them. 

There's just one week left before applications for our RFP on Improving Capability Evaluations close! The rest of this post is a call to apply, and some concrete suggestions of projects I'd be excited to see.

If you're feeling inspired already, then:

Apply now

Areas I'm most excited about

We've received many strong proposals around building new GCR-relevant benchmarks (which is great!), but there's still lots of low-hanging fruit in two key areas:

  1. Third-party access & security solutions (I'm most excited about this)
  2. Science of evaluations & capabilities development

Below I briefly motivate these areas and give some example projects. For more details, check out their sections in our RFP (security/access, science of evals). 

Third-party access & security solutions

As AI companies move to stricter security levels, meaningful independent evaluation will become increasingly difficult. I don't think we're prepared for this: not only do we lack good plans for mitigating the security risks of giving third parties access to proprietary information or models, we don't yet know what we should be asking of companies.

There's scope here for both technical and non-technical projects, from thinking clearly about evaluation regimes or information-sharing commitments, to work that enables secure, low-trust third-party evaluations and red-teaming.

Concrete projects I'd love to see:

1. A thorough analysis of what info/access evaluators need to make different safety assessments

A project mapping out what minimum information about model training evaluators need to assess safety cases, what levels of model access enable what kinds of evaluations, and how the security costs of sharing this information can be mitigated. 

An ideal version would spell out the strongest possible safety guarantees third parties can make with different levels of information/model access, identify the costs of increasing transparency or access, and propose arrangements that maximise the strength of safety assurances third parties can make, for a given cost.

2. Technical solutions for verifiable model auditing

Work on ZKPs, secure enclaves, confidential computing, or other approaches that could let evaluators verify model properties without compromising model security.

I'm excited about methods for:

  1. Trustless model evaluations without revealing model architecture
  2. Verifying model identity (i.e., which model you're running inference on) without weights access
  3. Providing >API level access to third parties, ideally without revealing model weights
  4. Enabling fine-tuning for third parties
  5. Obfuscating queries/outputs from the model provider, where appropriate

3. Comparing different evaluation governance frameworks

I'd love to see a careful analysis comparing different approaches to evaluations, e.g. lab self-evaluation vs. government-run evaluations vs. trusted third-party approaches. Some useful aims might include understanding how might each approach fail, their tradeoffs, and the kind of legal and regulatory protections that would enable them.

4. Transparency standards for evaluation conditions

Work on understanding the most important information for evaluators to report and standardising how evaluation conditions are reported, including e.g.:

Science of evaluations & capabilities development

Current evaluations are more like "model whispering" than science. Results are significantly affected by prompting, scaffolding, tool access, question format, inference compute spend, and choice of test conditions.

I'd love to see work that helps make interpreting evals results more precise and reliable, or at least establishes the difficulty of making certain kinds of claims.

Concrete projects I'd love to see:

1. Quantifying post-training variables and their effects

A project systematically measuring how different post-training enhancements (factors like scaffolding effort, inference compute, fine-tuning strategies) affect evaluation results, and ideally developing scaling laws for model performance given changes in these factors. 

2. Password-locked models

Instead of trying to uncover maximal abilities directly, hide the model's current capabilities and then try to uncover them using various elicitation techniques. This establishes ground truth for studying and comparing elicitation methods.

3. Statistical methodology for agent benchmarks

Extending statistical best practices from QA to agent evaluations.

4. Measuring and narrowing the elicitation gap

The gap between the capabilities leading AI companies can elicit versus those that third parties can is significant. We need better approaches to measure and narrow this gap, especially as it relates to dangerous capabilities.

5. Understanding when to (re)evaluate models

Models improve through not just new training runs but also post-training enhancements, which are cheap and can happen ~continuously. We need better heuristics for when advances in post-training techniques should trigger re-evaluation of models.

Some other ideas I'm excited about

This isn't exhaustive!

The above list is a subset of projects  I'm particularly excited about. There are many other valuable projects not mentioned here, and great ideas I haven't even thought of! For nore details, check out the full RFP. For other project ideas, you may also enjoy reading Marius' list of concrete problems in evals.

If you've been considering applying to our RFP, now's the time! The initial Expression of Interest takes ≤1 hour, and we can refine promising ideas together during the full proposal stage.

Deadline: April 1st, 2025.

Apply now
Chris Leong @ 2025-03-25T15:36 (+4)

One thing I'd be much more excited about seeing rather than "quantifying post-training variables and their effects" (but which I'm not planning to pursue) would be to take an old model and then to try to map post-training enhancements discovered over time and see how the maximum elicitable capabilities change.

I'm worried that quantifying post-training variables directly has significant capabilities externalities and that there's no obvious limit to how far post-training can be pushed.

cb @ 2025-03-25T16:40 (+3)

I'd also be excited about projects aiming to do this.

One advantage that quantifying post-training variables on frontier models has over this idea is that you also get a better sense of what the upper bound of performance on some eval looks like, as well as some information about the returns from investing in post-training enhancements. I think if this were done responsibly on some well-chosen evals, it'd be helpful information to have. (Though my colleagues may disagree.)

If people outside of frontier labs were working on this, I'd be surprised if it significantly accelerated capabilities, though I can imagine it still making sense to keep the methodology private.