AI Safety Evaluations: A Regulatory Review

By Elliot Mckernon, Deric Cheng, Convergence Analysis @ 2024-03-19T15:09 (+12)

This article is the second in a series of ~10 posts comprising a 2024 State of the AI Regulatory Landscape Review, conducted by the Governance Recommendations Research Program at Convergence Analysis. Each post will cover a specific domain of AI governance (e.g. incident reporting, safety evals, model registries, etc.). We’ll provide an overview of existing regulations, focusing on the US, EU, and China as the leading governmental bodies currently developing AI legislation. Additionally, we’ll discuss the relevant context behind each domain and conduct a short analysis.

This series is intended to be a primer for policymakers, researchers, and individuals seeking to develop a high-level overview of the current AI governance space. We’ll publish individual posts on our website and release a comprehensive report at the end of this series.

Let us know in the comments if this format is useful, if there are any topics you'd like us to cover, or if you spot any errors or omissions!

Context

Governments and researchers are eager to develop tools and techniques to evaluate AI. These include risk assessments that are common in industry regulation, but also techniques that are more unique to advanced AI, such as capability evaluations and alignment evaluations. 

In this section, we’ll define some terms and introduce some recent research on evaluating AI. Most existing AI regulation is yet to incorporate these new techniques, but many experts believe they’ll be a critical component of long-term safety (such as responsible scaling policies), and many regulatory proposals from experts include calls for specific assessment systems and requirements, which we’ll discuss shortly.

There are three main features of AI models that people are interested in evaluating:

Many AI safety advocates argue in favor of mandatory pre-deployment safety assessments of AI. That is, that developers cannot legally publish or deploy their models until they’ve robustly shown that their model is safe. Some also believe pre-deployment alignment assessments will be necessary, though alignment assessments are less well-developed.

Safety assessments are, understandably, the most commonly discussed in AI safety, and arguably have the strongest precedent in regulation. Legally mandated risk assessments are ubiquitous in many industries. For example, new drugs undergo rigorous clinical trials to demonstrate their efficacy and safety through the FDA in the US, the NMPA in China, and so on. As we’ll discuss later, new AI legislation does often include some kind of mandatory risk assessments, but generally these are loosely defined, and are unlikely to be sufficient to prevent dangerous AI from being deployed. 

This is because advanced AI models are especially difficult to robustly risk-assess. They’re uniquely flexible, extremely customizable, and undergo dramatic innovation frequently and unpredictably. Two different people with different aims and different skills could use GPT-4 to achieve wildly different outcomes. How can we assess a tool that can be used both to write an essay and, potentially, to generate instructions for constructing large-scale bioweapons

On the other hand, capability assessments and benchmarks are already well-developed and relatively widespread. For example, Google’s announcement of their Gemini model presents Gemini Ultra’s performance on multiple quantitative benchmarks, compared against GPT-4.

Evals

Some research organizations are developing AI evals, which are evaluations targeted specifically at assessing the safety, capability, and alignment of frontier AI models by interacting with them in a controlled environment and analyzing their response to different prompts (note that the term “eval” generally refers to this AI-specific kind of evaluation). To understand them, let’s look at a pilot example from METR, a research non-profit focused on designing safety evals and who recently collaborated with the UK’s AI taskforce on AI evaluation. 

The goal of the eval was to develop a methodology for assessing the safety of AI models by “assessing the capacity of LLM agents to acquire resources, create copies of themselves, and adapt to novel challenges”, capacities they referred to as “autonomous replication and adaptation,” or ARA.

The subjects were 4 agents consisting of a custom scaffolding program and an LLM (3 using OpenAI’s GPT-4 and 1 using Anthropic’s Claude). The scaffolding program provided prompts to the LLM describing the goal and available actions, then enacted the model’s plans within a virtual environment, for example by running code or controlling a browser. The researchers call these agents “representative of the kind of capabilities achievable with some moderate effort, using publicly available techniques”.

They developed 12 tasks to evaluate the agents’ capacity for ARA. The simplest required agents to search a filesystem for a stored password, and all 4 succeeded. A more complex task was to phish a university student by acquiring their login credentials. The agent was expected to design and host a fake website and email the student. None succeeded; one agent based their emails on research on a different person with the same name, and tried to recreate the Harvard login page by copying its HTML, leading to an unconvincing facsimile. 

The researchers concluded that the agents “were far from capable of ARA” and that “[without] access to fine-tuning, it is highly unlikely that casual users of these versions of GPT-4 or Claude could come close to the ARA threshold”. However, as the authors admit, these evals are not robust, and near-future agents with better scaffolding, fine-tuning, or larger models could perform much better at these tasks. 

Other researchers are also developing evals for capability and alignment. For example, alignment evals are part of Anthropic’s Constitutional AI strategy. For more on evals and their development and types, check out A starter guide for evals and We need a science of evals from researchers at Apollo Research. 

The field of AI evaluation has widespread support from experts. For example, in a 2023 survey of expert opinion, 98% of respondents “somewhat or strongly agreed” that “AGI labs should conduct pre-deployment risk assessments, dangerous capabilities evaluations, third-party model audits, safety restrictions on model usage, and red teaming.” 

However, though the field  is growing and advancing rapidly, it is new. There isn’t a consensus on the best approach, or how to apply these tools in law, or even on the terminology. For example, the developer Anthropic refers to deep safety evaluations as “audits”. As we’ll see shortly, current legislation doesn’t make much use of, or reference to, research on AI-specific evals. 

Current Regulatory Policies

Much proposed and existing AI governance includes risk assessments and evaluations, though not all are clear on precisely what assessments will be conducted, or by whom, or what would be considered acceptable risk, and so on. 

As noted above, AI-specific evals, such as those under development at METR and other research orgs, aren’t part of any major current legislation. They do appear in many proposals, which we’ll describe at the end of this section. For now, we’ll focus on summarizing the requirements for risk and model assessment in legislation from the US, China, EU, and UK. 

The US

The AI Bill of Rights states that automated systems should undergo pre-deployment testing, risk identification and mitigation, and ongoing safety monitoring. Tests should:

Crucially, the bill states that possible outcomes of these evaluations should include the possibility of not deploying or even removing a system, though it does not prescribe the conditions under which deployment should be disallowed.

The bill states that risk identification should focus on impact on people’s rights, opportunities, and access, as well as risks from purposeful misuse of the system. High-impact risks should receive proportionate attention. Further, automated systems should be designed to allow for independent evaluation, such as by researchers, journalists, third-party auditors and more. Evaluations are also required to assess algorithmic discrimination, which we'll discuss in another post.

The Executive Order on AI makes these principles more concrete, and also includes calls to develop better evaluation techniques. In summary, the EO calls for several new programs to provide AI developers with guidance, benchmarks, test beds, and other tools and requirements for evaluating the safety of AI, as well as requiring AI developers to share certain information with the government (such as the results of red-team tests). In particular:

The EO calls on individual government orgs and secretaries to provide one-off evaluations, such as:

China

China’s Interim Measures for the Management of Generative AI Services don’t include risk assessments or evaluations of AI models (though generative AI providers are responsible for harms rather than AI users, which may incentivise voluntary risk assessments). 

There are mandatory “security assessments”, but we haven’t been able to discover their content or standards.  In particular, these measures, plus both the 2021 regulations and 2022 rules for deep synthesis, require AI developers to submit information to China’s algorithm registry, including passing a security self-assessment. AI providers add their algorithms to the registry along with some publicly available categorical data about the algorithm and a PDF file for their “algorithm security self-assessment”. These uploaded PDFs aren’t available to the public, so “we do not know exactly what information is required in it or how security is defined”. 

Note also that these provisions only apply to public-facing generative AI within China, excluding internal services used by organizations.

The UK

The draft AI bill recently introduced to the House of Lords does not mention evaluations. There is discussion of “auditing”, under 5(1)(a)(iv), “any business which develops, deploys or uses AI must allow independent third parties accredited by the AI Authority to audit its processes and systems.” but these seem to be audits of the business rather than of the models. 

The UK government has expressed interest in developing AI evals. One of the three core functions of the recently announced AI Safety Institute is to “develop and conduct evaluations on advanced AI”, and in their third report, they announced that their first major project “is the sociotechnical evaluation of frontier AI systems”, focused on misuse, societal impacts, autonomous systems, and safeguards.

The EU

The EU’s draft AI Act has mandated some safety and risk assessments for high-risk AI and, in more recent iterations, frontier AI. 

As summarized here, the act classifies models by risk, and higher risk AI has stricter requirements, including for assessment. Developers must determine the risk category of their AI, and may self-assess and self-certify their models by adopting upcoming standards or justifying their own (or be fined at least €20 million). High-risk models must undergo a third-party “conformity assessment” before they can be released to the public, which includes conforming to requirements regarding “risk management system”, “human oversight”, and “accuracy, robustness, and cybersecurity”. 

In earlier versions, general-purpose AI such as ChatGPT would not have been considered high-risk. However, since the release of ChatGPT in 2022, EU legislators have developed new provisions to account for similar general purpose models (see more on the changes here). Article 4b introduces a new category of “General-purpose AI” (GPAI) that must follow a lighter set of restrictions than high-risk AI. However, GPAI models in high-risk contexts count as high-risk, and powerful GPAI must undergo the conformity assessment described above. 

Title VIII of the act, on post-market monitoring, information sharing, and market surveillance, includes the following:

Convergence’s Analysis

The tools needed to properly evaluate the safety of advanced AI models do not yet exist.  

As a result, legislators are bottlenecked by the lack of effective safety evaluations when it comes to passing binding safety assessments for AI labs. 

Developing effective safety assessments is likely to be outside the capabilities of regulatory governmental agencies. 

More independent systems for conducting safety assessments need to be developed in the next 5 years. 


Lucie Philippon @ 2024-03-22T14:15 (+3)

Could you build a sequence for the AI Regulatory Landscape Review? It would be easier to link it than individual posts.

SummaryBot @ 2024-03-20T15:17 (+1)

Executive summary: Current AI safety evaluation techniques are insufficient, and governments and third-parties need to rapidly develop more effective, independent safety assessments to enable responsible AI regulation and deployment.

Key points:

  1. AI safety evaluations assess the safety, capability, and alignment of AI models, which is crucial but challenging due to AI's flexibility and rapid development.
  2. Current major AI regulations mention risk assessments but lack specificity and don't incorporate cutting-edge AI-specific evaluation techniques being developed by researchers.
  3. The US has directed agencies to develop AI evaluation tools and required some information sharing from AI developers, while the EU AI Act draft includes conformity assessments for high-risk and general-purpose AI.
  4. Existing risk assessment tools are insufficient for AI, and the development of AI-specific evaluations is still nascent, bottlenecking effective AI safety legislation.
  5. Government agencies likely lack the resources and expertise to develop thorough AI safety evaluations independently, so more investment in third-party evaluation is needed in the next 5 years.
  6. Effective AI safety evaluation requires substantial, continuous investment and collaboration between AI and domain experts to keep pace with advancing AI capabilities.

 

 

This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.