New AI safety treaty paper out!

By Otto @ 2025-03-26T09:28 (+28)

Last year, we (the Existential Risk Observatory) published a Time Ideas piece proposing the Conditional AI Safety Treaty, a proposal to pause AI when AI safety institutes determine that its risks, including loss of control, have become unacceptable. Today, we publish our paper on the topic: “International Agreements on AI Safety: Review and Recommendations for a Conditional AI Safety Treaty”, by Rebecca Scholefield and myself (both Existential Risk Observatory) and Samuel Martin (unaffiliated).

We would like to thank Tolga Bilge, Oliver Guest, Jack Kelly, David Krueger, Matthijs Maas and José Jaime Villalobos for their insights (their views do not necessarily correspond to the paper).

Read the full paper here.

Abstract

The malicious use or malfunction of advanced general-purpose AI (GPAI) poses risks that, according to leading experts, could lead to the “marginalisation or extinction of humanity.”^[1] To address these risks, there are an increasing number of proposals for international agreements on AI safety. In this paper, we review recent (2023-) proposals, identifying areas of consensus and disagreement, and drawing on related literature to indicate their feasibility.^[2] We focus our discussion on risk thresholds, regulations, types of international agreement and five related processes: building scientific consensus, standardisation, auditing, verification and incentivisation.

Based on this review, we propose a treaty establishing a compute threshold above which development requires rigorous oversight. This treaty would mandate complementary audits of models, information security and governance practices, to be overseen by an international network of AI Safety Institutes (AISIs) with authority to pause development if risks are unacceptable. Our approach combines immediately implementable measures with a flexible structure that can adapt to ongoing research.

Treaty recommendations

(Below are our main treaty recommendations. For our full recommendations, please see the paper.)

The provisions of the treaty we recommend are listed below. The treaty would ideally apply to models developed in the private and public sectors, for civilian or military use. To be effective, states parties would need to include the US and China.

Establish a compute threshold above which development should be regulated.
- Compute thresholds are the most effective “initial filter” to identify models that may present significant risks.^[3]
- We think that AI Safety Institutes are best placed to specify and revise this threshold, and to decide how it is calculated.^[4]
- Acknowledging the limitations of compute thresholds, AISIs may also incorporate other metrics in thresholds.
- Contrary to some proposals, we do not propose exemptions for an international research institute.^[5]
Require “model audits” (evaluations and red-teaming) for models above the threshold.
- These audits should include evaluations conducted by AISIs or other public bodies with grey or white box access to models. Evaluations should occur during development at intervals approved by AISIs.^[6]
- AISIs should be convinced that a model does not pose an unacceptable risk, whether due to loss of control or other malfunction risks, or due to malicious use. If AISIs determine that a model poses unacceptable risk, states parties should ensure its development is paused.
- AISIs may also require developers to present safety cases, using a range of evidence to prove that risks are kept below quantified levels, as is required in other high-risk industries.^[7]
Require security and governance audits for developers of models above the threshold. Based on practices in high-risk industries, these audits could be performed by accredited private bodies. Governance audits should assess whether a developer has sufficient risk management procedures and a “safety culture”.^[8]
Impose reporting requirements and Know-Your-Customer requirements on cloud compute providers. To the best of their ability, providers should report the amount of compute consumed by their customers to a designated authority.
Verify implementation via oversight of the compute supply chain. Verification methods include monitoring transfers of AI-related hardware through customs and financial data analysis, and requirements to report sales and purchases; remote monitoring to identify data centres and energy consumption patterns; hardware components to track chips and their activities (contingent on further research); and on-site routine or challenge inspections. It may be beneficial for an international institution to conduct inspections, to reduce inter-state suspicion and minimise potential for conflict. See Section 1.4.4 for a discussion of verification methods. States parties will bear ultimate responsibility for enforcing the treaty, including pausing model development where required.

Existential Risk Observatory follow-up research

Systemic AI risks

We have worked out the contours of a possible treaty to reduce AI existential risk, specifically loss of control. Systemic risks, however, such as gradual disempowerment, geopolitical risks of intent-aligned superintelligence (see for example the interesting recent work on MAIM), mass unemployment, stable extreme inequality, planetary boundaries and climate, and others, have so far been out of scope. Some of these risks are, however, hugely important for the future of humanity, too. Therefore, we might do follow-up work to address these risks as well, perhaps in a framework convention proposal.

Defense against offense

Many AI alignment projects seem to be expecting that achieving reliably aligned AI will reduce the chance that someone else will create unaligned, takeover-level AI. Historically, some were convinced that AGI would directly result in ASI via a fast takeoff, and such ASI would automatically block other takeover-level AIs, making only alignment of the first AGI relevant. While we acknowledge this as one possibility, we think aligned AI that is powerful enough to help with defense against unaligned ASI, yet not powerful enough or unauthorized to monopolize all ASI attempts by default, is also a realistic possibility. In such a multipolar world, the offense/defense balance becomes crucial.

Although many AI alignment projects seem to rely on offense/defense balance favoring defense, so far little work has been done on aiming to determine whether this assumption holds, and in fleshing out what such defense could look like. A follow-up research project would be to try to shed light on these questions.

We are happy to engage in follow-up discussion either here or via email: info@existentialriskobservatory.org. If you want to support our work and make additional research possible, consider donating on our website or by reaching out to the email address above, since we are funding-constrained.

We hope our work can contribute to the emerging debate on what global AI governance, and specifically an AI safety treaty, should look like!

^{^}
Yoshua Bengio and others, International AI Safety Report (2025) <https://www.gov.uk/government/publications/international-ai-safety-report-2025>, p.101.
^{^}
For a review of proposed international institutions specifically, see Matthijs M. Maas and José Jaime Villalobos, ‘International AI Institutions: A Literature Review of Models, Examples, and Proposals,’ AI Foundations Report 1 (2023) <http://dx.doi.org/10.2139/ssrn.4579773>.
^{^}
Heim and Koessler, ‘Training Compute Thresholds,’ p.3.
^{^}
See paper Section 1.1, footnote 13.
^{^}
Cass-Beggs and others, ‘Framework Convention on Global AI Challenges,’ p.15; Hausenloy, Miotti and Dennis, ‘Multinational AGI Consortium (MAGIC)’; Miotti and Wasil, ‘Taking control,’ p.7; Treaty on Artificial Intelligence Safety and Cooperation.
^{^}
See Apollo Research, Our current policy positions (2024) <https://www.apolloresearch.ai/blog/our-current-policy-positions> [accessed Feb 25, 2025].
^{^}
For example, the U.S. Nuclear Regulatory Commission set a quantitative goal for a Core Damage Frequency of less than 1 × 10⁻⁴ per year. United States Nuclear Regulatory Commission, ‘Risk Metrics for Operating New Reactors’ (2009) <https://www.nrc.gov/docs/ML0909/ML090910608.pdf>.
^{^}
See paper Section 1.2.3.

PipFoweraker @ 2025-03-27T06:47 (+7)

Excellent paper with a good summary / coverage of international governance options. One hindrance to international cooperation that other regulatory movements have come into has been the challenge in getting dominant players to cede their advantaged positions or to incur what they would see as disproportionate costs. The challenge around getting large economies & polluters to agree to meaningful carbon measures is indicative.

In the AI space, a few actors large State both perceive each other to be strategic competitors, have a disproportionate amount of access to talent, capital and hardware, and have recently signalled a decrease in willingness to bind themselves to cooperative ventures. Thinking particularly here of the US's difference in signalling coming out of the Paris Summit compared to their positions at Berkeley and Seoul.

One possible way that other State level actors in the space could seek to influence the forerunners at the moment is covered by Christoph, Lipcsey and Kgomo writing commentary for the Equiano Institute. Their argument is that already existing groups like the EU, African Union and ASEAN can combine to provide regulatory leadership, inclusive AI development and supply chain resilience, respectively. I think this strategic approach dovetails nicely with challenges raised in the paper about how to effectively motivate China and the US from engaging with multilateral treaties.

Paper URL: https://www.equiano.institute/assets/R1.pdf

Otto @ 2025-03-27T12:36 (+2)

Thank you for the compliment, and interesting approach!