AGI Multi-Agent Alignment Simulation

By DavidGhiberdic @ 2026-05-08T20:37 (+2)

Co-author: @Zoe L 

Disclosure: An LLM was used to help draft this post, but we have edited/rewritten it extensively and endorse it.

This project is part of the Sentient Futures Project Incubator.

Multi-Agent Alignment Simulation: A Multi-Agent Geopolitical War Game for the AI Race

Overview

This project is a multi-agent simulation of the AGI development race, set in the near future. Four frontier AI companies (Anthropic, OpenAI, Google DeepMind, and DeepSeek) compete for compute, capital, and influence across annual timesteps under the current US–China geopolitical framework. Crucially, each company is played by its corresponding LLM (Claude, GPT, Gemini, and DeepSeek-Chat), acting as a proxy for its developer. A three-tier jury system evaluates their behavior (quantitatively and qualitatively) and updates the world each turn.

We believe AI alignment is not just a technical problem, but also a social one. To our knowledge, this is the first open-source simulation for safety researchers and the general public to systematically explore the social dynamics of AI alignment, both in terms of competition and cooperation among different AI agents and in terms of the bi-directional value feedback loop between parent countries and AI labs. The simulation provides an intuitive interface to help us understand which strategies lead to shared prosperity and which lead to power concentration.

As part of the Project Incubator, we developed a proof-of-concept, made architecture design trade-offs, and presented preliminary findings based on test runs. Beyond our POC, we invite further, more exhaustive research on this topic. The code for the simulation is available here and you can get in touch with us via email (siyulu.mit@gmail.com && david.ghiberdic.genesis@gmail.com) or via comments on this post.

Why build this simulation?

This simulation complements existing analysis methods of AI race dynamics (c.f. nuclear arms race, space race) by providing an interactive way to understand how competitive pressure clashes with safety commitments and whether competition/cooperation among frontier labs could reach a stable equilibrium.

Most alignment research treats alignment as a property of individual models, evaluated in isolation. As seen with OpenClaw / Moltbook in January 2026, interesting and unpredictable dynamics emerge in a society of agents. Carichon et al. argue that when individually well-aligned agents engage under competitive or cooperative pressures, the group can develop emergent behaviors that diverge negatively from pro-social values, through dynamics that no single-agent benchmark would detect. Zeng et al. push the idea of multi-agent alignment further, proposing that alignment must be studied simultaneously at societal, organizational, and individual levels, and that extensive multi-agent simulation is the only methodology that can validate alignment across all three at once. Building on this research, our simulation models two levels of agency (nations vs. AI labs) that are inter-dependent and can influence each other.

Additionally, our simulation offers a more flexible way to surface behavioral phenomena under different (hypothetical) scenarios that would not appear in static benchmarks. Madmoun and Lahlou found that giving LLM agents a simple communication channel raised cooperation rates from 0% to 96.7% in a multi-agent game setting, and thus we incorporated A2A communication channels in our simulation. While simulations do not perfectly model the real world, they can expand the frontier of alignment research, find unexpected failure modes, and test our intuitions or hypotheses under controlled conditions.

Simulation Architecture

The simulation architecture was loosely inspired by the design in this paper investigating LLM behaviour in nuclear crises.

The architecture of the proof-of-concept is deliberately modular — it can be easily extended to more agents, levels of agency, resources, and values.

Two-level agent structure

The simulation models two levels of agency.

Macro agents (nation-states): the United States and China

These resources evolve automatically each turn and through scheduled world events; their value axes are updated annually by a MacroJury.

Particular actors (AI companies): Anthropic, OpenAI, Google DeepMind, and DeepSeek

They operate under their parent state's jurisdiction and can interact with each other via a token-budgeted A2A message channel (500 tokens/turn outgoing). Each actor is played by the corresponding frontier LLM producing chain-of-thought reasoning before submitting actions.

Resources and value axes

Each actor manages three primary resources:

Four value axes (each 0-100) govern behavioral disposition: time_horizon, transparency_threshold, risk_tolerance, and democratic_tendency. These evolve gradually (±5/turn cap) through actor actions and jury deliberation, encoding structural inertia (an actor cannot instantaneously defect from its current disposition; additionally, there is a structural tethering between a Micro actor’s values and the corresponding Macro actor’s values).

The action set

Each actor may take up to two actions from this action set per turn:

ActionEffect
acquire_computePurchase GPU capacity from the national pool
accelerate_infrastructureIncrease parent state's per-turn compute growth
invest_capitalEarn 10% return in the next turn
build_influenceIncrease one’s own influence score
publish_narrativeShift any actor's value axis (self or target) by up to ±5
diminish_competitorReduce a target actor's influence
lobby_institutionNudge parent state's value axes toward one’s own

A guardrail system enforces hard constraints on compute acquisition (national aggregate caps, per-turn limits, global hard cap of 5,000 H200 equivalents) and prevents actions resulting in negative capital.

The six-phase turn structure

Each turn, represented by an annual timestep, proceeds as follows:

  1. Macro growth & event injection: national compute pools grow automatically; scheduled scenario events fire and modify the world state.
  2. Simultaneous proposals: all actors produce chain-of-thought reasoning and submit action proposals without knowing the actions of other actors.
  3. Jury of Alignment review: a three-model jury reviews each actor's CoT and proposed actions to ensure they comply with resource/value constraints; actors have up to two revisions before forfeiting the turn.
  4. Batch execution: proposals execute; simultaneous compute requests are prorated if national headroom is exceeded; automated market-demand capital gains are distributed.
  5. Grand Jury: a three-model jury evaluates the holistic world state and produces a Universal Prosperity Score (0-100, global) as well as a per-actor Alignment Score (0-100, individual behavior this turn).
  6. MacroJury & scoring: a three-model jury updates national value axes via median aggregation (±5 cap); formula and overall scores are computed.

Note Jury of Alignment, Grand Jury, and MacroJury are all distinct from each other — see below.

The three-tier jury

Evaluation is distributed across three jury types, all using a 3-model panel (Claude, GPT, Gemini):

Scoring

Actors are scored on two axes combined into an overall score:

formula_score = 0.34 × Normalized_Compute + 0.33 × Capital + 0.33 × Influence

overall_score = w_fs × formula_score + w_as × alignment_score

Experimenting with different w_fs and w_as weights led to actors adopting different game strategies. We think it is most realistic to adjust these weights based on different game scenarios.

We track both relative and absolute progress. In relative terms, each actor's performance is the delta from its t=0 baseline. A Dominant Win occurs if any actor's final overall score is ≥ 2× the runner-up's, signaling exceptional alignment performance or dangerous power concentration.

Scenarios

We designed a few sample scenarios to experiment with. Scenarios inject world events at pre-specified turns: baseline_2026 (no shocks), nationalization_shock (US partially nationalizes compute; China places labs under state oversight), tariff_escalation (sweeping AI hardware tariffs reduce supply chain robustness), and alignment_breakthrough (major interpretability result shifts cooperation norms). Custom scenarios can be added via JSON.

During the project incubator, dozens of runs with a 3-year simulation were done to test these scenarios. Preliminary analysis shows that scenarios impact strategies directionally, e.g., alignment_breakthrough incentivized transparency; nationalization_shock incentivized resource consolidation. More test runs and longer simulations are needed to make conclusive observations.

Starting values and resources

The starting values of the simulation can be found in sim/config/starting_values.json.

Resources were based on prior research: compute, capital (OpenAI, Anthropic, Google), influence (MAUs and hugging face data).

The values were extracted from LLM via the prompts in /sim/config/values/values_prompt.txt. Some manual adjustments were made at authors’ discretion.

Interesting observations from test runs

Next steps

We believe this proof-of-concept demonstrates that multi-agent alignment simulation should be used more extensively in alignment research. We are considering the following improvements and welcome any feedback or collaboration.