Concrete projects to prepare for superintelligence

By Forethought, William_MacAskill, finm @ 2026-03-27T20:03

This is a linkpost to https://www.forethought.org/research/concrete-projects-in-agi-preparedness

Introduction

There are lots of good, neglected, and pretty concrete projects people could set up to make the transition to superintelligence go better. This document describes some that readers might not have thought much about before. They are ordered roughly by how excited we are about them.[1] Of these, Forethought is actively working on AI character evaluation and space governance, and we are very interested in automating macrostrategy.

Summary

AI character evaluation. Start an independent org to evaluate and stress-test AI character traits (epistemic integrity, prosociality, appropriate refusals), hold developers accountable against their own model specs / constitutions, and suggest and incentivise improvements to the specs.

Automated macrostrategy. Create evaluations and benchmarks, collect human-generated training data, and build scaffolds to improve AI competence at big-picture strategic and philosophical reasoning.

AI security assessment. Start an independent org that evaluates AI models for sabotage and backdoors, and makes recommendations about AI constitutions.

Enabling deals. Start an independent organisation to broker deals with potentially misaligned AI models in order to incentivise early schemers to disclose misalignment and cooperate with alignment efforts.

AI for improving collective epistemics. E.g. build an AI chief of staff that helps users act in line with the better angels of their nature.

AI tools for coordination. Build AI for enabling coordination, like confidential monitoring and verification bots, and negotiation facilitators.

A space governance institute, like a “CSET for space”, both to work on important near-term space issues (e.g. data centres in space) and become a place of expertise for longer-term space governance issues.

Coalition of concerned ML scientists. Create a coalition of ML researchers (like an informal union) who commit to coordinated action (e.g. boycotts, conditions on participation in government projects) if AI developers cross minimal, uncontroversial red lines.

AI character evaluation

AI character[2] is a big deal, affecting most other cause areas.

There’s a lot of work to do on AI character:

In particular, someone could set up an independent organisation to evaluate AIs based on traits like epistemic integrity, prosociality, and behaviour (including appropriate refusals) in very high-stakes cases. It could cross-reference the published model specs with observed behaviours in realistic, stress-testing conditions (e.g. multi-agent dynamics, long conversations with real people), to hold developers accountable. It could also give qualitative reviews of model specs.

Automated macrostrategy

The basic argument is that:

  1. It would be extremely useful to have AI that can do macrostrategy and conceptual reasoning earlier than otherwise — even 3-6 months earlier could be a huge deal. This includes:
    1. Designing governance structures (e.g. rights and institutions for digital beings).
    2. Scoping emerging technological risks.
    3. Generating novel insights necessary to reach a great future (like the idea of acausal trade).
  2. We could potentially make this happen 3-6 months earlier through some combination of:
    1. Creating training data and evals / benchmarks for AI macrostrategy.
    2. Building scaffolds to improve AI macrostrategy performance.
    3. Creating infrastructure to enable AI researchers to build on each other (e.g. an improvement on journals + peer review).
    4. Training human managers in advance on how to get the most juice out of the latest AIs.
    5. Being prepared and willing to spend large amounts of money (≫$100m) on inference.

Work on this now could include:

On training data and evals: we think these could meaningfully improve the prospects for automated macrostrategy when it matters. It’s especially important to find people with good judgement to work on this, and it could be a big lift, so it’s worth starting early.

We’re not sure about the technical details, but it seems that competence and good judgement in philosophy and strategic thinking already lag behind other skills that are cheaper to train, and will continue to do so. One reason is that ground-truth answers are hard to generate, so we might need more examples produced by hand. It’s also less clear that we can trust the judgement of typical RLHF raters, because human competence in these areas is rare too. And there just aren’t many examples of great macrostrategic thinking in the training data.

So we should think about collecting training data, evals, and benchmarks (e.g. for training reward models, which can in turn be used to train reasoning models). Oesterheld et al. put together a dataset of conceptual arguments rated by thoughtful people. We’d love to see more of that kind of thing, though we’d probably need dozens of times more human evaluations to generate enough data to be meaningfully useful in training itself.

We could imagine an org which tries to collect evaluations or examples from (for example) grad students in fields like philosophy, and constructs benchmarks aimed at separating good reasoning from e.g. sycophancy, mere agreeableness, or avoiding taboo conclusions.
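For illustration, the rating-aggregation step could look something like the following minimal sketch, which fits Bradley–Terry quality scores to pairwise human judgements of arguments. All names and data here are hypothetical, and a real reward-model pipeline would be far more involved:

```python
# Minimal sketch (hypothetical data): fitting Bradley-Terry quality scores
# to pairwise human judgements of conceptual arguments -- the kind of
# preference signal a reward model for macrostrategy could be trained on.
import math

def fit_bradley_terry(pairs, n_items, lr=0.1, steps=2000):
    """pairs: list of (winner, loser) index pairs from human raters."""
    scores = [0.0] * n_items
    for _ in range(steps):
        grads = [0.0] * n_items
        for w, l in pairs:
            # P(winner beats loser) under the current scores
            p = 1.0 / (1.0 + math.exp(scores[l] - scores[w]))
            grads[w] += 1.0 - p
            grads[l] -= 1.0 - p
        for i in range(n_items):
            scores[i] += lr * grads[i]
    return scores

# Three hypothetical arguments; raters preferred 0 over 1, 1 over 2, 0 over 2.
pairs = [(0, 1), (1, 2), (0, 2), (0, 1), (1, 2)]
scores = fit_bradley_terry(pairs, n_items=3)
assert scores[0] > scores[1] > scores[2]
```

The fitted scores recover the raters’ implied ranking; at scale, the same kind of pairwise data could anchor benchmarks that separate good reasoning from mere agreeableness.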

AI security evaluations

AI-enabled concentration of power is a major risk, and there is loads to do. A new organisation (or project within an existing organisation) could:

An organisation with US national security expertise and credibility could be particularly valuable, by emphasising the risk of nation-state sabotage and the importance of AI that’s aligned with the US constitution.

Enabling deals with AIs

We could get into a situation where the newest AIs are misaligned, very capable, but not capable enough to successfully execute a takeover attempt on their own. If we don’t uncover evidence of misalignment, though, successors to these models could succeed in takeover. One solution would be to make a deal with the early scheming models, to incentivise them to disclose their misalignment and help with alignment efforts. Read more here, here, and here.

To make this happen, we could create an independent org focused on enabling credible precommitments and deals with AIs. This org could:

There are also a bunch of other things people could do, like:

Tools for collective epistemics

There’s a ton of low-hanging fruit for building socially useful tools on top of more-or-less existing LLM capabilities.

We’re especially interested in “epistemic tools” for increasing the general level of honesty and reasoning ability in society.

A key point here is that most of the impact from the most promising tools won’t come from helping individual users, but from changing the overall incentive landscape: e.g. if public actors know their claims will be automatically checked and their track records will be visible, they’ll be less inclined to write misleading content in the first place. Hence the focus on tools for collective over individual epistemics.

This piece (and the articles in the series) gives a few concrete ideas. A couple of examples of epistemic tools:

A “better angel” AI chief of staff. Within the next year or two, we expect “AI chiefs of staff” to become widespread. These would be AI agents that manage your life, acting like a chief of staff, executive assistant, and personal and work advisor all in one. The design of these, and how they present information and nudge their users, could have major impacts on user behaviour. We could try to get ahead of this, building the best AI chief of staff, and designing it so that it helps users act in accordance with their more reflective and enlightened preferences.

Reliability tracking: a system that compiles a public actor’s past statements, classifies them (factual claims, predictions, promises), scores them against what actually happened, and aggregates the results into a reliability rating. A reasonable starting point could be to audit the prediction track-record of well-known pundits, aiming to make high accuracy a point of pride, while still celebrating attempts to make predictions in the first place. A source of profit could be selling reliability assessments of corporate statements to finance companies that trade on them.
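As a toy illustration of the scoring step, resolved predictions could be scored with the Brier rule and averaged per actor. All names and numbers below are hypothetical:

```python
# Minimal sketch (hypothetical data) of reliability scoring: each resolved
# prediction gets a Brier score, aggregated into one rating per actor.
from dataclasses import dataclass

@dataclass
class Prediction:
    actor: str
    claim: str
    stated_prob: float  # probability the actor assigned
    outcome: bool       # what actually happened

def brier(p: Prediction) -> float:
    """Squared error between stated probability and the 0/1 outcome."""
    return (p.stated_prob - float(p.outcome)) ** 2

def reliability(preds: list[Prediction], actor: str) -> float:
    """Mean Brier score for one actor; lower is better (0 = perfect)."""
    own = [brier(p) for p in preds if p.actor == actor]
    return sum(own) / len(own)

preds = [
    Prediction("pundit_a", "Rates will rise", 0.9, True),
    Prediction("pundit_a", "Party X wins", 0.8, False),
    Prediction("pundit_b", "Rates will rise", 0.6, True),
]
# pundit_a: (0.01 + 0.64) / 2 = 0.325; pundit_b: 0.16
assert reliability(preds, "pundit_b") < reliability(preds, "pundit_a")
```

A real system would also need the harder upstream steps (retrieving statements, classifying them, resolving outcomes), but the aggregation itself is simple.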

Epistemic tools for strategic awareness

We’ll also highlight tools for strategic awareness: tools to surface information for making better-informed decisions, and to distribute access to that information. For example:

Ambient superforecasting: a platform which uses the best forecasting models to generate publicly available forecasts on important questions, so users can query it and get back superforecaster-level probability assessments.
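One standard method such a platform might use to combine multiple forecasts on the same question is the geometric mean of odds. A minimal sketch, with made-up probabilities:

```python
# Minimal sketch (hypothetical inputs): pooling forecaster probabilities
# via the geometric mean of odds, a common aggregation method.
import math

def aggregate_odds(probs):
    """Combine probabilities by taking the geometric mean of their odds."""
    odds = [p / (1 - p) for p in probs]
    gm = math.prod(odds) ** (1 / len(odds))
    return gm / (1 + gm)

# Three hypothetical forecasts on one question:
p = aggregate_odds([0.6, 0.7, 0.8])
assert 0.6 < p < 0.8  # the pooled forecast sits between the inputs
```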

Scenario planning: a platform built to generate likely implications of different courses of action, making it easier for users to analyse and choose between them.

Automated open-source intelligence: automated researchers which process huge amounts of publicly available information, to surface insights to the public which are normally hidden behind paywalls or trust networks. This project should be careful to choose areas where open-source intelligence is a public good (e.g. verifying compliance with treaties and sanctions, tracking corporate promise-breaking or law-breaking), rather than potentially destabilising areas (e.g. revealing military capabilities or vulnerabilities in ways that could increase conflict risk, or relatively benefitting bad actors).

Tools for coordination

As well as epistemic tools, we’re excited about tools for coordination, many of which could again be built with existing capabilities.

Some tools could enable cooperation where deals would otherwise go unmade, consensus exists but isn’t discovered, or people with aligned interests never find each other. We’ll highlight:

Negotiation facilitation: a platform to moderate negotiations or discussion between people (e.g. public consultations), to quickly surface key points of consensus, and suggest plans everyone can live with. Finding ways to automate complex negotiation is most promising where the space of possible compromises is huge and hard to search manually, such as multi-issue diplomatic or commercial negotiations.

Within tools for coordination, we’re especially excited about tools for assurance and privacy. In principle, LLMs let people show they have certain information without disclosing the information itself to other parties. This can unlock deals where information asymmetry, mutual distrust, or sensitivity of information normally blocks them. For example:

Confidential monitoring and verification: systems which act as trusted intermediaries, enabling actors to make deals that require sharing highly sensitive information without disclosing it directly. This is especially relevant for arms control, trade secret licensing, and other settings where verification is essential but full disclosure is unacceptable to all parties.

Structured transparency for democratic accountability: independent auditing systems which allow people to hold institutions to account in a fine-grained way without compromising legitimately sensitive information, by processing potentially sensitive information to produce publicly shareable audits.
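As a small illustration of the underlying idea, a classical cryptographic hash commitment lets a party prove later that it held a specific piece of information without disclosing it up front. This is a simpler analogue of, not a substitute for, the LLM-intermediary designs above; all values are hypothetical:

```python
# Minimal sketch (hypothetical values): a hash commitment scheme.
# Publish the commitment now; reveal message + nonce later to prove
# you held exactly that information all along.
import hashlib, secrets

def commit(message: bytes) -> tuple[bytes, bytes]:
    """Return (commitment, nonce); only the commitment is published."""
    nonce = secrets.token_bytes(16)
    return hashlib.sha256(nonce + message).digest(), nonce

def verify(commitment: bytes, nonce: bytes, message: bytes) -> bool:
    """After the reveal, anyone can check the commitment matches."""
    return hashlib.sha256(nonce + message).digest() == commitment

c, n = commit(b"our facility produced 40 units")
assert verify(c, n, b"our facility produced 40 units")
assert not verify(c, n, b"our facility produced 50 units")
```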

Space governance institute

Space governance could be a big deal for a few reasons:

There’s also a lot of change happening in the space world at the moment (primarily driven by SpaceX dramatically reducing launch costs), so now is an unusually influential time.

Forethought is currently running a 6-month research fellowship on space governance, with 3 full-time scholars, and 1–2 additional FTEs of support and research, including experts in space law.

Compared to other ideas in this list, we’re much less confident that space governance will turn out to be important, because space might only become relevant late into an intelligence explosion. The hope is to reach more certainty on some crux-y questions, and to get a better sense of concrete actions.

One potential practical project is to set up a “CSET for space”: a think tank that analyses the interaction between AI and space (in particular) and, perhaps, advocates in ways that run counter to corporate interests. Total lobbying in the space industry is apparently on the order of tens of millions of dollars per year, so even small amounts of investment could go a long way.

Some policy ideas that seem tentatively promising include:

What’s more, this organisation could become the go-to source for excellent non-corporate analysis of space-related policy, which could become increasingly important over the course of the intelligence and industrial explosions.

Coalition of concerned ML scientists

Currently, ML engineers and other technical staff at AI companies: (i) have prosocial motivations, often more than their leadership; (ii) have a lot of leverage over company policy, because they are crucial and hard to replace; (iii) will eventually lose much or most of their leverage after we get to fully automated AI R&D; and (iv) aren’t currently using their leverage as well as they could because, overall, there haven’t been serious efforts at coordination. Probably that’s a missed opportunity.

Someone could create a coalition (like an informal union) of ML researchers who agree to act en masse when needed. Getting it off the ground could involve loudly talking about the idea, setting out the core tenets, and getting commitments to join from influential early members. Doing all this via individual pledges could keep it legally safe from antitrust concerns. The organising body could then:

As well as actually taking actions, the mere existence of the coalition could improve things, just by making the threat of coordinated action salient to the AI companies.

This project would be a good fit for a former ML researcher, perhaps combined with someone with campaign and coalition-building experience. Some next steps on this would be to spec out the plan further, to investigate other examples of formal and informal unions (e.g. Tech Workers Coalition) and how they operate, and to build up a starting seed coalition of researchers. Whoever sets up this project should be careful about how it could backfire, or become less relevant through mission creep.

This article was created by Forethought. Read the original on our website.

  1. ^

    Thanks to Max Dalton, Stefan Torges, and everyone else at Forethought for the background behind this list. Others at Forethought disagree somewhat with what items should be in the top-tier list, as well as prioritisation within that tier.

  2. ^

    Desired propensities for a model, which can be explicitly described or at least gestured towards in a model spec.