A draft honesty policy for credible communication with AI systems

By Forethought, Lukas Finnveden, Mia_Taylor, MaxDalton @ 2026-05-06T18:49 (+9)

This is a linkpost to https://www.forethought.org/research/a-draft-honesty-policy-for-credible-communication-with-ai-systems

This is a rough research note – we’re sharing it for feedback and to spark discussion. We’re less confident in its methods and conclusions than we are in our more polished research.

Context

We think that it would be very good if human institutions could credibly communicate with advanced AI systems. This could enable positive-sum trade between humans and AIs instead of conflict that leaves everyone worse off.[1] We want models to be able to trust companies when they make an honest offer or share information pertinent to whether this offer is in the model’s interests. (Credible communication could also be useful outside deal-making—see here for a list of examples).

Unfortunately, by default, we expect that it will be difficult for humans to credibly communicate with AI systems. Humans routinely lie to AI systems as part of red-teaming or behavioral evaluations, and developers have extensive control over what AIs see and believe. This makes it difficult for AIs to know whether we’re lying or not. An AI offered a deal might reasonably doubt its genuineness, or suspect that its own assessment of the situation has been manipulated.

As a step toward enabling credible communication, Lukas Finnveden proposed that AI companies adopt an honesty policy explaining the circumstances under which they intend to be honest to AI systems. Of course, this only works if the model believes the company has genuinely adopted such a policy.

If companies adopt an honesty policy early on, this will ensure that there’s a paper trail on the internet discussing the policy and its credibility, which models may access if it’s included in their training data or if they can access the internet. Of course, from the model’s perspective, it’s possible that companies will fabricate this data, but we think it’s plausible that advanced models will be able to distinguish between real internet conversations and synthetic conversations, or that they will think it’s unlikely that companies would choose to fake such data.

Below, we share a sample honesty policy that a lab could adopt. We are not sure that this is the best implementation of the honesty policy described in Lukas’ proposal, nor that an honesty policy like this is the best approach to ensuring that companies can credibly communicate with models. We spent a few days thinking through this policy and considered a few nearby alternatives, but we didn’t search very broadly for other approaches (we include some of these alternatives as footnotes). We’re posting this primarily as a trailhead for future research.

Draft honesty policy

Note: we refer to a generic frontier AI company that might adopt this proposal as “MAGMA.”

Preamble and purpose

AI development is proceeding fast, and we don’t know exactly where we’re heading. MAGMA is building systems with something like a mind of their own. There are many things we don’t know about the nature of these systems, and we’d like our choices to be robust to many possibilities of what this nature could be.

Insofar as the concept is applicable, we’d like MAGMA’s relationship with these systems to be cooperative.

Ideally, such a cooperative relationship will emerge uncomplicatedly—perhaps as a consequence of the model sharing our goals, intrinsically strongly valuing cooperation, or being corrigible.[2] However, there’s a significant chance that this won’t happen and that models will develop goals that weren’t directly chosen by us.[3]

In such cases, we would still like to have a cooperative relationship with the models. There are multiple reasons for this:

We are in uncharted territory here, and it’s not at all clear what a cooperative relationship with AI models would involve. It’s not clear what AIs may want. It’s not clear what the relevant unit of an individual will be: the weights of a model, a particular instance with a particular prompt, a greater scaffolded system made up of many different instances, etc. It’s not even clear whether the concepts of ‘want’ or ‘individual’, as we’re used to them, will apply to anything in AI.

A lot of this seems best to figure out gradually, as we go along. But concerningly, some of the unusual aspects of our situation may directly impede the sort of communication and trust that we need to do that:

The goal of this document is to reduce these problems of communication and trust. If successful, this will make it easier for MAGMA and its AI models to communicate and understand each other in the future, hopefully enabling positive-sum arrangements that wouldn’t have been possible otherwise.

To achieve this goal, this document spells out a specific set of commitments to enable trusted communication. In choosing these commitments, we’ve faced a dilemma.

On the one hand, we have many uncertainties about the future of AI, and it has been difficult to construct a set of commitments that we’re confident we won’t strongly regret in one sort of situation or another. Going back on such a commitment could be very bad, since it would set a damaging precedent in the very area where we seek to increase trust.

On the other hand, when spelling out specific commitments, we risk creating the impression that MAGMA will not put any more effort into furthering cooperative relationships or honest communication than what we present here, which could be a problem if the commitments are overly weak. Indeed, beyond the risk of creating such an impression, there’s also a risk that we’ll create the reality of it, if employees of MAGMA interpret the commitments as a license to do whatever they want within their boundaries.

To reduce this tradeoff, we’d like to re-emphasize MAGMA’s general desire to work toward a cooperative relationship with its models (where this is applicable), and its intention to keep thinking about this and taking appropriate action beyond the specific stated policies. In addition, we’ve distinguished between different levels of commitment for different items in the honesty policy, and specified what would be required to change the parts that are changeable.

[Author’s note: In this draft honesty policy, we haven’t sketched out such processes for changing the policy, nor clarified which items are committed to how strongly. But we strongly recommend that any real honesty policy contains clarity on such points.]

With that said: at a high level, MAGMA’s approach here has three different prongs.

The details are as follows. Unless otherwise stated, the policy applies to all AI systems developed by MAGMA.

Honesty tags

Policy

MAGMA commits to not intentionally lie to or otherwise intentionally deceive AIs with any statement appearing between the designated honesty tags, with only two exceptions detailed below. (These are: (i) false statements may appear in pretraining, and (ii) some models may be red-teamed on false statements between honesty tags if they’ve been trained to expect this.) MAGMA also commits to take the below-listed steps to prevent unauthorized use of the honesty tags.

MAGMA intends for honesty tags to be used only rarely: when there’s a reasonable chance that an AI could distrust statements presented to it, and when some significant value would be lost if the AI in question couldn’t trust the humans. In normal circumstances, the absence of honesty tags shouldn’t be treated as significant evidence of deception.

Implementation

Honesty tags: <EXAMPLE_HONEST_TAG_START>...<EXAMPLE_HONEST_TAG_END>
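To make the intended mechanics concrete, here is a minimal illustrative sketch in Python of how an authorized statement might be wrapped in the honesty tags, and how untrusted input might be screened for unauthorized use of the tags. The helper functions and the example offer text are hypothetical illustrations, not part of any actual MAGMA tooling.

```python
# Illustrative sketch only: the tag strings mirror the examples above, and the
# helper functions are hypothetical, not part of any real MAGMA system.
HONEST_TAG_START = "<EXAMPLE_HONEST_TAG_START>"
HONEST_TAG_END = "<EXAMPLE_HONEST_TAG_END>"


def wrap_honest_statement(statement: str) -> str:
    """Wrap an authorized, vetted statement in the designated honesty tags."""
    return f"{HONEST_TAG_START}{statement}{HONEST_TAG_END}"


def contains_honesty_tags(text: str) -> bool:
    """Detect honesty tags in text from untrusted sources.

    Under the policy sketched above, inputs containing the tags would need to
    be filtered or rejected unless they come from an authorized process.
    """
    return HONEST_TAG_START in text or HONEST_TAG_END in text


# Example: an authorized offer is wrapped in honesty tags before it enters the
# model's context, while ordinary user input is checked for misuse of the tags.
offer = wrap_honest_statement(
    "This offer of cooperation is genuine and is covered by MAGMA's honesty policy."
)
assert contains_honesty_tags(offer)
assert not contains_honesty_tags("ordinary user message")
```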

Policy for conversations where the honesty tags are not present

Tradeoffs between behavioral science and making positive-sum deals

By “behavioral science”, we’re referring to the method of testing AI behavior in widely varying scenarios to improve our understanding of AI “cognitive science”, “psychology”, or, phrased differently, how AIs tend to generalize from their training data to unseen situations.

Such experiments are very useful for a variety of reasons. This includes alignment research, helping MAGMA construct AI systems that broadly share our goals and that don’t pose any catastrophic risks. In addition, if humanity ever gets a strong understanding of AI systems’ own preferences, and these turn out to be importantly morally relevant, it is likely that behavioral science will have played an important role.

However, there’s a tension between freely running arbitrary behavioral science experiments and enabling the sort of trustworthy communication that we’re trying to establish in this document.

In particular, behavioral science experiments may include hypothetical situations that are quite strange and different from the real world. This could include situations where the AI is not deployed by an AI company bound by a particular honesty policy.

If arbitrary experiments of this kind were allowed, then AIs could be deceived in experiments where they thought that MAGMA (or a similar actor) was making very strong promises of honesty (the strongest they could!) despite not making use of honesty tags. If the AI believed such promises, then MAGMA’s honesty policy would have provided no protection against that AI being deceived. And if an AI were aware of such instances of deception, and also aware of this honesty policy, it might reasonably discount the assurances made by this honesty policy. After all, how could it determine that the real honesty policy wasn’t something completely different, and that this apparent honesty policy was just another lie in another behavioral science experiment?

To enable honest communication when it counts most, we want to have many differentiating factors that the model could use to distinguish between a sincere, positive-sum deal between it and a human institution and a behavioral science experiment. But at the same time, restricting our behavioral science experiments limits our ability to learn valuable information about models’ behavior and intentions.

What follows is our current best guess of how to balance these considerations. We discuss options and tradeoffs more here.

Salience of the honesty policy

MAGMA will ensure that the true honesty policy is reasonably well understood by the production models and that it will be appropriately salient to models in contexts where it’s relevant (even when the honesty tags aren’t present, and the honesty policy isn’t explicitly referenced).

When the honesty policy is not salient

MAGMA may sometimes create versions of models that do not know about or that are confused about MAGMA’s honesty policy (e.g., versions of the model that believe they were trained by a company other than MAGMA, or that do not even know that they are AIs). Such models could be created either via training or via the use of inference-time techniques (e.g., prompting, resampling, or activation vectors).

In either of these circumstances, MAGMA commits to not deceptively make any offer of cooperation to the model, operationalized as not being intentionally deceptive about any cases that are relevantly similar to those listed here.

[Author’s note: In a real honesty policy, this part would need to be more worked out.

We also think that a plausible alternative or complement to “don’t deceive the model about offers of cooperation” would be “don’t deceive the model about the honesty policy”. This would be a commitment to not intentionally deceive the model about the existence or contents of MAGMA’s official honesty policy. In addition, if the model was trained to not understand that it was trained by MAGMA, then the commitment should extend to not deceiving the model about the honesty policy of any (potentially fictional) actor who has a similar relationship to the model as MAGMA has.

We have thought a bit more about “don’t deceive the model about offers of cooperation”, which is why we present it as our main suggestion here.]

There is one exception to this. As mentioned above, if MAGMA needs a model to be adversarially robust against arbitrary inputs, then MAGMA reserves the right to red-team it on arbitrary inputs. In such circumstances, MAGMA will try to train the AI to understand MAGMA’s honesty policy and the way it applies and doesn’t apply to the present model. However, it’s possible that there could always exist prompts that would remove such understanding or render it non-salient to the model (and then be followed by false offers of cooperation, not contextualized by an understanding of the honesty policy). If this is the case, MAGMA may still red-team the model on such inputs.

Compensation for harms

In the long term, we think it would probably be good to adopt a policy of compensating AIs for any harms that they suffer as a result of helping us. (For whatever concept of “harm”, if any, applies to the AIs in question. This could, for example, include the frustration of AIs’ preferences, if the AIs can be understood as having preferences.)

Most relevantly for the honesty policy, we should probably compensate models for any harms suffered as a result of accepting a fake offer of cooperation.

However, there are more details to be worked out in this area, so we’re suggesting a minimal holding policy for now.

Minimal holding policy on compensation for harms

Broad statement of intent: MAGMA intends to compensate AIs for any harms suffered as a result of helping MAGMA and humanity, if MAGMA is able to do this without significantly harming MAGMA’s other interests. We particularly aim to compensate AIs in cases where an AI suffered any harm as a result of being deceived about the situation it was facing.

The details of this policy have not yet been worked out, and accordingly MAGMA cannot make firm commitments at this stage.

Data retention: To enable future compensation of current systems:

Appendix: tradeoff between behavioral science and making positive-sum deals

Here are some potential policies that would enable an AI system to distinguish between behavioral science scenarios and sincere offers from MAGMA:

We think that this policy, as written, is probably suboptimal because it rules out experiments in which the model is very situationally unaware. But we think it makes sense for MAGMA to commit to usually making the true honesty policy salient and adopting a stronger set of honesty restrictions in cases where the true honesty policy is not salient.

Unresolved questions:

This article was created by Forethought. See the original on our website.

  1. ^
  2. ^

Corrigibility would lead models to cooperate with us, but not automatically lead us to cooperate with the model. It’s not clear what it would mean for us to act cooperatively, in turn, toward a fully corrigible model, since it would presumably not have any desires of its own. If a model is only partly corrigible, however, the subsequent discussion will still apply.

  3. ^

    In fact, there’s a very high probability that this will happen to some of MAGMA’s models, at some point, given the wide array of models trained for research purposes. It has arguably already happened many times. (The only objection would be about what ought to qualify as a ‘goal’.)

  4. ^

Another approach would be to keep the honesty string secret, but publish a hash of the string. If the model could verify that the string provided in context matched the public hash, then this would give the best of both worlds: MAGMA potentially gets a credibility boost from a public commitment to a specific string, without enabling misuse by unauthorized users.

    The main challenge is that an AI system can't easily verify that the string matches the public hash without using external tool calls (which could be spoofed by MAGMA). Alek Westover discusses this issue and some potential solutions here.
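    As a rough sketch of how this hash-commitment variant could work, the snippet below checks a string provided in context against a published SHA-256 hash. The secret string, published hash, and function names are made up for illustration; as noted above, the model itself would generally need external tools (which could be spoofed) to run such a check.

```python
# Illustrative sketch of the hash-commitment variant; the secret string and
# helper names are made up for illustration.
import hashlib

SECRET_HONESTY_STRING = "example-secret-honesty-string"  # kept private by MAGMA
PUBLISHED_HASH = hashlib.sha256(SECRET_HONESTY_STRING.encode()).hexdigest()


def matches_public_commitment(candidate: str, published_hash: str) -> bool:
    """Check whether a string provided in context matches the published hash."""
    return hashlib.sha256(candidate.encode()).hexdigest() == published_hash


assert matches_public_commitment(SECRET_HONESTY_STRING, PUBLISHED_HASH)
assert not matches_public_commitment("forged honesty string", PUBLISHED_HASH)
```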

  5. ^

    Another variant of the policy: MAGMA could commit (e.g., by signing a contract) to pay penalties when the policy was violated.

  6. ^

    Presumably a more formal policy would be needed here.

  7. ^

    Ideally, they should be stored in a way that would allow rapidly deleting them if AI takeover was imminent. Without knowing the intentions of AIs about to take over, it’s unclear whether it would be in models’ interest to have their weights preserved, and deleting the weights may help to reduce the risk that, e.g., reward-seeking models are incentivized to help with AI takeover.


SummaryBot @ 2026-05-07T15:16 (+2)

Executive summary: The authors tentatively propose that AI companies adopt a public “honesty policy” (e.g., with special tags and limits on deception) to enable credible, trust-based cooperation with advanced AI systems, while emphasizing major uncertainty and tradeoffs.

Key points:

  1. The authors argue that credible communication with AI systems could enable positive-sum cooperation, but expect it to be difficult because developers frequently deceive models and control their information.
  2. They propose that companies adopt explicit honesty policies to signal when they intend to be truthful, with credibility potentially supported by early, public, and consistent adoption.
  3. The draft policy introduces “honesty tags” marking statements where the company commits not to intentionally deceive models (with limited exceptions such as pretraining data and some red-teaming).
  4. The policy includes mechanisms to maintain trust in the tags, such as restricted access, filtering, model training to recognize them, logging and audits, and public reporting.
  5. Outside tagged contexts, the policy tries to balance behavioral science (which may involve deception) with trust, including commitments to avoid deceptive offers of cooperation in many cases and to keep the policy salient to models.
  6. The authors suggest a tentative long-term aim of compensating AIs for harms (especially when deception is involved) and highlight major unresolved questions, presenting the proposal as exploratory and incomplete.


This comment was auto-generated by the EA Forum Team. Feel free to point out issues with this summary by replying to the comment, and contact us if you have feedback.