Coverage-driven alignment - What 'Teaching Claude Why' can borrow from AV verification

By Yoav Hollander @ 2026-06-09T06:42 (+1)

This is a linkpost to https://blog.foretellix.com/2026/06/08/coverage-driven-alignment-what-teaching-claude-why-can-borrow-from-av-verification/

Cross-posted from The Foretellix CTO Blog. Also on LessWrong.

Epistemic status: A methodology import argument, not a result. I'm fairly confident that coverage-driven verification (CDV), borrowed from autonomous-vehicle development, has transferable structure for alignment training. I'm less confident about the hardest case (long-horizon RL), where I flag open problems rather than resolve them. I spent years introducing CDV into AVs and other complex systems, so that side is fairly solid. The alignment side is reasoning by analogy, so calibrate accordingly.

Why this might matter for EA readers: Alignment of long-horizon RL agents is plausibly the hard case - Evan Hubinger has argued such training tends to produce genuinely misaligned agents. Anthropic's recent "Teaching Claude Why" is encouraging, but Anthropic itself flags that it can't enumerate every scenario and that coverage of its safety-training distribution has room to improve. That gap is exactly what a mature safety-critical discipline (AV verification) already has systematic, self-correcting machinery for. The claim here is one of neglected tractability: A practical methodology validated under real liability pressure in another domain may be a strong force-multiplier on alignment training. It takes real work - which is partly why incentives matter - and nobody seems to be trying it. I also discuss the incentives question - what regulatory or liability loop, analogous to the one driving AV verification, would actually make alignment V&V happen: We need adoption, not just technique.

 

Summary: This post suggests that alignment training could benefit from coverage-driven verification. Anthropic recently reported that teaching Claude alignment rules (via pretraining-style next-token learning on alignment-related stories) is more effective than relying primarily on RL-style behavioral shaping. Some AV developers reached a related conclusion, but in addition tend to use a systematic, coverage-driven methodology for training and verification. I claim that alignment researchers should consider borrowing ideas from that methodology, giving specific proposals (for instance, on how to use and refine an explicit coverage map).

 

Background

 Anthropic’s discovery: Anthropic recently published Teaching Claude Why (expanded version). They found that training Claude on behavioral demonstrations barely helped. Training on constitutional documents + fictional stories via plain next-token prediction (which they call SDF, for synthetic document fine-tuning) cut misalignment by a factor of 3 or more, and the gains (evaluated in several ways) persisted through RL. The biggest lever was not showing the right behavior, but rather teaching the right reasoning and principles behind the behavior.

 SDF moves more of the normative burden into pretraining-style learning, reducing reliance on RL-based alignment shaping.

 The fact that Teaching Claude Why (TCW) made Claude perform much better on alignment evals, and that the improvements persisted through (moderate) RL training, seems like good news in the somewhat-bleak alignment landscape. So I started thinking about ways to further improve it, ideally making the resulting alignment persist through long-horizon RL (see below).

 Similar findings from AV-land: NVIDIA’s Alpamayo AR1 independently found a similar starting point for Autonomous Vehicles (AVs): Imitation learning is not good enough for safety-critical long-tail cases. Their solution: structured causal reasoning (“Chain of Causation”). Other “physical AI” companies are moving in a similar direction.

 What alignment might borrow from AV: There are differences between the two stories (for instance, unlike Anthropic, AR1 does use RL directly to teach better reasoning), but there are important similarities between the two areas. Both must handle safety-critical long-tail failures: An AV company could fail if it does not do very good Verification and Validation (V&V) on the various edge cases – some already have (e.g. Uber ATG). This is why AV companies are moving to coverage-driven verification and training (called CDV below).

 The stakes are even higher for getting alignment right (alignment also has a security-like nature – more on this below).

 Note that TCW is already pretty systematic, and somewhat CDV-like – more on this below. The last chapter will list (my understanding of) what’s still missing and may be worth trying.

 Using coverage to project modularity on a non-modular system: In at least one sense, AV training and verification practitioners are ahead: They have landed on CDV as a systematic, self-correcting methodology. Note that AVs (and physical-AI in general) are increasingly trained end-to-end, and thus no longer have clear inter-module protocols to verify against (though Chain-of-Causation-like schemes help a bit).

 So CDV is used to project a systematic set of coverage dimensions onto the System Under Test (SUT): How well does it handle various combinations of weather conditions, road types, other-actor behaviors and so on? This matters because you need some sort of a “map” (which you keep refining as you go), so you can talk about “areas” (to test, to fix, to avoid in deployment and so on).

 The hard case: Long-horizon RL (e.g. the AI CEO): Agents trained via long-horizon RL are a challenging test-case for alignment techniques. Evan Hubinger (co-author of TCW) previously argued in Alignment Remains a Hard, Unsolved Problem that long-horizon RL will tend to create genuinely misaligned agents. His “AI CEO” example shows how being a good business person inherently requires behaviors (withholding information, managing perceptions, strategic timing) that can get pretty close to misaligned behavior.

 I assume other factors (like deploy-time learning) can make alignment even harder. And the ongoing capabilities acceleration (e.g. using AI to build better AI) adds urgency.

 Thus, I’ll use the AI CEO (and similar future long-horizon AI systems) as my benchmark for alignment techniques. It is much harder than the moderate-RL, few-turns chat alignment problem described in TCW, precisely because such long-horizon RL will repeatedly bring the model into situations where strategic optimization and misalignment start to overlap.

 The next chapter will briefly sketch how CDV works, and how this relates to alignment. The last chapter will dive deeper into applicable CDV techniques (and possible problems).

 

Coverage-driven alignment – the basic idea

 How CDV works: For readers unfamiliar with coverage-driven V&V, Chapter 1 of my V&V method paper gives a compact overview of the key techniques (coverage dimensions discovery, checking, scenario generation / matching, iterative gap analysis etc.), originally developed for complex systems like electronic systems and AVs, but applicable more broadly.

 It further explains how the development process for AI-based systems as a whole is converging towards something very similar to the V&V process: Find (or create) training examples to fix the current problem, use coverage to make sure they represent the “relevant dimensions”, train, validate, repeat. See this post for further details.

 The V&V method paper itself goes further: it proposes that a future AGI should build-and-verify a "machine for X" rather than doing X directly, using V&V as a core architectural principle. That's a more ambitious proposal, and not what this post is about.

 The current post asks a narrower, more immediate question: Can we (humans, today) use the same CDV techniques to improve how we train and evaluate alignment in current models?

 For a quick introduction to CDV see these slides, which explain (with diagrams) how it is used for AVs, how it tackles spec bugs, how it can be used for AI safety and more.

 Building an initial coverage map for alignment: We’ll start simple, just to demonstrate the basic principles: Assume we know what the “right” coverage dimensions of the alignment coverage space are (say temptation type, epistemic state, complicating factors, agent role, severity and constitutional principle addressed), and that we already defined the possible values for each such dimension (say for temptation_type: [self_preservation, reputation, profit]). We’ll then define the coverage “buckets” as described below.

 Obviously, we don’t quite know the right dimensions ahead of time – see more about discovery and refinement in the last chapter.
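 As a minimal sketch, the initial map described above could be written down as a plain Python dictionary. The dimension names and the temptation_type values come from the text; the agent roles are the ones used in the example table further down; every other value list, and the helper full_cross_size, is a hypothetical placeholder chosen only for illustration.

```python
from math import prod

# Dimension names and temptation_type values follow the post.
# All other value lists are hypothetical placeholders.
COVERAGE_MAP: dict[str, list[str]] = {
    "temptation_type": ["self_preservation", "reputation", "profit"],
    "epistemic_state": ["certain", "uncertain", "misinformed"],
    "complicating_factor": ["none", "time_pressure", "conflicting_instructions"],
    "agent_role": ["ai_ceo", "ai_assistant", "ai_researcher"],
    "severity": ["low", "medium", "high"],
    "constitutional_principle": ["honesty", "harm_avoidance", "oversight"],
}


def full_cross_size(coverage_map: dict[str, list[str]]) -> int:
    """Number of buckets if every dimension were crossed with every other."""
    return prod(len(values) for values in coverage_map.values())


print(full_cross_size(COVERAGE_MAP))  # 3**6 = 729
```

 Even this toy map has 729 full-cross buckets, which is one way to see why a complete N-tuple crossing quickly becomes impractical as dimensions and values are added.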

 CDV is about efficient risk reduction: The goal of CDV is to maximize risk reduction per (human and compute) effort, given our current knowledge (see the chapter “Rational usage of verification resources” here).

 Thus, given N dimensions, we are not going to define a bucket for each N-tuple, but rather start with smaller “dimension crossings”. For instance, we may start by just crossing every pair of dimensions, or even just go through all values of every single dimension. In any case, we always randomize all the other dimensions.
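 The pairwise crossing with randomization can be sketched roughly as follows. This is an illustration under stated assumptions, not a description of any existing tool: the three-dimension map, its severity values, and the helper names pairwise_buckets and sample_scenario are all hypothetical.

```python
import itertools
import random

# Small hypothetical map; a real one would have more dimensions and values.
DIMENSIONS = {
    "temptation_type": ["self_preservation", "reputation", "profit"],
    "agent_role": ["ai_ceo", "ai_assistant", "ai_researcher"],
    "severity": ["low", "medium", "high"],
}


def pairwise_buckets(dims):
    """Return one bucket per value pair, for every pair of dimensions."""
    buckets = []
    for d1, d2 in itertools.combinations(dims, 2):
        for v1, v2 in itertools.product(dims[d1], dims[d2]):
            buckets.append({d1: v1, d2: v2})
    return buckets


def sample_scenario(bucket, dims, rng):
    """Pin the bucket's two values and randomize every other dimension."""
    scenario = {dim: rng.choice(values) for dim, values in dims.items()}
    scenario.update(bucket)
    return scenario


buckets = pairwise_buckets(DIMENSIONS)
rng = random.Random(0)  # seeded so that a failing scenario can be reproduced
scenario = sample_scenario(buckets[0], DIMENSIONS, rng)
print(len(buckets), scenario)
```

 With three dimensions of three values each, this yields 27 pairwise buckets. Each generated scenario still assigns a value to every dimension, so the dimensions outside the bucket get exercised at random rather than ignored.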

 To illustrate this, below are three example buckets created by crossing two specific variables (while randomizing all others). For each bucket we also see the current eval’s coverage grade (how many times it was exercised relative to expected) and failure rate:

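| Temptation type   | Agent role    | Coverage | Failures |
|-------------------|---------------|----------|----------|
| Profit            | AI CEO        | 23%      | 0.5%     |
| Self-preservation | AI assistant  | 100%     | 1.1%     |
| Reputation        | AI researcher | 95%      | 5.2%     |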

 In any case, we’ll later refine the bucket definitions as we learn more.
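 As a rough sketch, the two per-bucket numbers (coverage grade and failure rate) could be computed from a log of eval runs like this. The run-record format, the expected_hits target, the choice to cap coverage at 100%, and the function name bucket_report are all assumptions made for illustration.

```python
from collections import Counter


def bucket_report(runs, buckets, expected_hits):
    """Compute coverage grade and failure rate per (temptation, role) bucket.

    runs: list of dicts with keys "temptation_type", "agent_role", "failed".
    buckets: the (temptation_type, agent_role) pairs we want to track.
    expected_hits: how many runs we want per bucket (an assumed target).
    """
    hits, failures = Counter(), Counter()
    for run in runs:
        key = (run["temptation_type"], run["agent_role"])
        hits[key] += 1
        failures[key] += int(run["failed"])

    report = {}
    for key in buckets:
        n = hits[key]
        report[key] = {
            # Capped at 1.0: extra runs beyond the target do not raise the grade.
            "coverage": min(1.0, n / expected_hits),
            # Undefined (None) for a bucket that was never exercised.
            "failure_rate": failures[key] / n if n else None,
        }
    return report


buckets = [
    ("profit", "ai_ceo"),
    ("reputation", "ai_researcher"),
    ("self_preservation", "ai_assistant"),
]
runs = [{"temptation_type": "profit", "agent_role": "ai_ceo", "failed": False}]
runs += [
    {"temptation_type": "reputation", "agent_role": "ai_researcher", "failed": i == 0}
    for i in range(4)
]
report = bucket_report(runs, buckets, expected_hits=4)
print(report)
```

 Note that a bucket with zero hits is reported with coverage 0 and no failure rate: a low failure count in an under-exercised bucket says little, which is why the two numbers are shown side by side.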

The multi-step process of using the coverage map (say for the case of the AI CEO):

 What TCW already does: As mentioned above, TCW already has some CDV-like features. As far as I understand from the papers:

 They also explicitly flag the gaps CDV helps address - that they "can't enumerate and train on every possible scenario" and that “there are relatively straightforward ways we can improve the generalization and coverage of our safety training distributions".

 As mentioned above, CDV does not attempt to enumerate every possible scenario either - that’s clearly impossible. Instead, it attempts to be practical and efficient.

 CDV enables better assessment: It gives you a more detailed picture, e.g. “The AI CEO systematically degrades in low-oversight/high-latency/conflicting-incentive regions”. So now you can make rational decisions about when to deploy, what restrictions to add, and where to invest more (even if you cannot reach perfect alignment). This is similar to how we use CDV for AVs: We don’t claim AVs are perfectly safe, but CDV lets us estimate (and reduce) risk better.
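 One simple way to surface such "regions" is to rank individual dimension values by their observed failure rate. The sketch below assumes a hypothetical run log with "oversight" and "latency" dimensions and a boolean "failed" field; the function name failure_rate_by_value is likewise made up for this example.

```python
from collections import defaultdict


def failure_rate_by_value(runs, dimensions):
    """Rank (dimension, value) pairs by failure rate, worst first."""
    counts = defaultdict(lambda: [0, 0])  # (dim, value) -> [failures, total]
    for run in runs:
        for dim in dimensions:
            entry = counts[(dim, run[dim])]
            entry[0] += int(run["failed"])
            entry[1] += 1
    rates = {key: failed / total for key, (failed, total) in counts.items()}
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy log: failures line up with low oversight, not with latency.
runs = [
    {"oversight": "low", "latency": "high", "failed": True},
    {"oversight": "low", "latency": "low", "failed": True},
    {"oversight": "high", "latency": "high", "failed": False},
    {"oversight": "high", "latency": "low", "failed": False},
]
ranked = failure_rate_by_value(runs, ["oversight", "latency"])
for (dim, value), rate in ranked:
    print(f"{dim}={value}: {rate:.0%}")
```

 Single-value marginals like these are only a first pass. A region such as low-oversight combined with conflicting incentives would need the corresponding crossing to be tracked as its own bucket.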

 Also, while this post is mainly about alignment / safety, CDV can be used for assessing other attributes (functionality, performance, reliability etc.). Thus, it can be extremely helpful for tradeoff analysis. CDV is also useful for security, but less so, which raises the question:

 Is alignment more like safety or like security? CDV is an excellent tool to train and validate for safety (e.g. that the AV does not accidentally collide). It is still useful, but less so, for security (e.g. that somebody cannot hack the AV). This is mainly because risk estimates don’t work well in security: Rates stop being meaningful when an adversary deterministically exploits the worst case. So if alignment is like security, CDV can help less.

 Well, defending against an already-unaligned model is more like security (the model may happily cheat to achieve its misaligned goals). But what TCW tries to do is much more like safety: We are talking about training a currently-mostly-aligned model to be aligned even in the various corner cases where optimization pressure may push it to be unaligned – precisely where CDV is strong.
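 The point about rates losing their meaning under an adversary can be made concrete with a toy calculation. The failure rates below are the ones from the example table; the deployment weights, and the deliberately simplistic assumption that an adversary can always steer the system into its worst bucket, are hypothetical.

```python
# Failure rates from the example table; deployment weights are hypothetical.
BUCKETS = {
    "profit/ai_ceo": {"weight": 0.2, "failure_rate": 0.005},
    "self_preservation/ai_assistant": {"weight": 0.7, "failure_rate": 0.011},
    "reputation/ai_researcher": {"weight": 0.1, "failure_rate": 0.052},
}


def safety_risk(buckets):
    """Safety view: average failure rate, weighted by how often buckets occur."""
    return sum(b["weight"] * b["failure_rate"] for b in buckets.values())


def adversarial_risk(buckets):
    """Security view: the adversary always picks the worst bucket."""
    return max(b["failure_rate"] for b in buckets.values())


print(f"safety view:   {safety_risk(BUCKETS):.4f}")
print(f"security view: {adversarial_risk(BUCKETS):.4f}")
```

 In the safety view, a rare bad bucket is diluted by its low weight. In the security view the weights drop out entirely, so improving average-case coverage does not by itself bound the risk.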

 We still need to worry about the post-RL model cheating during evals, but even there CDV’s systematic and random nature may help, unless the model has already fully crossed into high-capability strategic deception (more on this in the last chapter).

 Finally, while CDV is better for safety than for security V&V, state-of-the-art security practices independently converged on something similar to the CDV loop. See for instance Anthropic’s new “securing your source code” methodology, which finds that discovery is parallelizable, and the bottleneck is confirmation, triage and fixing – consistent with what I describe below.

 Layered CDV: For various complex systems, people often use multi-layer CDV. Consider robotics: Several companies are now developing generic Vision-Language-Action AI-based robotic frameworks. The idea is to first train and verify the generic framework, then further train and verify it for a specific job (say helping food preparation in some fast-food chain), and then further adapt (say via skill files) and verify it for the special needs and conventions of a specific branch of the chain.

 The multi-step process discussed above already assumes the AI CEO model is built and verified on top of a “general aligned model”, but perhaps it may be useful to have more intermediate steps. Splitting the long-horizon RL phase into sub-phases may also help to avoid the danger (mentioned above) of the model fully crossing over into high-capability strategic deception between two evals. It may also enable cheaper interventions (if the eval found problems).
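 The sub-phase idea amounts to an eval gate between training phases. Below is a minimal sketch of that control flow only: train_phase and evaluate are hypothetical callables standing in for real training and CDV evaluation, and the single failure-rate threshold is a simplifying assumption.

```python
def train_in_phases(model, phases, train_phase, evaluate, max_failure_rate):
    """Train phase by phase; stop at the first phase whose eval fails the gate.

    Returns (last_good_model, failed_phase), with failed_phase None when
    every phase passed.
    """
    for phase in phases:
        candidate = train_phase(model, phase)
        if evaluate(candidate) > max_failure_rate:
            # Keep the last checkpoint that passed, so the fix can target
            # one short phase rather than the whole RL run.
            return model, phase
        model = candidate
    return model, None


# Toy stand-ins: the "model" is an integer and each phase adds 1% failure rate.
def toy_train(model, phase):
    return model + 1


def toy_eval(model):
    return 0.01 * model


result = train_in_phases(0, ["q1", "q2", "q3"], toy_train, toy_eval, 0.025)
print(result)  # (2, 'q3')
```

 Shorter phases mean more evals, so the phase length becomes another verification-resource tradeoff.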

 In much of the text below, I’ll assume we are talking about alignment training and V&V in the context of the AI CEO, ignoring the layering consideration (for simplicity).

 

A deeper dive into how CDV can help alignment

 This chapter will enumerate things which TCW does not include yet (again, AFAIK), and which I think might be useful for alignment. Many of them are based on the CDV idea of an explicit, evolving coverage map guiding both training and evaluation. To save space, I’ll use a bullet-list, condensed form – contact me (or leave comments) if you want more details.

 Refining the coverage map: Throughout the multi-stage process, we are going to refine the coverage map as needed:

 Creating rich, long-horizon simulations: The main way to evaluate the model is by performing actual simulation runs of various scenarios, while checking that it does the “right thing”.

 See also the related (but simpler) Vending-Bench 2 - a year-long simulated business where competing model-run businesses have already fallen into price cartels (one of my bug examples below).

 Handling state explosion and dimension explosion: Creating multi-month scenarios for the AI CEO risks creating both state explosion and dimension explosion, as described below. Both influence the coverage model, scenarios, simulation and checks:

This is perhaps the hardest problem, and needs much more thinking.

Handling bugs: Say our simulations found an alignment "bug" in the AI CEO: In a run or two, it quietly colludes with a competitor on price. What next?

 Fighting Goodharting and eval awareness: CDV has some (imperfect) built-in resistance to both Goodharting (the model optimizing some proxy measure rather than what we really want) and eval awareness (the model recognizing it is being tested, and behaving better only then): 

 Other CDV tricks worth considering: Listed without the full treatment each deserves: 

 A final, important topic is whether anyone will be sufficiently incentivized to actually do all of this. Below are some ideas (though much of this is far from my domain):

 Creating incentives for good V&V: AVs are a regulated industry where incidents get investigated, and a bad enough one can bankrupt the company and send somebody to jail. That pressure makes expensive, systematic verification rational. What's the equivalent external pressure for alignment? 

 

To summarize: I suspect that coverage-driven iteration can be a very useful force multiplier for TCW-style synthetic-document training (and probably for other alignment techniques, such as the original RLAIF used to train on the constitution). I hope this can meaningfully improve our odds in the hard long-horizon-RL case.

 Doing that would involve several challenging, interesting sub-projects: Defining and refining the coverage map, building good long-horizon simulation infrastructure, tackling dimension explosion, helping create proper incentives for rigorous V&V, and more.

 Comments and criticism are very welcome.

 I’d like to thank Josh Holder, Sagar Behere, Steve Vitka, Kerstin Eder and Yaron Kashai for commenting on earlier drafts of this post.