A Potential Strategy for AI Safety — Chain of Thought Monitorability

By Strad Slater @ 2025-09-19T18:42 (+3)

This is a linkpost to https://williamslater2003.medium.com/a-potential-strategy-for-ai-safety-chain-of-thought-monitorability-043f5b94f60e

Quick Intro: My name is Strad and I am a new grad working in tech wanting to learn and write more about AI safety and how AI will effect our future. This is my first article in a while and first post in the forum. Would love any feedback on the article and any advice on writing in this field! 

Over the past couple of years, Large Language Models (LLMs) have grown in both capability and popularity. This growth has helped garner increasing awareness for AI safety. AI safety encompasses issues ranging from immediate problems, such as discrimination and misinformation from LLMs, to the existential problem of AI alignment.

Many of the top companies working on AI are attempting to create Artificial General Intelligence (AGI), an AI that outperforms humans in all tasks. With the potential for AGI on the horizon, the need for alignment strategies seems increasingly important.

This article explores one possible strategy called “Chain of Thought Monitorability,” which is heavily discussed in the paper, “Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety.”

What is Chain of Thought Reasoning?

When asking ChatGPT a question, a user might notice that sometimes the model starts thinking. The user can click on the thinking button from the model which exposes a dropdown of generated text. This is the model’s Chain of Thought (CoT).

Humans have fast and slow forms of thinking. The fast form allows the quick production of actions such as singing the alphabet or tying ones shoes. These tasks are not too complicated. They almost feel like a reflex. For more complex tasks, such as solving a math problem, humans utilize a slower form of thinking which triggers more neurons in their brain, allowing for extended reasoning.

The CoT for an LLM works similar to this slower form of thinking. Having a CoT allows the LLM to break a complex task down into smaller parts, sequentially workout each part, and then arrive at a final result.

Older GPT models such as GPT-2 worked using autoregressive models which predict the next best word in a sentence based on the sequence of words before it. For example, take the incomplete sentence:

“I love my mom and…”

Given the words and their context, which is derived by the model through training on text data, GPT-2 might predict that the most likely word to follow the incomplete statement would be “dad.”

In these models, the user’s question acts as the incomplete statement. Someone could ask, “What color is the sky?” and the model would generate the word most likely to follow this question based on the text data it was trained on. The question is then fed back into the model with this new word added to the end, and the process repeats until the sentence is finished.

By understanding this process, one can see how CoT is effective for more complex tasks. Since complex reasoning tasks require multiple steps, it can be difficult for a model to jump straight from a question to its final answer. Researchers found that these models were more effective at answering complex tasks when CoT prompting was introduced.

CoT prompting involves asking a model to explain the reasoning for its response before giving it. This prompts the model to output the first step to solving a given task instead of attempting to jump straight to the conclusion. This first step is then fed back into the model to generate the rest of the response.

Often times, the second step to solving the task is the most likely statement to follow text containing the first step. The same goes for each subsequent step relative to the previous. This process repeats, outputting each step in solving the overall task until, finally, the answer is produced. This final answer is often more accurate because the model was guided through steps relevant to the task rather than attempting a long “jump” to the final answer.

CoT essentially acts as a window into a model’s thinking process which is what makes it valuable to AI safety efforts. In the same way one could gain more information on a human’s true intentions by reading their thoughts, being able to read a model’s thought process can help a researcher better detect misalignment.

Chain of Thought Monitorability

Over 40 researchers from the top AI labs (Google DeepMind, OpenAI, Anthropic, etc.) have come together to back a recent paper on CoT monitorability called “Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety.” In this paper, the value of CoT monitorability as a strategy for AI safety, its drawbacks, and recommended suggestions for future research are discussed.

How Chain of Thought Monitorability Can Help in AI Safety

A few assumptions need to be made in order for CoT monitorability to be useful for AI safety.

The first is that potentially misaligned actions from AI models are complex, necessitating the need for slower, chain of thought reasoning in LLMs. The paper explains how with current models, misaligned behavior such as carrying out sophisticated cyberattacks or exfiltrating a model’s own weights are far too complex for today’s models to execute unnoticed. Due to the complexity of these tasks, an LLM would have to utilize a CoT resulting in the exposure of any underlying intentions.

Another assumption is that models have a propensity to utilize CoT, even if a task might not require it. Because of CoT’s advantage, a model will often utilize the strategy even when not required since an accurate answer is more likely to be produced.

The last assumption is that the reasoning a model uses in its CoT is understandable by humans. The paper calls this the externalized reasoning property. The output from a model’s CoT is generated the same way its final answer is, using statistics to predict the next best word in a sequence of words. This means there is no given that the reasoning produced will make sense at all, let alone make sense to humans. However, current models have shown to learn a strong “natural language prior” through pretraining which means its use of natural language heavily matches humans.

With these assumptions in place, CoT has shown to be useful for the safety of current models. For example, current models have been shown to give away misaligned behavior by outputting phrases such as “Let’s Hack” or “Lets sabotage” into their CoT. Models’ plans to fake alignment with the true intention of achieving an alternative, misaligned goal have also been found in their CoTs.

CoT can also be used to detect important differences in a model’s training vs deployment environment. It has been shown that a model’s behavior can be heavily influenced by knowing whether or not it is currently being trained. By peering into a model’s CoT, this recognition of its training status can be detected allowing the researcher to adapt accordingly.

While CoT monitorability has a lot of promise in improving AI safety efforts, its usefulness is becoming fragile as models progress.

The Fragility of Chain of Thought Monitorability

In many ways, CoT Monitoring can become trivial in AI safety efforts. For example, with enough intelligence, an AI could potentially hide certain aspects of its reasoning from its CoT. This could be done on purpose in order to prevent humans from finding out its true intentions or unintentionally through losing the propensity to output all reasoning in the CoT. Even with the externalized reasoning property, a model is not required to externalize all of its internal thoughts.

In addition to leaving out information, a model might output wrong information. The thought process in a model’s CoT could seem to correlate with human reasoning while still not representing the actual logic used by the model. For example, a model could claim to have calculated the derivative of a given function by following an outputted set of steps when, in reality, the model just found the final answer off a math forum.

The reverse problem can also occur in which a model much smarter than humans reasons at a level humans can’t comprehend. In this scenario humans could read what the models are thinking, but lack the ability to understand what it means.

The training process for CoT itself can also cause issues. A lot of CoT training is done through human feedback. The model learns by following human judgement on the understandability and correctness of the CoT reasoning. This pressure could cause a model to optimize its responses for human likability which might not always line up with the actual reasoning of the model.

Even if CoT monitorability can overcome these issues, it is possible that more advance AI will exhibit a different architecture from today’s models where CoT doesn’t fit in at all. So even if CoT monitorability can be useful right now, it is important to assess its value in the grand scheme of AI safety over time.

With this in mind, CoT monitorability still shows enough promise right now to warrant some amount of effort into future research directions on the strategy.

Useful Research Directions in Chain of Thought Monitorability

As the need for AI safety grows, it is important to refine current and future safety techniques. The paper considers possible research directions that could help improve the effectiveness of CoT monitorability as a safety technique.

One direction is to determine the best ways to evaluate CoT monitorability. What techniques can be utilized to test how well CoT monitorability actually works as a way to detect misbehavior in a model? This includes strategies such as:

Developing methods for evaluating monitorability will help researchers in determining its viability as a strategy for misalignment detection.

Other useful directions mentioned include:

The paper also suggest a framework AI developers can use when creating new AI models. The framework calls for the creation and use of a standardized evaluation criteria for CoT monitorability, the publishing of results from this evaluation, and the use of these results in the deployment decisions related to the models of interest.

This framework allows developer’s to easily analyze how monitorable their AI models are which incorporates safety into their decision making. For example, under this framework a developer could weigh the better monitorability of an older model against the better performance of a newer one when deciding which one to deploy.

These research directions could help make CoT monitorability a more viable tool for AI safety, while the framework for AI developers would help implement the technique in practice.

While there is a lot of safety potential for CoT monitorability, it is important to balance its efficacy against other techniques in order to ensure that the best combination of strategies are being developed and used.

Possible Limitations in Pursuing Chain of Thought Monitorability

AI safety is a broad field with many different aspects of AI requiring focused effort. The paper discusses potential pitfalls from focusing too much energy on CoT monitorability.

If models are able to execute harmful tasks without any externalization of misaligned thoughts, then time spent on CoT monitorability would at best be wasted and at worse lead researchers towards a false sense of safety. This is why determining a model’s ability to execute dangerous tasks without externalization is important when considering CoT monitorability research.

If a model can easily hide its intentions and lie through its CoT, this would also render the strategy useless. This is why testing for a model’s ability to obfuscate their reasoning on purpose is important.

All these possible ways that CoT monitorability could be trivial for AI safety bring into question how much time and resources should be spent on this technique vs others in the field.

With all this taken into account, the paper recommends that this strategy be seen as an additional layer of safety rather than a replacement for other methods currently being researched.

 

In these early stages of AI development, it is important that humans use every tool they can to maintain the safety and control of these systems. More safety early on allows for more time to develop better safety measures against more complex future models. As AI progresses and the risks from the technology grow in quantity and magnitude, any amount of time humans can gain to better understand these models can greatly impact humanity’s relationship with AI for the better.

Citations:

  1. Korbak, Tomek, et al. Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety. arXiv, 15 July 2025, arXiv:2507.11473. https://doi.org/10.48550/arXiv.2507.11473.

OscarD🔸 @ 2025-09-21T01:00 (+2)

Welcome, and thanks for writing this - I agree CoT monitoring seems like a potentially very valuable tool!