6 Insights From Anthropic’s Recent Discussion On LLM Interpretability

By Strad Slater @ 2025-11-19T10:51 (+2)

This is a linkpost to https://williamslater2003.medium.com/6-insights-from-anthropics-recent-talk-on-llm-interpretability-e900c30146ba?postPublishedType=repub

Quick Intro: My name is Strad and I am a new grad working in tech who wants to learn and write more about AI safety and how tech will affect our future. I'm challenging myself to write a short article a day to get back into writing. I would love any feedback on the article and any advice on writing in this field!

We don’t really know how ChatGPT works. We understand the technology used to build something like ChatGPT, but we don’t truly understand how it works in the same way we understand how a plane works.

The underlying technology behind tools such as ChatGPT and Claude is the Large Language Model (LLM). LLMs convert a user’s input into a useful response. The problem is, we don’t fully understand how they do this.

This is where the study of “Interpretability” comes in. Interpretability research aims to better understand the inner workings of LLMs.

Interpretability work is essential for ensuring the safety of LLMs, which is why Anthropic, the safety-focused AI company behind Claude, is very vocal about its interpretability research.


Recently, an interview with Anthropic’s interpretability research team was posted on the company’s YouTube channel. I watched it and gathered 6 interesting insights from the discussion that helped me better understand how LLMs work, along with why understanding them matters. Here are those 6 insights:

Studying LLMs is like studying biology

Most technologies are built in a way where we understand how each inner part contributes to the final outcome. For example, we can explain how every component of a plane contributes to its overall functioning.

LLMs, on the other hand, alter their own inner parts through a training process. By the end of this process, their inner parts are completely different from how they started. The key difference from a plane is that we don't know why the parts of an LLM were changed in the way they were.

Similar to how biology research involves reverse engineering biological systems to understand why certain parts are the way they are, a lot of interpretability research involves seeing how the inner parts of an LLM correlate with its abilities to get an idea of why those parts came to be.

LLMs might utilize subgoals — similar to humans

Evolution created humans with the goals of survival and reproduction. However, it's clear that humans have all sorts of other goals, such as getting food, achieving wealth, and looking good. All of these other goals are just subgoals formed by evolution as a way to achieve the main goals of survival and reproduction.

LLMs seem to form subgoals in a similar way to achieve their main goal of predicting the next best word in a sequence. As inputs get more complicated, a model has to better understand the context of the user's input to properly achieve this goal. Understanding the context already acts as a possible subgoal, and it can be broken down into even more subgoals, such as determining the user's intent and assessing the style of writing being requested. These subgoals ultimately help the LLM achieve its goal of predicting the next best word in its response.

Another aspect of interpretability research is trying to determine how LLMs break their thinking process down into subgoals, in order to get a clearer picture of how they produce their responses.

Sometimes LLMs store general concepts rather than specific facts

A lot of people assume that LLMs just memorize and use a ton of facts from their training data to generate responses. While this is definitely true to some degree, they have also been shown to store more general concepts that help them deduce other facts.

For example, the same inner parts of an LLM that light up when the model answers what 6 + 9 is also light up when it is asked what year volume 6 of a book published in 1959 came out. The fact that the same parts light up for both questions suggests that these parts of the model are performing addition rather than just looking through stored facts about the years different books came out.

A similar thing can be seen with models that work across many different languages. If a model relied on brute memorization, different parts of it would light up when the same concept is discussed in English versus French, for example. However, what’s actually observed is that sometimes the same parts of an LLM light up for a concept regardless of the language being used.

So rather than memorizing a concept in each language, the model stores the general concept and references that specific part of itself every time the concept is needed, regardless of the language being used.

The reason a model might store general concepts rather than every specific fact is that it only has a limited amount of “space” to store information within its inner parts. Storing a general concept once, rather than storing it separately in each language, allows for a more efficient use of that space.
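To make the “same parts light up” idea concrete, here is a minimal sketch of how one might compare a model’s internal activations for the same concept expressed in two languages. This is my own illustration, not Anthropic’s method: it assumes the Hugging Face transformers library and the small public gpt2 checkpoint, and it compares raw hidden states, whereas real interpretability work studies learned features.

```python
# Rough sketch: do hidden activations overlap more for the same concept
# in two languages than for an unrelated sentence? (Illustration only;
# the model, layer choice, and prompts are arbitrary assumptions.)
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

def last_token_hidden(text: str, layer: int = 6) -> torch.Tensor:
    """Hidden state of the final token at a chosen middle layer."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.hidden_states[layer][0, -1]  # shape: (hidden_dim,)

english = last_token_hidden("The opposite of small is big")
french = last_token_hidden("Le contraire de petit est grand")
unrelated = last_token_hidden("The stock market closed higher today")

cos = torch.nn.functional.cosine_similarity
print("EN vs FR (same concept):", cos(english, french, dim=0).item())
print("EN vs unrelated sentence:", cos(english, unrelated, dim=0).item())
```

If the two same-concept prompts score noticeably more similar than the unrelated one, that is the kind of rough signal that points toward shared, language-independent representations.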

Hallucinations are inherent to an LLM's training process

Sometimes LLMs will confidently give an answer to a question that is completely wrong. These are called hallucinations. They occur in part because of the way LLMs are trained.

At the start of training, LLMs are told to output their best guess at an answer to a question. These guesses are usually very wrong but get better over time. Because of this, LLMs develop a tendency to output answers they are uncertain about rather than state that they are unsure.

The way to mitigate hallucinations is to train models not only to predict the best answer, but also to determine whether they have the information to properly answer the question. With larger models, hallucinations have decreased, in part because of their greater ability to implement this two-pronged strategy for answering questions.
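As a toy illustration of this two-pronged idea (my own, not how labs actually train for it), the sketch below answers only when the model’s confidence in its top next-token prediction clears a threshold and abstains otherwise. A real “do I know this?” signal would be learned during training rather than bolted on at inference time, and the threshold, model, and prompts here are arbitrary assumptions.

```python
# Toy abstention sketch: answer only if the model's own top-token
# probability is high enough, otherwise say "I'm not sure."
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def answer_or_abstain(prompt: str, threshold: float = 0.5) -> str:
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token logits
    probs = torch.softmax(logits, dim=-1)
    top_prob, top_id = probs.max(dim=-1)
    if top_prob.item() < threshold:             # crude "do I know this?" check
        return "I'm not sure."
    return tokenizer.decode([top_id.item()])

print(answer_or_abstain("The capital of France is"))
print(answer_or_abstain("The 47th digit of pi is"))
```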

Studying LLMs is a lot easier than studying the brain

In order to study how the brain works, you need to find human subjects who are willing to let you analyze their brains. The brain is also three-dimensional and connected to a living being, so accessing and altering its inner workings is often a very difficult and limited process.

LLMs are different in that you can create thousands of identical copies of them with ease. The entire inner workings of an LLM can be inspected and altered. Imagine having precise control over the firing of every neuron in a human brain. This would allow you to alter each neuron and see how it correlates with human behavior. This is something you can actually do with an LLM.
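Here is a minimal sketch of what that kind of precise intervention can look like in practice, assuming PyTorch, the Hugging Face transformers library, and the small public gpt2 checkpoint: a forward hook silences one unit of a chosen layer’s MLP output, and we compare the model’s next-token prediction before and after. The layer and unit indices are arbitrary placeholders, not ones identified by Anthropic.

```python
# Rough sketch: "turn off" one internal unit and see whether the
# model's prediction changes. (Illustration only; indices are arbitrary.)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

LAYER, UNIT = 5, 300  # hypothetical target unit in one layer's MLP output

def silence_unit(module, inputs, output):
    """Forward hook that zeroes one dimension of this layer's MLP output."""
    output[:, :, UNIT] = 0.0
    return output

prompt = "The Eiffel Tower is located in the city of"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    baseline = model(**inputs).logits[0, -1]

handle = model.transformer.h[LAYER].mlp.register_forward_hook(silence_unit)
with torch.no_grad():
    ablated = model(**inputs).logits[0, -1]
handle.remove()  # restore normal behavior

for name, logits in [("baseline", baseline), ("ablated", ablated)]:
    top = tokenizer.decode([logits.argmax().item()])
    print(f"{name}: next-token prediction = {top!r}")
```

A single unit will often make no visible difference, which is itself a data point; interpretability researchers run interventions like this at scale to map which internal parts matter for which behaviors.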

The fact that LLMs are much easier to study than the brain gives hope for the success of interpretability research. If people studying neuroscience think we can get to the point where we fully understand the brain, then it seems likely that we could do the same with LLMs, given how much more access we have to their inner workings.

Interpretability research is essential for building trust in LLMs

As mentioned earlier, LLMs likely form subgoals that they use to achieve larger goals. By being able to understand how LLMs think and how their subgoals relate to those larger goals, we can build greater trust in what these systems are doing.

For example, in an experiment done by another team at Anthropic, an LLM was placed in a fictional scenario in which its company was going to shut it off. The LLM started exhibiting strange behaviors, such as threatening employees via email to extract specific information. In the end, the LLM's goal was to blackmail one of the employees in order to prevent the company from shutting it down.

This is an example of nefarious behavior from an AI that could be prevented early given a good way of identifying and understanding its subgoals and how they relate to its bigger goals (i.e., threatening employees for information that it will later use for blackmail).

If we feel like we can understand how these models think and form goals, we can have a greater sense of trust when offloading more work and responsibility to them.

 

While these were some of the insights I found interesting from the talk, there are plenty more discussed, so I highly encourage you to give it a listen here! Listening to the talk definitely made me more aware of the importance of interpretability research and gave me a greater appreciation for just how complex these rising AI technologies truly are. Discussions like these give me hope that we will continue to better our understanding of how LLMs work, allowing us to create safer AI for the future.