Deep neural networks such as GPT aren’t known for their interpretability.
OpenAI recently tried to interpret GPT-2 anyway. Their approach is interesting because they used GPT-4 to explain what the neurons in GPT-2 do [1].
Let’s explore this approach.
Explaining LLMs with more LLMs
The goal of the approach: understand what each neuron does in the neural network.
In this case, the neuron’s job in the model is equated with the neuron’s activation. When you feed some text to GPT-2 and forward propagate it as you normally would to generate new text, you also get, for each neuron, one activation per input token.
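To make "one activation per neuron per token" concrete, here is a minimal toy sketch. It is not GPT-2: the embeddings, weights, and biases are made-up numbers, the names `EMBEDDINGS`, `NEURONS`, and `token_activations` are hypothetical, and I use a ReLU for simplicity. The only point is the shape of the result: one row per token, one column per neuron.

```python
# Toy illustration only: made-up 2-dimensional embeddings and two made-up
# "neurons". None of these numbers come from GPT-2.
EMBEDDINGS = {
    "the": [0.1, 0.0],
    "cat": [0.9, 0.2],
    "sat": [0.2, 0.8],
}
NEURONS = [
    {"weights": [1.0, 0.0], "bias": -0.5},  # reacts to the first dimension
    {"weights": [0.0, 1.0], "bias": -0.5},  # reacts to the second dimension
]


def relu(x):
    """Simplified non-linearity (an assumption made for this sketch)."""
    return max(0.0, x)


def token_activations(tokens):
    """Return one row per token; each row holds one activation per neuron."""
    rows = []
    for token in tokens:
        embedding = EMBEDDINGS[token]
        row = []
        for neuron in NEURONS:
            pre_activation = sum(
                w * e for w, e in zip(neuron["weights"], embedding)
            ) + neuron["bias"]
            row.append(relu(pre_activation))
        rows.append(row)
    return rows


activations = token_activations(["the", "cat", "sat"])
for token, row in zip(["the", "cat", "sat"], activations):
    print(token, row)
```

In this toy, the first neuron only "gets excited" by "cat" and the second only by "sat". Reading off the column for a single neuron gives exactly the kind of per-token activation scores the approach works with.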
Imagine a neuron as a person, and you want to find out which topic most excites this person. So you talk about different topics until you notice that some topic makes the person really excited. It’s similar for neurons: The approach presented by OpenAI is about studying the activations for different input texts and seeing which “concept” activates a neuron. The neurons, however, would be super weird party guests, because they are excited by things like “two-letter combinations starting with 's' or 'S'” or “phrases related to certainty and confidence.”
The challenge is to scale the approach to many neurons and texts and to automate the description of the concept that excites a neuron. Because even if you identified that certain tokens in various texts excite a neuron, you still have to identify what these texts have in common. That’s where GPT-4 comes into play.
Here is a rough overview of how OpenAI’s approach of interpreting GPT-2 with GPT-4 works for a single neuron:
1. For lots of texts, get the activations for that neuron. This produces activation scores for the input tokens of each text.
2. Select the texts with the highest activations and some random texts.
3. Explain the neuron’s activations using GPT-4: Put the selected texts into a prompt and ask GPT-4 what the words/phrases with high activation have in common. This produces a short description of what the neuron supposedly does.
4. Simulate activations with GPT-4: Take a new text. Ask GPT-4 to “predict” the activations for the new text given the short description from the previous step.
5. Score explanations by comparing simulated and real activations. To assess the quality of the short description, compare the activations simulated by GPT-4 with the actual activations for this text. This gives an assessment of how well the short description predicts activations.
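The selection and prompting steps from the list above could look roughly like the sketch below. This is my own illustration, not OpenAI’s code: ranking texts by their peak activation, the function names `select_texts` and `build_explanation_prompt`, and the prompt wording are all assumptions, and no LLM is actually called.

```python
import random


def select_texts(records, n_top=2, n_random=1, seed=0):
    """Pick top-activating texts plus a few random ones for ONE neuron.

    records: list of (tokens, activations) pairs.
    Ranking by the maximum activation in a text is an assumption of this sketch.
    """
    ranked = sorted(records, key=lambda rec: max(rec[1]), reverse=True)
    top = ranked[:n_top]
    rest = ranked[n_top:]
    rng = random.Random(seed)  # seeded so the selection is reproducible
    sampled = rng.sample(rest, min(n_random, len(rest)))
    return top + sampled


def build_explanation_prompt(selected):
    """Format tokens with their activations into a prompt (hypothetical wording)."""
    lines = [
        "Below are texts with per-token activations of one neuron.",
        "Describe in one short phrase what the highly activating tokens have in common.",
        "",
    ]
    for tokens, activations in selected:
        lines.append(" ".join(f"{t}({a:.1f})" for t, a in zip(tokens, activations)))
    return "\n".join(lines)


# Made-up example data for a neuron that seems to like Canada-related tokens.
records = [
    (["I", "love", "Canada"], [0.0, 0.1, 0.9]),
    (["nice", "weather"], [0.0, 0.1]),
    (["Toronto", "is", "big"], [0.7, 0.0, 0.1]),
    (["hello"], [0.2]),
]
print(build_explanation_prompt(select_texts(records)))
```

The resulting string is what you would send to the explainer model. Its answer (for example, “Canada”) is the short description that gets evaluated afterwards.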
I find the last two steps quite clever. In theory, you could stop at step 3 and already have some descriptions of what the neurons (supposedly) do. But they would lack evaluation and we wouldn’t know how much to trust a short description. Steps 4 and 5 are about testing how well the descriptions “predict” activations for new texts. That’s like the supervised machine learning mindset (“a good model predicts well”) but mixed with an LLM mindset (“let’s turn everything into a language task”). More on that later.
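Here is a minimal sketch of what the scoring could look like, assuming that “comparing” means computing a Pearson correlation between simulated and real activations. That choice of metric, the function names, and all numbers are assumptions for illustration. The simulated values are hard-coded stand-ins for what a simulator model might return, since no LLM is called here.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant sequence carries no signal to correlate with
    return cov / math.sqrt(var_x * var_y)


def score_explanation(real, simulated):
    """Higher means the description predicts the real activations better."""
    return pearson(real, simulated)


# Made-up real activations for the tokens of a new text.
real = [0.0, 0.1, 0.9, 0.0]
# Hypothetical simulator outputs under a good and a bad description.
simulated_good = [0, 1, 9, 0]
simulated_bad = [5, 5, 0, 5]

print(score_explanation(real, simulated_good))
print(score_explanation(real, simulated_bad))
```

A nice property of correlation is that the simulator doesn’t have to hit the exact scale of the real activations. It only has to get the pattern right: high where the neuron is high, low where it is low.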
Some interesting neurons
What gets the neurons excited? Here are some of the concepts that were assigned to individual neurons:
- similes 
- Marvel comics vibes 
- things done correctly 
- phrases related to certainty and confidence 
- words and phrases related to opposition or comparison 
- references to movies, characters, and entertainment 
- shared last names 
- subjunctive verbs 
- Canada 
For a full list, see here. As you can see, it’s a wide range of concepts. Something to be aware of: When investigating the concept descriptions, you should always take the evaluation score into account. The lower the score, the less trustworthy the explanation is.
Some thoughts on interpretability
Let’s have a look at the authors’ approach through an interpretability lens.
The approach is model-specific (applicable to neural networks) and assigns meaning to parts of the model. This approach works best if the model output is the sum of the model parts. For linear models, we can reconstruct the model outputs when we know the coefficients that are assigned to each input feature.
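A tiny sketch of what “sum of the parts” means for a linear model. The feature names and coefficients are made up, and `linear_contributions` is a hypothetical helper, not part of any library.

```python
def linear_contributions(coefficients, intercept, x):
    """Split a linear model's prediction into one contribution per feature."""
    contributions = {name: coefficients[name] * x[name] for name in coefficients}
    prediction = intercept + sum(contributions.values())
    return contributions, prediction


# Made-up linear model with two features.
coefficients = {"rooms": 2.0, "age": -0.5}
intercept = 1.0
x = {"rooms": 3.0, "age": 4.0}

contributions, prediction = linear_contributions(coefficients, intercept, x)
print(contributions)
print(prediction)
```

Knowing the parts (intercept plus per-feature contributions) is enough to reconstruct the output exactly. That additive property is what makes assigning meaning to individual parts so informative for linear models.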
The output of an LLM is clearly more than the sum of its parts. Even if this approach could find a perfect description of what maximally activates each neuron, we wouldn’t understand from that how the output comes together. The massive scale of the model architecture and the non-linear interactions between the neurons prohibit such an interpretation.
But can we maybe learn a lot through individual activations? Here as well we face some fundamental challenges:
- Concepts may be connected to many neurons. Each individual neuron may be only weakly activated, yet together, through their sum and interactions, these neurons may still detect the concept.
- A single neuron may represent multiple concepts (superposition). 
- A concept might be too complex to fit into a short language description. 
- A neuron might be activated only on very rare occasions.
In the analysis, there were many neurons that couldn’t be matched with a concept, at least not with a good evaluation score. That is the usual outcome with this type of model interpretation: some neurons match concepts really well (these you can then showcase), but there is a long tail of neurons that can’t be described with a single concept.
Should we explain LLMs with LLMs?
A common critique of ML interpretability is that we explain complex models with other complex methods. In this case, we even use a more complex model (GPT-4) to interpret a model that is already complex (GPT-2).
I liked the idea of scoring the explanations by using the short description of the concept to see how well it reproduces the activations in other texts. This moves the argument from “let’s trust GPT-4’s output” to “let’s trust the score”. But it’s not perfect. There are two stages at which the “capabilities” of GPT-4 (or rather its limitations, like stochasticity) might affect the quality of the explanations. First, the generated short description might be wrong, too broad, or too specific. Second, when generating the predicted activations, shortcomings of GPT-4 might affect the score. The computation of the score itself, at least, is not done by an LLM … yet 🙃
Another issue: The limitations of the context window mean that we can’t feed all the texts to GPT-4 when generating the short description, hence only a selection is used.
So, not perfect, but I think the alternative would be to generate and evaluate the concept descriptions with humans, which would be a rather tedious task.
What I found very interesting is the mindset behind the approach. In my book Modeling Mindsets, I write about different approaches to modeling data. One mindset is supervised learning, and that’s clearly applied to how the explanations are evaluated (label=activation, input=text+description). But there’s another mindset that’s becoming more common, which I would call the LLM mindset. This mindset is all about turning tasks into language tasks and prompting LLMs such as GPT-4. A part of the interpretation approach was to write good prompts for GPT-4. It’s not surprising to see OpenAI approaching even questions of interpretability with an LLM mindset.
[1] Bills et al., "Language models can explain neurons in language models", 2023.


