When I first heard about “Mechanistic Interpretability” (MI), I was confused. Behind the term I found the motivation to interpret neural networks using methods like feature visualization, but it was unclear to me how it differed from other interpretability research.
To clarify things, I provide a brief classification of mechanistic interpretability.
Mechanistic interpretability aims to reverse-engineer a neural network into human-understandable mechanisms. Today MI focuses mostly on transformers (specifically LLMs), but it is not limited to that architecture.
MI studies the internals of neural networks and therefore excludes model-agnostic methods such as LIME or SHAP. In the taxonomy I use, many MI methods would simply fall under “model-specific interpretability”. The word “mechanistic” implies a causal perspective, but in practice the label is sometimes used more broadly for any interpretability work on neural network internals.
Mechanistic interpretability research requires tools to look inside the model. Some examples: feature visualization, circuit analysis, causal interventions within the network, and the logit lens, which projects intermediate-layer activations through the model’s final unembedding to read off token predictions at every layer (specific to transformers), among many others.
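To make the logit lens concrete, here is a minimal sketch of the idea using GPT-2 via Hugging Face transformers. This is my own illustration, not code from any particular paper; the attribute names (`transformer.ln_f`, `lm_head`) are specific to the GPT-2 implementation in that library.

```python
# Logit lens sketch: project each layer's hidden state through the final
# layer norm and unembedding matrix to see which token the model "leans
# towards" at that depth.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The Eiffel Tower is located in the city of"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states: tuple of (n_layers + 1) tensors, each (1, seq_len, d_model)
for layer, hidden in enumerate(out.hidden_states):
    resid = hidden[0, -1]  # residual stream at the last token position
    logits = model.lm_head(model.transformer.ln_f(resid))  # the "logit lens"
    top_token = tokenizer.decode(logits.argmax().item())
    print(f"layer {layer:2d}: {top_token!r}")
```

Running this typically shows the prediction becoming more sensible in later layers, which is exactly the kind of layer-by-layer story MI methods try to tell.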
I recommend Bereska, L. & Gavves, E. Mechanistic Interpretability for AI Safety — A Review. TMLR (2024) for a more detailed introduction to MI and its methods.
Neural network interpretability has been around for a while, so where did “mechanistic interpretability” suddenly come from? Especially if it’s “only” model-specific interpretability? The goal of MI and its methods is only half the story; to understand Mechanistic Interpretability, let’s look at its origins.
Origins of Mechanistic Interpretability
This section is based on Saphra, Naomi, and Sarah Wiegreffe. "Mechanistic?." arXiv preprint arXiv:2410.09087 (2024).
The name was coined in 2020 by Chris Olah (known from distill.pub and the early days of OpenAI, now at Anthropic).
MI was first picked up mostly by machine learning researchers working on computer vision, and the field expanded from there. Then came the LLM hype, and the MI community shifted its focus from convolutional neural networks (computer vision) to transformer architectures (natural language processing). The MI community has its own language: superposition, for example, refers to a network representing more features than it has neurons, so that individual neurons respond to several unrelated concepts. It’s also worth noting that there is a “superposition” (overlap) of MI and AI Safety (including AGI doomerism), which helps to better understand some motivations.
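To give a flavour of what superposition means, here is a tiny toy sketch, loosely in the spirit of Anthropic’s toy-models work but entirely my own illustration with made-up numbers: a low-dimensional layer can store many more sparse features than it has dimensions, at the cost of small interference between them.

```python
# Toy superposition: pack 500 sparse features into a 100-dimensional space.
import numpy as np

rng = np.random.default_rng(0)
n_features, n_dims = 500, 100

# Random unit directions; with n_dims << n_features they are only *almost* orthogonal.
W = rng.standard_normal((n_dims, n_features))
W /= np.linalg.norm(W, axis=0)

x = np.zeros(n_features)
x[7] = 1.0                 # sparse input: a single active feature

h = W @ x                  # compressed 100-dimensional representation
x_hat = W.T @ h            # naive linear read-out of all 500 features

print(round(x_hat[7], 2))                            # active feature recovered at ~1.0
print(round(np.abs(np.delete(x_hat, 7)).max(), 2))   # other features: small interference
```

The active feature reads back cleanly while the other 499 pick up only small interference terms; that trade-off between capacity and interference is what the MI community means by superposition.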
But by that time, there was already a vibrant NLP Interpretability community with its own research. And when the MI community started publishing papers, there was a clash. Many in the NLP Interpretability community were frustrated that their existing research was being ignored by the MI community, which was, in their eyes, reinventing the wheel.
This shows that Mechanistic Interpretability is not just a technical term; it’s a cultural term that signals which community you are in. Since then, however, the term has broadened. Given the hype around LLMs and Mechanistic Interpretability, embracing the MI label can be useful.
I hope this very brief overview serves as a useful categorization of MI. Mechanistic Interpretability is not just a bunch of methods, but an (originally) distinct community with its own assumptions and motivations.