Looking Inside: Mechanistic Interpretability

From "which input" to "what computation"

The previous guides in this rung gave you tools that treat a model as a black box and ask, from the outside, why did it predict that? A saliency map highlights influential pixels; explainable-AI methods attribute a score to each input feature. These are genuinely useful, but they answer a limited question — which inputs mattered for one prediction — not the deeper one: what is the network actually computing? Mechanistic interpretability sets a far more ambitious goal. It treats a trained neural network like a compiled program with no source code and tries to reverse-engineer that program back into human-readable algorithms.

The analogy is worth taking seriously. Training writes a program into millions or billions of numbers — the parameters — using gradient descent rather than a human typing code. We have the full binary: every weight is sitting right there, fully observable. What we lack is the comments, the variable names, the structure. Mechanistic interpretability is the disassembly project: take those raw numbers and recover the meaningful units of computation hidden inside them.

Probing: asking what a layer already knows

The gentlest way in is probing. The idea: freeze the network, run some inputs through it, and grab the internal activations at a chosen hidden layer. Then train a tiny, simple classifier (a probing classifier, usually just linear) to predict some property of interest from those activations. If a simple linear probe can read "is this word a verb?" or "is this board position winning?" straight out of layer 12 with high held-out accuracy, that property is evidently represented there, in an easily accessible form.

Probing across layers paints a developmental picture: early layers in a vision net carry edges and textures, while later layers carry object parts; early layers of a transformer tend to track surface word forms, middle layers syntax, and later layers more abstract meaning. This is how researchers find evidence that an embedding or activation encodes something abstract that no one explicitly trained it to store.

Probing has a subtle built-in limit: it tells you a property is decodable from a layer, not that the network itself uses it. Information can be present yet ignored by the downstream computation. Probing answers "is it there?"; it does not answer "does the model rely on it?", and that gap is exactly what the next, more causal methods try to close.
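The probing recipe above can be sketched in a few lines. This is a minimal illustration under loud assumptions: the "frozen network" is a made-up three-unit tanh layer with hypothetical weights (W_FROZEN), the property of interest is simply whether the first input coordinate is positive, and the probe is plain logistic regression trained by gradient descent. None of these names or numbers come from a real model.

```python
import math
import random

random.seed(0)

# Hypothetical frozen weights standing in for one hidden layer of a trained network.
W_FROZEN = [[0.9, -0.4], [0.3, 0.8], [-0.5, 0.6]]

def hidden_activations(x):
    """Run an input through the frozen layer and return its activations."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W_FROZEN]

def make_dataset(n):
    """Pairs of (activations, label); the label is the property we probe for."""
    data = []
    for _ in range(n):
        x = [random.uniform(-1, 1), random.uniform(-1, 1)]
        data.append((hidden_activations(x), 1 if x[0] > 0 else 0))
    return data

def train_linear_probe(data, epochs=100, lr=0.5):
    """Fit logistic regression on the activations; the network is never updated."""
    w, b = [0.0] * len(data[0][0]), 0.0
    for _ in range(epochs):
        for h, y in data:
            z = b + sum(wi * hi for wi, hi in zip(w, h))
            err = 1.0 / (1.0 + math.exp(-z)) - y
            w = [wi - lr * err * hi for wi, hi in zip(w, h)]
            b -= lr * err
    return w, b

def probe_accuracy(w, b, data):
    """Fraction of examples where the probe's decision matches the label."""
    hits = 0
    for h, y in data:
        predicted = 1 if b + sum(wi * hi for wi, hi in zip(w, h)) > 0 else 0
        hits += predicted == y
    return hits / len(data)

train_set, test_set = make_dataset(400), make_dataset(200)
w, b = train_linear_probe(train_set)
accuracy = probe_accuracy(w, b, test_set)
print(f"held-out probe accuracy: {accuracy:.2f}")
```

High held-out accuracy here only shows that the property is linearly decodable from the layer. Nothing in this sketch tests whether a downstream computation would actually use it.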

Features and circuits: the working vocabulary

Mechanistic interpretability builds on two core nouns. A feature here is not the input feature you fed the model, but an internal direction in activation space that consistently responds to something meaningful: a curve detector in a vision net, a "this text is in French" direction in a language model, a direction that activates on mentions of the Golden Gate Bridge. A circuit is a small connected subgraph of features and weights that together implement a specific behaviour, like a tiny algorithm wired out of activations and connections.

A celebrated example is the "induction head" circuit found in transformers: a pair of attention heads in different layers that, working together, implement the rule "if the pattern AB appeared earlier and I just saw A again, predict B." It is a genuine copy-and-continue algorithm, discovered not by reading documentation but by tracing how information flows through the attention mechanism. Vision researchers similarly traced circuits that build a car detector out of earlier window, wheel, and car-body detectors. These are real, reproducible findings: small algorithms recovered from the weights.
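To make the induction rule concrete, here is a minimal sketch of the behaviour written as ordinary code. It is not how a transformer computes it: real heads work with learned attention over vectors, not exact token matching. The two steps are only loose analogues of the two cooperating heads, and the function name induction_predict is made up for illustration.

```python
def induction_predict(tokens):
    """Predict the next token with the rule: if 'A B' appeared earlier
    and the sequence now ends in 'A', predict 'B'. Returns None otherwise."""
    if not tokens:
        return None

    # Step 1 (analogue of a previous-token head): every position records
    # which token came immediately before it.
    previous_token = [None] + tokens[:-1]

    # Step 2 (analogue of the induction head): from the current token, look
    # back for the most recent position whose *previous* token matches it,
    # then copy the token found at that position.
    current = tokens[-1]
    for pos in range(len(tokens) - 1, 0, -1):
        if previous_token[pos] == current:
            return tokens[pos]
    return None


print(induction_predict(["A", "B", "C", "A"]))  # B
print(induction_predict(["x", "y", "z"]))       # None
```

The point of the two-step structure is that neither step alone is enough: the copy only works because earlier positions already carry information about what preceded them.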

But here lies the field's central headache: polysemanticity. Most neurons are not clean, single-meaning units. One neuron might fire for cat faces, the fronts of cars, and certain legal phrasing all at once. The leading explanation, known as superposition, is that networks pack in more features than they have neurons by storing them as overlapping directions: a network can represent thousands of concepts in a few hundred dimensions by letting features share neurons. This is why you cannot just read a model off its neurons one at a time.

# A neuron's activation is a sum over features it participates in:
#   neuron_j = w1*feat_A + w2*feat_B + w3*feat_C + ...
# Sparse autoencoders try to invert this:
#   activations  ->  (encode)  ->  many sparse feature units
#                <-  (decode)  <-  reconstruct the activations
# Goal: each recovered unit means ONE thing.

Sparse autoencoders aim to untangle polysemantic neurons back into monosemantic features.

A major recent line of work uses sparse autoencoders to attack polysemanticity head-on: train an autoencoder that re-expresses a layer's activations as a much larger but mostly zero set of units, each ideally meaning one clean thing. Applied at scale, this has surfaced millions of features in production language models, many of them human-interpretable, including features you can amplify to steer behaviour. It is the most promising tool the field has for turning tangled neurons into a readable vocabulary, though it is far from a solved problem.
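The encode and decode steps can be illustrated with a hand-built toy, sketched here under strong assumptions. Three hypothetical features are packed into a layer of only two neurons as directions 120 degrees apart, and the "autoencoder" uses those same directions as fixed encoder and decoder weights. A real sparse autoencoder would learn its weights by minimizing reconstruction error plus a sparsity penalty, which this sketch skips entirely.

```python
import math

# Three hypothetical feature directions squeezed into a 2-neuron layer.
ANGLES_DEG = [90, 210, 330]
DIRECTIONS = [(math.cos(math.radians(a)), math.sin(math.radians(a))) for a in ANGLES_DEG]

def layer_activation(feature_values):
    """What the two neurons show: a weighted sum of the feature directions."""
    return [sum(f * d[i] for f, d in zip(feature_values, DIRECTIONS)) for i in range(2)]

def encode(activation):
    """Project onto every feature direction, then ReLU away negative overlap."""
    return [max(0.0, sum(a * di for a, di in zip(activation, d))) for d in DIRECTIONS]

def decode(code):
    """Rebuild the 2-neuron activation from the 3-unit sparse code."""
    return layer_activation(code)

recovered = []
for k in range(3):
    true_features = [1.0 if j == k else 0.0 for j in range(3)]
    neurons = layer_activation(true_features)   # polysemantic view
    code = encode(neurons)                      # one unit per feature
    recovered.append(code)
    rounded_neurons = [round(v, 3) for v in neurons]
    rounded_code = [round(v, 3) for v in code]
    print(f"feature {k}: neurons={rounded_neurons} code={rounded_code}")

# Reconstruction check for a single active feature.
sample = layer_activation([0.0, 1.0, 0.0])
reconstruction = decode(encode(sample))
```

In this toy, the second neuron responds to all three features, yet each encoded unit responds to exactly one. The trick only works while features are sparse: with two features active at once, encode returns roughly half of each, which is one reason the sparsity assumption matters.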

How you prove a circuit is real

A story about a circuit is just a hypothesis until you test it causally. The workhorse technique is intervention: don't just observe the network, edit it mid-run and watch what changes. This is the interpretability cousin of the ablation study you met earlier — but performed surgically on internal components rather than on training ingredients.

Form a hypothesis: "this head copies the subject's gender into the verb position."
Ablate it: zero out that component, or replace its output with an average value, and check whether the target behaviour breaks.
Patch it: run a clean and a corrupted input, then splice a single activation from the clean run into the corrupted one, or the reverse (activation patching), to localize exactly which part carries the effect.
Confirm: predict in advance what edits should and should not change, then verify the model behaves as your circuit story says.
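The ablate and patch steps above can be sketched on a toy network. Everything here is an assumption made for illustration: a hypothetical three-unit ReLU layer with hand-picked weights (W_HIDDEN, W_OUT), a "clean" input where the signal of interest (input 0) is present, and a "corrupted" input where it is removed. Real work does the same thing on attention heads or residual-stream activations, usually with hooks from a deep learning framework.

```python
# Hypothetical toy network: 3 inputs -> 3 ReLU hidden units -> 1 output.
W_HIDDEN = [
    [1.0, 0.0, 0.5],  # unit 0 reads input 0 strongly
    [0.1, 1.0, 0.0],  # unit 1 reads input 0 weakly
    [0.0, 0.5, 1.0],  # unit 2 ignores input 0
]
W_OUT = [2.0, 1.0, -1.0]

def hidden(x):
    """Hidden-layer activations for input x."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W_HIDDEN]

def forward(x, overrides=None):
    """Run the network, optionally overwriting chosen hidden units mid-run."""
    h = hidden(x)
    for unit, value in (overrides or {}).items():
        h[unit] = value
    return sum(v * hi for v, hi in zip(W_OUT, h))

clean = [1.0, 1.0, 1.0]      # input on which the behaviour appears
corrupted = [0.0, 1.0, 1.0]  # same input with the key signal removed

clean_out, corrupted_out = forward(clean), forward(corrupted)
clean_hidden = hidden(clean)

# Ablation: zero one unit on the clean run and measure how far the output moves.
ablation_effect = [clean_out - forward(clean, {u: 0.0}) for u in range(3)]

# Activation patching: copy one unit's clean activation into the corrupted run
# and measure what fraction of the clean-vs-corrupted gap it restores.
restored = [
    (forward(corrupted, {u: clean_hidden[u]}) - corrupted_out) / (clean_out - corrupted_out)
    for u in range(3)
]

for u in range(3):
    print(f"unit {u}: ablation effect {ablation_effect[u]:+.2f}, gap restored {restored[u]:.2f}")
```

In this toy, ablation shows that every unit influences the output, but patching shows that almost all of the difference between the clean and corrupted runs flows through unit 0. That contrast is why the two interventions are used together: ablation asks whether a component matters at all, while patching localizes where one specific piece of information is carried.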

This causal discipline is what separates mechanistic interpretability from just-so storytelling. It also sharply distinguishes it from reading attention weights as an explanation — a tempting shortcut that often misleads, because where a model looks is not reliably the same as why it decides. The same skepticism applies to a saliency map: a pretty heatmap can be persuasive and still not reflect the true computation.

Why bother — and where it connects

The payoff is more than scientific curiosity. If you can find the circuit a model uses, you can debug it: discover that a classifier has latched onto a spurious correlation (the snow behind the husky, the watermark on the medical scan) and is exhibiting shortcut learning rather than using the real signal. You can edit knowledge, steer behaviour by amplifying or suppressing a feature, and check whether a fix actually removed a problem or just hid it. Mechanistic insight also feeds back into AI alignment and AI safety research, where the dream is to detect deception or unsafe reasoning by inspecting internals rather than only watching outputs.

Be clear-eyed about the limits, though. Today the field can fully reverse-engineer toy models and isolated circuits, and can surface large feature dictionaries in real models — but it cannot hand you a complete, faithful blueprint of a frontier large language model. The work is labour-intensive, often relies on still-unproven assumptions (like the linear-feature view), and findings in one model may not transfer to another. "We found a circuit" is real progress; "we understand the whole model" is not yet on the table.