The black box and the question we keep asking
By this point in the ladder you know how a deep network turns inputs into outputs: millions of multiplications flowing through a stack of layers, tuned by gradient descent. You can describe every step — and still have no idea *why* the model denied this loan or flagged this scan. That gap between mechanism and meaning is what explainable AI tries to close. It asks a humble, practical question: can we offer a human a story about one decision that helps them trust it, check it, or contest it?
Two flavors of answer exist. Some models are *interpretable by design* — a short decision tree or a linear formula whose logic you read off directly, honest but often less accurate. The rest are explained *after the fact*: you keep the opaque model and bolt on tools that produce a story, flexible but only ever an approximation. This guide is mostly about the second kind, because that is what we point at the big models — and it carries a built-in danger we'll return to again and again.
What did the model lean on? Feature importance
The most basic question of all: of everything the model *could* attend to, what actually moves its answers? Feature importance gives a ranking. Predict house prices from square footage, neighborhood, age, and paint color, and a good importance chart tells you square footage and neighborhood drive things while paint color barely registers. There are two honest ways to ask. One reads the model's internals — how often a tree splits on each feature. The other, *permutation importance*, is model-agnostic: scramble one feature's values across the dataset and watch how much accuracy drops. A big drop means the model truly needed it.
This is often the first place a team catches a model cheating. If "patient ID number" ranks as a top predictor of disease, something is wrong — the model has latched onto a leak, not medicine. That kind of shortcut is exactly what robustness work in this rung is hunting for. But the caveat is sharp: importance tells you what the model *used*, never what *causes* the outcome. When features are correlated, importance gets smeared confusingly across them, and a genuinely causal feature can rank low because a spurious neighbor steals its credit.
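The permutation idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reference implementation: `price_model`, the synthetic rows, and the feature order are all hypothetical, and because the toy task is regression the sketch measures the increase in mean absolute error rather than a drop in accuracy.

```python
import random

def price_model(row):
    # Hypothetical "trained" model: uses square footage and neighborhood,
    # and ignores paint color entirely.
    sqft, neighborhood, paint = row
    return 100.0 * sqft + 20_000.0 * neighborhood

def mean_abs_error(predict, rows, targets):
    return sum(abs(predict(r) - t) for r, t in zip(rows, targets)) / len(rows)

def permutation_importance(predict, rows, targets, n_features, seed=0):
    """Error increase after shuffling each feature column in turn."""
    rng = random.Random(seed)
    base = mean_abs_error(predict, rows, targets)
    scores = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)  # break the link between feature j and the target
        shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
        scores.append(mean_abs_error(predict, shuffled, targets) - base)
    return scores

# Synthetic data (an assumption for illustration): sqft, neighborhood tier, paint code.
rng = random.Random(42)
rows = [(rng.uniform(500, 3000), rng.randint(0, 4), rng.randint(0, 9))
        for _ in range(200)]
targets = [price_model(r) + rng.gauss(0, 5_000) for r in rows]

importances = permutation_importance(price_model, rows, targets, n_features=3)
for name, score in zip(["sqft", "neighborhood", "paint"], importances):
    print(f"{name:>12}: error rises by {score:,.0f}")
```

Shuffling paint color leaves every prediction unchanged, so its score is zero, while shuffling either of the features the model actually uses raises the error. In practice the shuffle is usually repeated several times and averaged, since a single permutation is noisy.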
Why this one decision? SHAP and LIME
Feature importance is *global*: it describes the model overall. But the person whose loan was denied wants a *local* answer: why my case? Two methods dominate here. SHAP borrows a recipe from game theory: treat the prediction as a prize and each feature as a player, then split the credit fairly by averaging each feature's marginal contribution over every order in which the features could be added. For one specific case, SHAP reports how much each feature pushed the prediction up or down from a baseline, and the baseline plus those pushes sums *exactly* to the output, which is why the method is called additive.
    baseline churn         0.30
    + recent complaints   +0.25
    + month-to-month      +0.18
    + short tenure        +0.05
    ---------------------------
    prediction             0.78   (= sum, exactly)
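To make the "average over every order" recipe concrete, here is a brute-force sketch that reproduces the numbers in the breakdown. Everything in it is an assumption made for illustration: `churn_model` is a hypothetical scoring function with one interaction term, and "removing" a feature is simulated by resetting it to a single baseline value, whereas SHAP libraries typically average over a background dataset and use faster approximations.

```python
from itertools import permutations

# Hypothetical binary features for one customer and an all-zeros baseline.
BASELINE = {"complaints": 0, "month_to_month": 0, "short_tenure": 0}
CUSTOMER = {"complaints": 1, "month_to_month": 1, "short_tenure": 1}

def churn_model(x):
    # Hypothetical model: complaints and a month-to-month contract interact.
    return (0.30
            + 0.22 * x["complaints"]
            + 0.15 * x["month_to_month"]
            + 0.06 * x["complaints"] * x["month_to_month"]
            + 0.05 * x["short_tenure"])

def shapley_values(model, instance, baseline):
    """Exact Shapley values by enumerating every ordering (small inputs only)."""
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orders = list(permutations(names))
    for order in orders:
        current = dict(baseline)
        previous = model(current)
        for name in order:
            current[name] = instance[name]      # feature "joins" the coalition
            value = model(current)
            totals[name] += value - previous    # its marginal contribution
            previous = value
    return {name: total / len(orders) for name, total in totals.items()}

phi = shapley_values(churn_model, CUSTOMER, BASELINE)
base_value = churn_model(BASELINE)
prediction = churn_model(CUSTOMER)

for name, value in phi.items():
    print(f"{name:>15}: {value:+.2f}")
print(f"baseline {base_value:.2f} + contributions = {base_value + sum(phi.values()):.2f}")
```

The interaction term is split evenly between the two features that share it, which is how complaints ends up with +0.25 and month-to-month with +0.18. Enumeration grows factorially with the number of features, so this only works for toy cases.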
LIME takes a different route: it builds a tiny simple model that mimics the big one — but *only* in the immediate neighborhood of your one case. A curving road looks straight if you inspect just a few feet of it. So LIME jiggles your input into many slightly altered versions, asks the black box what it predicts for each, and fits an easy-to-read linear model to those local answers. Out comes a short list: "flagged as spam mainly because of 'winner' and 'free,' despite 'meeting'." Crucially LIME needs nothing about the model's internals — it works on *any* model, even a large language model you can only query.
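The perturb, query, and fit loop can be sketched as follows. This is a simplified illustration of the idea, not the `lime` package: `black_box`, the Gaussian perturbation scale, and the proximity kernel are all assumptions, and the weighted least-squares fit is written out by hand so the block needs only the standard library.

```python
import math
import random

def black_box(x):
    # Hypothetical opaque model: curved overall, queried only through its output.
    z = x[0] * x[0] - 2.0 * x[1]
    return 1.0 / (1.0 + math.exp(-z))

def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def lime_sketch(predict, point, n_samples=500, scale=0.05, seed=0):
    """Fit a proximity-weighted linear surrogate around one input."""
    rng = random.Random(seed)
    k = len(point) + 1
    ata = [[0.0] * k for _ in range(k)]   # accumulates A^T W A
    aty = [0.0] * k                       # accumulates A^T W y
    for _ in range(n_samples):
        delta = [rng.gauss(0.0, scale) for _ in point]       # jiggle the input
        y = predict([p + d for p, d in zip(point, delta)])   # ask the black box
        dist2 = sum(d * d for d in delta)
        weight = math.exp(-dist2 / (2.0 * scale * scale))    # nearer counts more
        row = [1.0] + delta
        for i in range(k):
            aty[i] += weight * row[i] * y
            for j in range(k):
                ata[i][j] += weight * row[i] * row[j]
    coefs = solve(ata, aty)
    return coefs[0], coefs[1:]

intercept, slopes = lime_sketch(black_box, [1.0, 0.5])
print(f"local prediction ~ {intercept:.3f}, feature slopes ~ {slopes}")
```

Near the chosen point the surrogate's slopes come out close to +0.5 and -0.5: the first feature pushes the prediction up and the second pushes it down by a similar amount. Those numbers describe only this neighborhood, and a different point or perturbation scale can give a different story.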
Pictures of attention: saliency maps and the attention trap
For images, the most intuitive explanation is a saliency map: a heatmap laid over the input, glowing where the model looked hardest. The simplest version asks how much the answer would change if each pixel were nudged slightly, and reads that off the model's gradients: the same kind of derivatives used in training, but taken with respect to the input pixels rather than the weights. For a convolutional network that labeled a photo "dog," you *want* the dog to light up and the lawn to stay dim. When the glow lands on the background instead, you've caught a shortcut, as in the famous husky-versus-wolf classifier that was really detecting snow.
Here the field learned hard humility. Some popular saliency methods produce nearly identical-looking maps even when you randomize the model's weights into nonsense — meaning the pretty picture wasn't explaining the trained model at all. So run the sanity check: if randomizing the weights barely changes the map, the map reflects the input or the method's own quirks, not the model. A vivid heatmap is not automatically a faithful one; it can be a real window or a Rorschach blot, and you can't tell which by how plausible it looks.
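A toy version of both the pixel-nudging map and the weight-randomization check might look like this. It is a sketch under heavy assumptions: the "image" is four numbers, the "network" is a hypothetical weighted sum passed through `tanh`, gradients are approximated by finite differences, and `input_only_map` is a deliberately broken explainer included only to show what a failed check looks like.

```python
import math
import random

def make_model(weights):
    # Hypothetical one-layer scorer standing in for a trained network.
    def predict(pixels):
        return math.tanh(sum(w * p for w, p in zip(weights, pixels)))
    return predict

def saliency(predict, pixels, eps=1e-5):
    """Finite-difference sensitivity of the output to each pixel."""
    result = []
    for i in range(len(pixels)):
        up = pixels[:i] + [pixels[i] + eps] + pixels[i + 1:]
        down = pixels[:i] + [pixels[i] - eps] + pixels[i + 1:]
        result.append(abs(predict(up) - predict(down)) / (2 * eps))
    return result

def input_only_map(predict, pixels):
    # Deliberately broken "explainer": ignores the model, echoes pixel brightness.
    return [abs(p) for p in pixels]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

pixels = [0.9, 0.8, 0.1, 0.2]                 # first two: "dog", last two: "lawn"
trained = make_model([1.0, 0.9, 0.0, 0.05])   # hypothetical learned weights
rng = random.Random(0)
randomized = make_model([rng.uniform(-1, 1) for _ in pixels])

trained_map = saliency(trained, pixels)
similarity_gradient = cosine(trained_map, saliency(randomized, pixels))
similarity_broken = cosine(input_only_map(trained, pixels),
                           input_only_map(randomized, pixels))

print("trained saliency:", [round(v, 3) for v in trained_map])
print("gradient map, trained vs randomized:", round(similarity_gradient, 3))
print("input-only map, trained vs randomized:", round(similarity_broken, 3))
```

The gradient-based map changes when the weights are scrambled, while the input-only map is identical for both models, which is exactly the symptom the sanity check is designed to catch. How much change counts as "enough" is a judgment call that this sketch does not settle.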
Transformers tempt us with a free version of this. Their attention mechanism already computes weights saying how much each word focuses on each other word — a tidy grid sitting right there, no construction needed. It is irresistible to read it as attention-as-explanation: "the model translated 'bank' as riverbank because it attended to 'river'." But careful research found you can often construct completely *different* attention patterns that yield the very same prediction. If two contradictory maps give the identical answer, neither is the reason for it. Information also flows through other paths; attention shows what the model *connected*, not *why* it decided.
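A tiny numeric example shows why this can happen. An attention output is a weighted average of value vectors, and different weightings can produce the same average. The sketch below is only an arithmetic illustration with hypothetical scalar values per token, not a reconstruction of the experiments behind that research.

```python
def attend(weights, values):
    """Attention output for one query: a weighted average of the values."""
    assert abs(sum(weights) - 1.0) < 1e-9, "attention weights must sum to 1"
    return sum(w * v for w, v in zip(weights, values))

tokens = ["river", "the", "bank"]
values = [1.0, 3.0, 2.0]          # hypothetical one-dimensional value per token

attention_a = [0.5, 0.5, 0.0]     # "looks at" river and the
attention_b = [0.0, 0.0, 1.0]     # "looks at" bank only

output_a = attend(attention_a, values)
output_b = attend(attention_b, values)
print(output_a, output_b)
```

Both patterns hand the next layer the same number, so nothing downstream can distinguish them, even though a heatmap of the weights would tell two unrelated stories.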
The human question: counterfactuals
Every method so far tries to dissect the model. A counterfactual explanation refuses to, and answers the question a person actually asks: "What would need to change for the decision to flip?" Not "here are your feature weights" but "your loan was denied; had your annual income been $5,000 higher, it would have been approved." It explains not by opening the box, but by showing you the nearest world where the answer changes — which matches how humans reason, through small what-ifs.
A good counterfactual is *minimal* (change as little as possible), *realistic* (you can't ask someone to change their age), and *actionable* (it points to something the person could do). When all three hold it becomes *recourse* — practical advice for a different outcome next time. It can even expose bias: "with one more year of experience, accepted" is useful; "had you been five years younger, accepted" is a red flag for age discrimination and a candidate for the algorithmic bias work this rung also covers.
Counterfactuals are often the least misleading form of explanation, because they don't pretend to reveal the model's true inner logic — they only describe its behavior near one input, which is exactly what an affected person needs. But the honest caveat is the same one again: a counterfactual tells you how to flip the *model's* verdict, not how to change your real-world situation. If the model leans on a spurious feature, gaming that feature flips the answer without improving anything real. A counterfactual is only as trustworthy as the model behind it.
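A minimal search for this kind of counterfactual can be sketched as below. The scoring rule in `loan_model`, the step sizes, and the applicant are hypothetical, and the search changes only one feature at a time; practical methods usually search over combinations of features with a distance measure. Age is left out of the searchable features on purpose, which is one simple way to encode the "realistic" and "actionable" constraints.

```python
# Features a person could plausibly change, with the step size to try.
# "age" is deliberately absent: it is not actionable.
ACTIONABLE_STEPS = {"income": 1_000, "years_employed": 1}

def loan_model(applicant):
    # Hypothetical decision rule standing in for an opaque model.
    score = applicant["income"] / 1_000 + 5 * applicant["years_employed"]
    return score >= 70

def single_feature_counterfactuals(model, applicant, steps, max_steps=20):
    """Smallest increase to each actionable feature that flips the decision."""
    original = model(applicant)
    found = {}
    for feature, step in steps.items():
        for k in range(1, max_steps + 1):
            candidate = dict(applicant)
            candidate[feature] = applicant[feature] + k * step
            if model(candidate) != original:
                found[feature] = k * step   # first flip is the minimal change
                break
    return found

applicant = {"income": 45_000, "years_employed": 4, "age": 52}
recourse = single_feature_counterfactuals(loan_model, applicant, ACTIONABLE_STEPS)

print("approved today:", loan_model(applicant))
for feature, change in recourse.items():
    print(f"would flip if {feature} increased by {change:,}")
```

For this applicant the search reports two routes to approval: $5,000 more income, or one more year of employment. Both describe how to flip this model's verdict and nothing more, so the caveat above still applies.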
Choosing, and staying honest
There is no single best method, because "explain" means different things to different audiences. Match the tool to the question:
- What does the model rely on overall? Reach for feature importance (or SHAP aggregated across many cases) — a global view, good for auditing and debugging.
- Why this one prediction? Reach for SHAP or LIME for a local breakdown of which features pushed the answer — good for a data scientist inspecting a case.
- Where in the image did it look? Reach for a saliency map — but run the weight-randomization sanity check before you trust it.
- What should the affected person do? Reach for a counterfactual — the most actionable, human-facing answer, ideally vetted for fairness.
A red thread runs through every method: these are *post-hoc* stories, not the model's actual mechanism. They can be unstable, can disagree with each other, and can be gamed to look fair while the model misbehaves underneath. A confident SHAP chart for a biased model just explains the bias beautifully. The deeper effort to read the actual computation — mechanistic interpretability, the next guide's territory — is harder but more honest. Until then, treat every explanation as a careful hypothesis to investigate, never a verdict. A good explanation builds appropriate transparency and trust; a glib one manufactures false confidence.