Error Analysis & Debugging

Past the scoreboard: actually look at the mistakes

By now you can compute a clean test score on a held-out set and you trust it, because earlier rungs taught you honest validation and the confusion matrix. But a number like "91%" is a verdict, not an explanation. It tells you *how often* the model is wrong; it never tells you *what kind* of wrong, or *why*. Error analysis is the habit of closing the spreadsheet and reading the actual cases your model got wrong — one by one, with your own eyes.

Here is the concrete recipe practitioners actually use, and it is humbler than it sounds: pull 100 to 150 examples the model misclassified, put them in a table, and tag each one with a short reason such as "blurry photo", "sarcasm", "label was wrong", or "rare dog breed". Then count the tags. The errors usually cluster: perhaps 40% of misses share one cause you can fix in an afternoon, while the rest form a long tail of one-off oddities you can set aside for now. An hour of this reading often does more than a week of blind hyperparameter tuning.
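The counting step can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the `tagged_errors` list and the `tag_breakdown` helper are hypothetical names, and the tags themselves would come from your own reading of the misclassified examples.

```python
from collections import Counter

# Hypothetical hand-written tags: (example_id, reason) pairs recorded
# while reading misclassified examples one by one.
tagged_errors = [
    (17, "blurry photo"),
    (42, "label was wrong"),
    (58, "blurry photo"),
    (63, "rare dog breed"),
    (71, "blurry photo"),
    (90, "sarcasm"),
]


def tag_breakdown(tagged):
    """Return (tag, count, share_of_errors) tuples, most frequent tag first."""
    counts = Counter(tag for _, tag in tagged)
    total = sum(counts.values())
    return [(tag, n, n / total) for tag, n in counts.most_common()]


for tag, n, share in tag_breakdown(tagged_errors):
    print(f"{tag:<16} {n:>3}  {share:.0%}")
```

Sorting by count puts the largest cluster at the top of the table, which is usually the first candidate for a fix.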

Slicing: where one number hides ten stories

An aggregate score is an average, and averages lie by smoothing. A model at 92% overall might be 97% on the easy, common cases and 61% on a small but important subgroup — and if that subgroup is, say, night-time photos or speakers of a minority dialect, the average quietly buries a failure that matters enormously in deployment. Slicing means computing your metric separately on meaningful subsets of the data instead of only on the whole.

Choose slices that mean something to a human: by class (the earlier rung on class imbalance already showed why rare classes deserve their own row), by input property (length, image size, time of day), by data source, and, critically, by demographic group when the decision affects people. Reading slices is also how you catch the model leaning on a spurious correlation: if accuracy collapses on the slice where a giveaway artifact is absent, the model was riding that artifact, not learning the task.
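As a rough sketch of slicing, the snippet below computes accuracy per subset next to the overall number. The record layout (`label`, `pred`, and a `time_of_day` field) and the `accuracy_by_slice` helper are assumptions made for illustration; any column you can group by would work the same way.

```python
from collections import defaultdict


def accuracy_by_slice(records, key):
    """Group records by records[i][key]; return {slice: (accuracy, n_examples)}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        totals[r[key]] += 1
        hits[r[key]] += int(r["label"] == r["pred"])
    return {s: (hits[s] / totals[s], totals[s]) for s in totals}


# Hypothetical toy predictions, just enough to show the effect.
records = [
    {"label": "cat", "pred": "cat", "time_of_day": "day"},
    {"label": "dog", "pred": "dog", "time_of_day": "day"},
    {"label": "cat", "pred": "cat", "time_of_day": "day"},
    {"label": "dog", "pred": "cat", "time_of_day": "night"},
    {"label": "cat", "pred": "cat", "time_of_day": "night"},
]

overall = sum(r["label"] == r["pred"] for r in records) / len(records)
by_time = accuracy_by_slice(records, "time_of_day")

print(f"overall: {overall:.0%}")
for name, (acc, n) in by_time.items():
    print(f"{name:<6} acc={acc:.0%}  n={n}")
```

The helper returns the slice size alongside the accuracy because a score computed on a handful of examples is noisy and should be read with that in mind.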

Bias vs variance: the master diagnosis

Once you know *which* errors hurt, the next question is *why* the model makes them — and almost every cause collapses into one of two diagnoses you met as the bias-variance tradeoff. High bias (underfitting) means the model is too simple or too lazy to fit even the training data: it gets things wrong it has already seen. High variance (overfitting) means it memorized the training set's quirks and fails to generalize: great on train, poor on test.

You tell them apart by comparing two errors against a reference. First pick a target: human-level performance or an existing baseline gives the error floor you can realistically reach. Now look at the gap from that target to your *training* error, and the gap from training error to *validation* error. A big first gap is bias; a big second gap is variance. That second gap is exactly the train-test gap you learned to watch; here it becomes a steering wheel, not just a warning light.

target (human/baseline) = 2%
train error             = 3%   -> bias gap  = 1%  (small)
val   error             = 12%  -> var. gap  = 9%  (big!)

=> low bias, HIGH variance: model overfits.
   fix: more data, regularize, simpler model, augment.

( if train error were 11% instead:
  bias gap = 9% (big) -> underfitting;
  fix: bigger model, train longer, better features. )

The two-gap diagnosis. The bigger gap names your problem and points at the fix.

The fixes are opposites, which is why the diagnosis must come first. For high bias you add capacity, train longer, or engineer better features. For high variance you do the reverse: gather more data, add regularization, simplify the model, or apply data augmentation. Apply a variance fix to a bias problem and you make things worse: a smaller model underfits even harder. The no-free-lunch intuition shows up here too: there is no universal knob, only the right knob for *this* failure.
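The two-gap diagnosis is simple enough to write down as a function. The sketch below assumes errors are given in percentage points and uses the rule that the bigger gap names the problem; the `diagnose` name and the wording of the returned suggestions are illustrative only.

```python
def diagnose(target_err, train_err, val_err):
    """Compare the target->train gap (bias) with the train->validation gap (variance).

    All errors are in percentage points. Returns (diagnosis, bias_gap, variance_gap).
    """
    bias_gap = train_err - target_err
    variance_gap = val_err - train_err
    if bias_gap > variance_gap:
        diagnosis = "high bias: bigger model, train longer, better features"
    elif variance_gap > bias_gap:
        diagnosis = "high variance: more data, regularize, simpler model, augment"
    else:
        diagnosis = "gaps are equal: no single dominant problem"
    return diagnosis, bias_gap, variance_gap


# The two scenarios from the worked example above.
print(diagnose(target_err=2, train_err=3, val_err=12))
print(diagnose(target_err=2, train_err=11, val_err=12))
```

In practice both gaps can be large at once, in which case the usual advice is to attack the larger one first and re-measure.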

Learning curves: watch the gaps move

The single most informative picture in debugging is the learning curve: plot training error and validation error as you feed the model more data (or more training time). The two curves tell the whole story without you guessing. If both curves have flattened at a high error and sit close together, you are bias-bound — more data will not help, because the model already can't fit what it has; you need a stronger model.

If instead there is a wide and persistent vertical gap — training error low, validation error stubbornly high — that gap is variance, and the curves' shape tells you whether more data would close it (validation still trending down) or whether you've plateaued (it's flat). This is the honest answer to the perennial question "should I collect more data?" Don't argue about it; plot the curve and look.
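Once the curve points are measured, the reading rules above can be approximated in code. This is a sketch under stated assumptions: errors are in percentage points at increasing training-set sizes, `read_learning_curve` is a hypothetical helper, and the thresholds `gap_tol` and `flat_tol` are arbitrary illustrative values you would tune to your own noise level. It only looks at the last two points, so it is no substitute for plotting the curve.

```python
def read_learning_curve(train_errs, val_errs, target_err, gap_tol=2.0, flat_tol=0.5):
    """Classify the end of a learning curve (errors in percentage points).

    gap_tol:  largest train/validation gap still counted as "close together".
    flat_tol: smallest drop in validation error still counted as "improving".
    Both thresholds are illustrative assumptions, not standard values.
    """
    if len(train_errs) != len(val_errs) or len(train_errs) < 2:
        raise ValueError("need two equally long curves with at least 2 points")

    gap = val_errs[-1] - train_errs[-1]        # vertical gap at the largest size
    val_drop = val_errs[-2] - val_errs[-1]     # positive means still improving

    if gap <= gap_tol:
        if train_errs[-1] - target_err > gap_tol:
            return "bias-bound: curves are close but error is high"
        return "near target: curves are close and error is low"
    if val_drop > flat_tol:
        return "variance: validation error still falling, more data may help"
    return "variance: validation error has plateaued"


# Hypothetical measurements at four increasing training-set sizes.
print(read_learning_curve([8, 10, 11, 11.5], [16, 13, 12.5, 12.2], target_err=2))
print(read_learning_curve([1, 2, 3, 3], [20, 16, 13, 11], target_err=2))
print(read_learning_curve([1, 2, 3, 3], [14, 12.5, 12.1, 12.0], target_err=2))
```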

Debugging is a loop, not a hunch

Put it together and model debugging stops being mysterious tinkering and becomes a disciplined loop. The mistake beginners make is changing five things at once and then not knowing which one helped; the discipline is to form one hypothesis from your error analysis, make one change, and re-measure on the same fixed validation set. Whether the score moved or not, you learned something true.

1. Read the mistakes. Sample 100+ errors, tag each with a cause, count the tags, and sort by impact.
2. Slice the score. Break the metric out by class, source, and group; hunt for slices that are far below average.
3. Diagnose bias vs variance. Compare target -> train -> validation gaps, and read the learning curve.
4. Change one thing. Apply the fix the diagnosis points to: capacity for bias, data or regularization for variance.
5. Re-measure. Use the same validation set, log what you tried, and loop until the error budget is met, not until you're tired.
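One way to keep yourself honest about the "change one thing" rule is a tiny experiment log that refuses entries with more than one changed setting. The sketch below is purely illustrative: `ExperimentLog`, `changed_keys`, and the config keys are hypothetical, and it assumes each new config is compared against the last one that improved validation error.

```python
def changed_keys(old_cfg, new_cfg):
    """Return the sorted list of config keys whose values differ."""
    keys = set(old_cfg) | set(new_cfg)
    return sorted(k for k in keys if old_cfg.get(k) != new_cfg.get(k))


class ExperimentLog:
    """Log of single-change experiments measured on one fixed validation set."""

    def __init__(self, cfg, val_err):
        self.current_cfg = dict(cfg)   # last config that improved the score
        self.current_err = val_err
        self.entries = []

    def record(self, new_cfg, val_err):
        diff = changed_keys(self.current_cfg, new_cfg)
        if len(diff) != 1:
            raise ValueError(f"expected exactly one change, got {diff}")
        delta = val_err - self.current_err   # negative means error went down
        kept = delta < 0
        self.entries.append({"change": diff[0], "delta": delta, "kept": kept})
        if kept:                             # adopt only changes that helped
            self.current_cfg, self.current_err = dict(new_cfg), val_err
        return diff[0], delta


# Hypothetical run: validation error in percentage points.
log = ExperimentLog({"weight_decay": 0.0, "layers": 4}, val_err=12.0)
print(log.record({"weight_decay": 0.01, "layers": 4}, val_err=10.5))
print(log.record({"weight_decay": 0.01, "layers": 2}, val_err=11.0))
```

Changes that did not help stay in `entries`, so the log also records what was tried and rejected.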

Two honest cautions to close on. First, when you cannot understand why the model failed a case, lean on the interpretability tools from a later rung — a feature-importance readout or a saliency map can reveal that the model is keying on the watermark, not the tumor. Second, beware of debugging *against* your validation set so many times that you quietly start overfitting to it; that is why the final unbiased word always belongs to a test set you have touched only once. Error analysis makes models better, but it never makes them omniscient — a debugged model is a model whose remaining failures you finally *understand*, not one that has none.