Accuracy Isn’t Enough

The 99% Trap

Imagine you build a model to screen for a rare disease that affects 1 in 100 people. You train it, you test it, and it reports 99% accuracy. You might pop the champagne. But here is a model that ties it without learning anything at all: a single line of code that says "healthy" for everyone. Since 99 out of 100 people really are healthy, that lazy rule is also 99% accurate, and it misses every single sick patient. Accuracy only counts how often the prediction matches the truth, and on lopsided data that number can be loud and meaningless at the same time.
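To see the trap in numbers, here is a minimal sketch in plain Python. The population of 1,000 people and the name `lazy_model` are illustrative assumptions, not part of any real screener.

```python
# Hypothetical population: 1,000 people, 1 in 100 sick (1 = sick, 0 = healthy).
actual = [1] * 10 + [0] * 990


def lazy_model(_person):
    """Predict 'healthy' (0) for everyone, ignoring the input entirely."""
    return 0


predicted = [lazy_model(person) for person in actual]

# Accuracy: fraction of predictions that match the truth.
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
# Sick patients the model actually flagged.
caught = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))

print(f"accuracy: {accuracy:.0%}")        # accuracy: 99%
print(f"sick patients caught: {caught}")  # sick patients caught: 0
```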

This is the heart of the problem called class imbalance: when one outcome is far more common than the other, a single headline number hides the only mistakes you care about. The fix is not a better model first — it is a better question. Instead of "how often is it right?", we ask "right about what, and wrong in which direction?" Answering that needs a tool that pulls the single number apart.

The Confusion Matrix: Four Boxes

Every prediction in a two-class problem lands in one of four boxes, and the confusion matrix is simply the grid that counts them. There are two ways to be right and two ways to be wrong. A true positive is a sick patient your model correctly flags as sick. A true negative is a healthy person correctly cleared. The dangerous ones are the errors: a false positive is a healthy person wrongly flagged (a false alarm), and a false negative is a sick person wrongly cleared (a miss). Accuracy lumps all four together; the matrix keeps them apart.

                  Predicted
                Sick     Healthy
Actual Sick     TP=8     FN=2      <- 2 patients MISSED
       Healthy  FP=5     TN=985

(rare disease: only 10 sick out of 1000)

The four cells for our disease screener. Accuracy is high (993/1000, or 99.3%), yet the model misses 2 of the 10 real patients.
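The grid can be rebuilt by simple counting. The sketch below assumes labels encoded as 1 for sick and 0 for healthy, and the two label lists are constructed by hand to reproduce the figure's counts; `confusion_counts` is a hypothetical helper name, not a library function.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count the four confusion-matrix cells for a two-class problem."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if a == positive:
            # Really sick: either caught (TP) or missed (FN).
            counts["TP" if p == positive else "FN"] += 1
        else:
            # Really healthy: either falsely flagged (FP) or cleared (TN).
            counts["FP" if p == positive else "TN"] += 1
    return counts


# Labels arranged to match the figure: 10 sick people, then 990 healthy.
actual = [1] * 10 + [0] * 990
# 8 sick flagged, 2 sick missed, 5 healthy flagged, 985 healthy cleared.
predicted = [1] * 8 + [0] * 2 + [1] * 5 + [0] * 985

cells = confusion_counts(actual, predicted)
accuracy = (cells["TP"] + cells["TN"]) / sum(cells.values())

print(cells)     # {'TP': 8, 'FN': 2, 'FP': 5, 'TN': 985}
print(accuracy)  # 0.993
```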

Precision and Recall: Two Different Worries

From those four boxes come the two metrics that matter most. Precision asks: of everything the model flagged as positive, how many really were? It is true positives divided by all positive predictions, TP / (TP + FP): the answer to "when it raises the alarm, can I trust it?" Recall asks the complementary question: of everything that really was positive, how many did the model catch? It is true positives divided by all actual positives, TP / (TP + FN): the answer to "how much did it let slip through?" For the screener above, precision is 8/13 (about 62%) and recall is 8/10 (80%).
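As a quick check on the two definitions, this short sketch plugs in the counts from the disease-screener matrix. The variable names are illustrative.

```python
# Counts from the disease-screener confusion matrix.
tp, fn, fp, tn = 8, 2, 5, 985

precision = tp / (tp + fp)  # of the 13 alarms raised, 8 were real
recall = tp / (tp + fn)     # of the 10 real patients, 8 were caught

print(f"precision = {precision:.3f}")  # precision = 0.615
print(f"recall    = {recall:.3f}")     # recall    = 0.800
```

Note that the 985 true negatives appear in neither formula, which is why both metrics stay informative when the healthy class dominates.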

Which one you should care about depends entirely on what a mistake costs. For our disease screener, a missed patient (false negative) can be fatal, while a false alarm just means an extra test — so we want high recall, even at the price of some precision. Flip the stakes for a spam filter: wrongly trashing an important email (false positive) is worse than letting one junk message through, so there we lean on precision. There is no universally "good" metric; the right one is dictated by the job.
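One hedged way to make "what a mistake costs" concrete is to put a price on each error type and add it up. The cost figures and Screener B's counts below are invented purely for illustration; only Screener A's counts come from the matrix earlier in this lesson.

```python
def total_cost(fp, fn, cost_fp, cost_fn):
    """Total cost of errors, given a price per false positive and per false negative."""
    return fp * cost_fp + fn * cost_fn


# Hypothetical costs: a false alarm means one extra test, a miss is far worse.
COST_FALSE_ALARM = 1
COST_MISS = 100

# Screener A: counts from the lesson's matrix (FP=5, FN=2).
# Screener B: hypothetical and more trigger-happy, with more alarms but no misses.
screeners = {"A": {"fp": 5, "fn": 2}, "B": {"fp": 40, "fn": 0}}

for name, errs in screeners.items():
    cost = total_cost(errs["fp"], errs["fn"], COST_FALSE_ALARM, COST_MISS)
    print(name, cost)
# A 205
# B 40
```

Under these assumed costs, B wins despite raising eight times as many false alarms. Swap the two cost values, as the spam-filter case would suggest, and A comes out ahead instead.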

F1: One Number, Honestly Built

Sometimes you do want a single score to rank models or watch over time. The F1 score is the honest way to combine precision and recall: it is their harmonic mean, 2 × precision × recall / (precision + recall), not the ordinary average. The harmonic mean punishes a lopsided pair: if either precision or recall is near zero, F1 is dragged down near zero too. So you cannot game it by acing one and ignoring the other, which is exactly why it beats accuracy on imbalanced data.
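The difference between the two kinds of mean is easiest to see on a lopsided pair. In this sketch the pair (1.0, 0.01) is a made-up extreme case, and `f1_score` is a hand-written helper rather than a library call; returning 0 when both inputs are 0 is a convention assumed here.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; taken as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# A lopsided pair: perfect precision, almost no recall.
p, r = 1.0, 0.01
print(f"arithmetic mean = {(p + r) / 2:.3f}")     # arithmetic mean = 0.505
print(f"F1 (harmonic)   = {f1_score(p, r):.3f}")  # F1 (harmonic)   = 0.020

# The disease screener from this lesson: TP=8, FP=5, FN=2.
print(f"screener F1     = {f1_score(8 / 13, 8 / 10):.3f}")  # screener F1     = 0.696
```

The ordinary average would report a respectable 0.505 for a model that misses 99% of positives; the harmonic mean reports about 0.02.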

Run our lazy "everyone is healthy" model through this lens and it collapses instantly. It catches zero real patients, so recall is 0. It never raises an alarm, so precision is 0/0, which is undefined and conventionally scored as 0. Either way F1 is 0, even though accuracy was a glittering 99%. The metric finally tells the truth the headline number hid. But do not treat F1 as a magic verdict either: it weights precision and recall equally, which may not match your real costs, and it ignores true negatives entirely. It is a sharper tool, not the final word.
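The collapse can be checked directly. This sketch assumes 1,000 people with 10 sick, and treats any 0/0 ratio as 0, which is a common convention rather than a mathematical fact. It also uses the count form of F1, 2·TP / (2·TP + FP + FN), which is algebraically the same as the harmonic mean of precision and recall.

```python
def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is 0 (assumed convention)."""
    return num / den if den else 0.0


# The lazy model on 1,000 people with 10 sick: it flags no one.
tp, fp, fn, tn = 0, 0, 10, 990

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = safe_div(tp, tp + fp)        # 0/0: undefined, treated as 0 here
recall = safe_div(tp, tp + fn)           # 0/10
f1 = safe_div(2 * tp, 2 * tp + fp + fn)  # count form of F1

print(accuracy, precision, recall, f1)  # 0.99 0.0 0.0 0.0
```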

Choosing Metrics Without Fooling Yourself

The discipline of honest evaluation is a short loop you run before you trust any number. It starts with the costs of your two error types and ends with a metric chosen on purpose — not whichever one happens to look best.

1. Write down what a false positive costs and what a false negative costs, in real-world terms, not points.
2. Check the class balance. If one class is rare, cross accuracy off your list immediately.
3. Build the confusion matrix on your held-out test set and look at all four boxes before computing anything.
4. Pick the metric that matches the costs: recall when misses hurt most, precision when false alarms hurt most, F1 when you need one balanced number.
5. Compare against a dumb baseline (the "always predict the majority" model) so you know your score reflects real skill, not just the imbalance.
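The baseline comparison in the last step can be sketched in a few lines of plain Python. The labels below are built by hand to match the screener's confusion matrix from earlier in this lesson, `evaluate` is a hypothetical helper, and 0/0 ratios are treated as 0 by assumption. For brevity the majority class is read off the same labels being scored; in practice it would normally come from the training set.

```python
from collections import Counter


def evaluate(actual, predicted, positive=1):
    """Return accuracy, precision, recall, and F1 for a two-class problem."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = len(pairs) - tp - fp - fn

    def div(num, den):
        # 0/0 is undefined; scoring it as 0 is an assumed convention.
        return num / den if den else 0.0

    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": div(tp, tp + fp),
        "recall": div(tp, tp + fn),
        "f1": div(2 * tp, 2 * tp + fp + fn),
    }


# Held-out labels (1 = sick) and a model's predictions.
actual = [1] * 10 + [0] * 990
model_pred = [1] * 8 + [0] * 2 + [1] * 5 + [0] * 985

# Dumb baseline: always predict the most common class.
majority = Counter(actual).most_common(1)[0][0]
baseline_pred = [majority] * len(actual)

for name, pred in [("model", model_pred), ("baseline", baseline_pred)]:
    scores = evaluate(actual, pred)
    print(name, {k: round(v, 3) for k, v in scores.items()})
# model {'accuracy': 0.993, 'precision': 0.615, 'recall': 0.8, 'f1': 0.696}
# baseline {'accuracy': 0.99, 'precision': 0.0, 'recall': 0.0, 'f1': 0.0}
```

On accuracy the model beats the baseline by a mere 0.3 points; on recall and F1 the gap is the whole story.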

These metrics carry forward through the rest of this rung. When you compare entries on a leaderboard, the top score is often accuracy or a single F1 — and the lessons here are exactly why a leaderboard rank can flatter a model that quietly fails on the cases that matter. Honest measurement is not a formality before the real work; on imbalanced, high-stakes problems, it is the work.