Data Leakage & Other Traps

When a Great Score Is a Lie

By now you know the ritual from earlier in this rung: gather a dataset, split it, train a model, and read off a number. The trap is believing that number. A model can post a dazzling score on your test set and then fail miserably the day it meets real users — not because the math was wrong, but because the evaluation quietly lied to you. This guide is about the handful of subtle ways that lie creeps in, and how to catch it before it costs you.

Here is the unifying idea behind every trap in this guide: a model only learns what your data teaches it, and your test number only means something if the test honestly mimics the future. Break either of those — feed the model a hint it will not have in production, or test it on data that does not look like reality — and the score becomes theater. The model has not learned the task; it has learned your mistake.

Data Leakage: Information From the Future

The most dangerous trap is data leakage: when information that would not be available at prediction time sneaks into the features the model trains on. The model happily uses that hint, scores brilliantly in testing, and then falls apart in production — because in production, the hint is gone. Leakage is dangerous precisely because it looks like success. Nothing crashes; the metrics just quietly lie.

A classic example: you build a model to predict whether a patient has a disease, and one of your features is "was prescribed the disease's medication." Of course that predicts the disease almost perfectly — but only because the diagnosis already happened. The feature is a stand-in for the answer, not a clue you would have beforehand. Leakage often hides in such proxies: an account's "days since cancellation" when predicting churn, a row ID that happens to be sorted by outcome, or a timestamp that only exists after the event.
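
One cheap screen for proxies like these is to ask how well each feature predicts the label entirely on its own. The sketch below is plain Python over made-up patient rows; the column names and the `single_feature_accuracy` helper are hypothetical, and the 0.99 threshold is an arbitrary choice. Treat it as a smoke alarm, not a verdict: a near-perfect single feature is a reason to ask when that value becomes known, and a unique row ID will trip the same alarm.

```python
from collections import Counter, defaultdict

def single_feature_accuracy(rows, feature, label):
    """Accuracy of the lookup rule 'feature value -> its most common label'."""
    by_value = defaultdict(Counter)
    for row in rows:
        by_value[row[feature]][row[label]] += 1
    # For each feature value, the rule gets the majority-label rows right.
    correct = sum(counts.most_common(1)[0][1] for counts in by_value.values())
    return correct / len(rows)

# Made-up rows for illustration only.
patients = [
    {"age_band": "40-60", "on_medication": 1, "has_disease": 1},
    {"age_band": "40-60", "on_medication": 0, "has_disease": 0},
    {"age_band": "60+",   "on_medication": 1, "has_disease": 1},
    {"age_band": "60+",   "on_medication": 0, "has_disease": 0},
    {"age_band": "<40",   "on_medication": 0, "has_disease": 0},
    {"age_band": "<40",   "on_medication": 1, "has_disease": 1},
]

scores = {
    name: single_feature_accuracy(patients, name, "has_disease")
    for name in ("age_band", "on_medication")
}
# Flag anything that predicts the label almost perfectly by itself.
suspicious = [name for name, acc in scores.items() if acc >= 0.99]
```

Here `on_medication` is flagged and `age_band` is not. The flag only starts the conversation; the real test is still whether you would have the value at prediction time.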

There is a sneakier sibling called preprocessing leakage. Suppose you scale your features by subtracting the mean and dividing by the standard deviation — a step you met in this rung's cleaning guide. If you compute that mean and standard deviation over the whole dataset before splitting, the statistics quietly carry information about the test rows into training. The fix is a discipline: fit every transformation on the training data only, then apply those frozen numbers to the validation and test sets.

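import numpy as np

# Toy feature matrix standing in for all_data (made-up values)
all_data = np.random.default_rng(0).normal(size=(100, 3))

# WRONG: statistics see the whole dataset, leaking test info
mean, std = all_data.mean(axis=0), all_data.std(axis=0)
scaled = (all_data - mean) / std
leaky_train, leaky_test = scaled[:80], scaled[80:]

# RIGHT: split first (a plain 80/20 slice here), fit on train only
train, test = all_data[:80], all_data[80:]
mean, std = train.mean(axis=0), train.std(axis=0)   # learned from train alone
train = (train - mean) / std
test = (test - mean) / std                          # reuse, do not recompute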
Split first, then fit any transformation on the training set alone — and reuse those exact numbers everywhere else.

Contaminated Splits: When Train and Test Are Secretly the Same

The whole point of the train/validation/test split is to keep a sealed exam the model has never seen. Contamination is when that seal breaks and bits of the test set leak into training, so the "exam" is really an open-book test on memorized answers. The result is a tiny train-test gap that flatters you: the model looks like it generalizes, when really it is reciting.

The most common cause is duplicate or near-duplicate rows that land on both sides of the split — the same news article scraped twice, the same customer appearing in two transactions, several photos of one object from slightly different angles. If one copy goes into training and its twin into the test set, the model has effectively already seen the answer. De-duplicate before you split, and when records are grouped (all rows from one patient, one user, one document), split by group so an entire group stays on one side.
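
The two habits above, de-duplicating and then splitting by group, can be sketched in a few lines of standard-library Python. Everything here is illustrative: the `dedupe` and `group_side` helpers, the `patient` and `visit` fields, and the rows themselves are assumptions, and hashing the group ID is just one simple way to keep a group on a single side. It catches exact duplicates only; near-duplicates need a similarity check that this sketch does not attempt.

```python
import hashlib

def dedupe(rows, key):
    """Keep the first row for each distinct key (exact duplicates only)."""
    seen, unique = set(), []
    for row in rows:
        k = key(row)
        if k not in seen:
            seen.add(k)
            unique.append(row)
    return unique

def group_side(group_id, test_fraction=0.2):
    """Deterministically map a whole group to 'train' or 'test' via its hash."""
    digest = hashlib.sha256(str(group_id).encode("utf-8")).hexdigest()
    bucket = (int(digest, 16) % 1000) / 1000   # stable value in [0, 1)
    return "test" if bucket < test_fraction else "train"

# Made-up rows: several visits per patient, plus one exact duplicate.
rows = [
    {"patient": "p1", "visit": 1}, {"patient": "p1", "visit": 2},
    {"patient": "p2", "visit": 1}, {"patient": "p2", "visit": 1},
    {"patient": "p3", "visit": 1}, {"patient": "p4", "visit": 1},
]
rows = dedupe(rows, key=lambda r: (r["patient"], r["visit"]))

# Every row of a patient follows that patient's side, so no group straddles.
train = [r for r in rows if group_side(r["patient"]) == "train"]
test = [r for r in rows if group_side(r["patient"]) == "test"]
```

Because the side depends only on the group ID, the assignment stays the same when new rows arrive. The split fraction is only approximate, though, and on a dataset this tiny one side may well come out empty.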

Time series deserve special care. If your data has an order in time — prices, clicks, weather — never shuffle and split randomly, because that lets the model train on the future to predict the past. Always split chronologically: train on earlier data, test on later. The same warning applies to cross-validation; the convenient random-fold version silently breaks the moment your rows are grouped or time-ordered, so reach for group-aware or time-aware variants instead.
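
A minimal sketch of both ideas, assuming each row carries a sortable time field: a single chronological cut, and a time-aware stand-in for cross-validation in which every test fold comes strictly after its training rows. The function names, the `day` field, and the prices are made up for illustration.

```python
def chronological_split(rows, time_key, test_fraction=0.2):
    """Sort by time, then cut once: everything before the cut trains."""
    ordered = sorted(rows, key=time_key)
    cut = len(ordered) - int(len(ordered) * test_fraction)
    return ordered[:cut], ordered[cut:]

def expanding_window_folds(n_rows, n_folds):
    """Yield (train_indices, test_indices) over time-sorted rows.

    The training window grows with each fold; the test block always
    sits immediately after it, never before.
    """
    fold_size = n_rows // (n_folds + 1)
    for i in range(1, n_folds + 1):
        train_end = i * fold_size
        yield (list(range(train_end)),
               list(range(train_end, train_end + fold_size)))

# Made-up daily prices, deliberately out of order.
prices = [{"day": d, "price": 100 + d} for d in (5, 1, 9, 3, 7, 2, 8, 4, 10, 6)]
train, test = chronological_split(prices, time_key=lambda r: r["day"])

folds = list(expanding_window_folds(n_rows=10, n_folds=4))
```

With ten rows and four folds, the first fold trains on indices 0 and 1 and tests on 2 and 3; the last trains on 0 through 7 and tests on 8 and 9. No fold ever sees a row from later than its test block.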

Class Imbalance: When Accuracy Lies

Imagine a fraud detector where 99.5% of transactions are legitimate. A model that simply answers "not fraud" every single time scores 99.5% accuracy — and catches exactly zero fraud. This is class imbalance: when one class vastly outnumbers another, accuracy becomes a flattering, useless number, because the boring majority answer is almost always right.

The cure starts with better metrics. Instead of accuracy, look at how the model does on the rare class specifically: precision (of the cases it flagged, how many were truly fraud?) and recall (of all the real fraud, how much did it catch?). These two trade off against each other, and which matters more depends on the cost of each kind of mistake — a missed tumor is not the same as a false alarm. Compare against a trivial baseline too: if "always predict the majority" already scores 99.5%, your model has to clear a high bar to mean anything.
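
To make the arithmetic concrete, here is a small worked sketch using the same 99.5% ratio as above. The labels are made up: 1,000 transactions, 5 of them fraud, scored by the do-nothing baseline and by a hypothetical detector that catches 4 of the 5 at the price of 6 false alarms. Returning 0.0 for precision when nothing was flagged is a convention chosen here; strictly, precision is undefined in that case.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall(y_true, y_pred, positive=1):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0   # convention: nothing flagged
    recall = tp / (tp + fn) if tp + fn else 0.0      # convention: no positives
    return precision, recall

# 1 = fraud, 0 = legitimate. Five frauds first, then 995 legitimate rows.
y_true = [1] * 5 + [0] * 995

always_legit = [0] * 1000                       # the trivial baseline
detector = [1, 1, 1, 1, 0] + [1] * 6 + [0] * 989  # 4 caught, 1 missed, 6 false alarms

baseline_acc = accuracy(y_true, always_legit)                 # 0.995
baseline_pr = precision_recall(y_true, always_legit)          # (0.0, 0.0)
detector_acc = accuracy(y_true, detector)                     # 0.993
detector_pr = precision_recall(y_true, detector)              # (0.4, 0.8)
```

Notice that the detector scores lower accuracy than the baseline (99.3% against 99.5%) while catching 80% of the fraud the baseline misses entirely. Accuracy ranks the two models backwards; precision and recall do not.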

Beyond metrics, you can rebalance the data itself with resampling: duplicating or synthesizing more of the rare class, or trimming the majority. Used thoughtfully it helps the model pay attention to the minority. But be honest about its limits. Resampling cannot conjure information that is not there. Duplicated minority rows make it easier to overfit to a handful of examples, and because the rebalanced training set no longer matches real class frequencies, the model's predicted probabilities for the rare class tend to run too high. And you must resample only the training set, never the test set, which should keep the real-world proportions so your evaluation stays truthful.
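
The ordering matters more than the resampling method, so here is a minimal sketch of the simplest variant, random oversampling, applied after the split and to the training rows alone. The `oversample_minority` helper, the `fraud` field, and the data are assumptions for illustration, and the plain slice stands in for whatever split your data actually calls for.

```python
import random

def oversample_minority(rows, label_key, minority, seed=0):
    """Duplicate random minority rows until both classes are the same size."""
    rng = random.Random(seed)
    minority_rows = [r for r in rows if r[label_key] == minority]
    majority_rows = [r for r in rows if r[label_key] != minority]
    n_extra = len(majority_rows) - len(minority_rows)
    extra = [rng.choice(minority_rows) for _ in range(n_extra)]
    balanced = majority_rows + minority_rows + extra
    rng.shuffle(balanced)
    return balanced

# Made-up data: one fraud in every 20 rows.
rows = [{"id": i, "fraud": int(i % 20 == 0)} for i in range(200)]

# Split FIRST, then resample the training side only.
train, test = rows[:160], rows[160:]
train_balanced = oversample_minority(train, "fraud", minority=1)

train_fraud_share = sum(r["fraud"] for r in train_balanced) / len(train_balanced)
test_fraud_share = sum(r["fraud"] for r in test) / len(test)
```

The training set ends up half fraud, while the test set keeps its original 5%. Resampling before the split would put copies of the same fraud rows on both sides, which is the contamination problem all over again.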

Sampling Bias: When Your Data Isn't the World

Even a perfectly clean, leak-free split fails if the data was gathered from the wrong slice of reality. Sampling bias is when your training set systematically differs from the population you will actually deploy on — and the model faithfully learns that skew. A voice assistant trained mostly on one accent struggles with others; a skin-cancer detector trained on light skin underperforms on dark; a survey of people who answer phone calls misses everyone who does not. This is closely tied to dataset bias, and unlike leakage it can survive every check on your held-out test set, because the test set is biased in the very same way.

Sampling bias has a treacherous cousin: spurious correlation, a pattern that holds in your data by accident but not in general. In a famous case, an image classifier separated huskies from wolves with high accuracy — by detecting snow in the background, because the wolf photos happened to be snowy. The model took a shortcut: it found an easy signal that worked on this data and would shatter the moment a husky stood in the snow. Shortcuts are seductive precisely because they boost your test score while teaching the model the wrong thing.

The defense against these is partly statistical and partly human. Audit your data's composition against the real population; slice your metrics by subgroup instead of trusting one global number; and watch for distribution shift once the model is live, since the world drifts away from your snapshot over time. But no formula replaces asking the blunt question: who and what is missing from this data, and where might it be deployed that I never sampled?
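
Slicing metrics by subgroup takes very little code. The sketch below uses made-up predictions for a voice model, with 90 rows from one accent and 10 from another; the `accuracy_by_group` helper and the field names are hypothetical. It reports the row count alongside each accuracy, because a score computed on ten rows deserves far less trust than one computed on ninety.

```python
from collections import defaultdict

def accuracy_by_group(records, group_key):
    """Return {group: (accuracy, n_rows)} instead of one global number."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r[group_key]] += 1
        hits[r[group_key]] += int(r["label"] == r["pred"])
    return {g: (hits[g] / totals[g], totals[g]) for g in totals}

# Made-up predictions: accent A dominates the sample.
records = (
    [{"accent": "A", "label": 1, "pred": 1}] * 88
    + [{"accent": "A", "label": 1, "pred": 0}] * 2
    + [{"accent": "B", "label": 1, "pred": 1}] * 5
    + [{"accent": "B", "label": 1, "pred": 0}] * 5
)

overall = sum(r["label"] == r["pred"] for r in records) / len(records)
per_group = accuracy_by_group(records, "accent")
```

The global number is a comfortable 93%, yet accent B sits at 50%, no better than a coin flip. This only reveals gaps for groups that appear in the data at all; a group that was never sampled stays invisible no matter how you slice.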

A Practical Checklist

None of these traps require advanced math to avoid — they require discipline and suspicion. Run through this checklist whenever a result looks good, and especially when it looks great:

Split before you touch anything. Fit every transformation, imputation, and encoding on the training data alone, then apply the frozen numbers to validation and test.
Interrogate every feature: "Would I actually have this value at the moment of prediction?" If not, it is leakage — drop it.
De-duplicate, split by group, and split time series chronologically — never let a row or its twin appear on both sides.
For imbalanced data, ditch accuracy: report precision, recall, and the score of a trivial baseline. Resample the training set only.
Ask who is missing from the data, slice metrics by subgroup, and treat any too-good score as guilty until proven innocent.

This is the heart of what people now call data-centric AI: the realization that, beyond a point, the biggest gains come not from a fancier model but from cleaner, fairer, leak-free data and an evaluation you can trust. A modest model on honest data beats a brilliant model on a lie every time — because only one of them keeps working after it leaves your laptop.