A model is not a finished product
By this point on the ladder you can train a strong model, evaluate it honestly, serve it at low latency, and even shrink it to fit a phone. So you ship it, the dashboards go green, and you move on. That feeling — "done" — is the single most expensive mistake in production ML. A trained model is frozen: it captured the statistical shape of the world on the day its training data was collected. But the world keeps moving, and a frozen model in a moving world slowly drifts out of focus.
Here is the uncomfortable part: this decay is usually invisible. The model never crashes. It returns a confident answer for every request, with the same latency as on day one. There is no error in the logs, no red alert — just predictions that are quietly, increasingly wrong. A fraud detector trained on last year's scams keeps scoring this year's new scams as safe. This is why monitoring is not optional polish; it is the only thing standing between you and a system that is failing silently while every traditional health check stays green.
What to watch: four layers of signal
Good model monitoring watches the system in layers, from the cheapest and fastest signal to the truest and slowest. You want the early layers because they warn you in seconds; you want the deep layers because only they tell you whether the model is actually still right.
- Operational health — latency, error rate, throughput, memory. The plumbing. This is ordinary software monitoring and it catches outages, not drift, but it must be there first.
- Input distribution — the statistics of incoming features: their means, ranges, and missing-value rates. A column that was 5% null and is now 60% null is often a broken upstream pipeline, not the world changing.
- Prediction distribution — the shape of the model's own outputs and its confidence. If a loan model that used to approve 30% of applicants suddenly approves 70%, something shifted even before you know any outcomes.
- Live quality — the real metric (accuracy, precision/recall, error) measured against ground-truth labels as they arrive. This is the only layer that proves the model still works — and it is the hardest and slowest to get.
The painful gap is between layers 3 and 4. Inputs and predictions are available the instant a request arrives, but the true label often comes much later — sometimes never. You learn whether a loan defaulted only after months; you may never learn the "right" answer for a content recommendation. So monitoring is partly the art of using the fast, cheap signals (layers 2 and 3) as a proxy for the slow, true one (layer 4) — while never forgetting they are only a proxy.
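As a minimal sketch of the fast layers, the snippet below computes one input statistic (a null rate) and one prediction statistic (an approval rate) and compares each with a stored baseline. The function names, thresholds, and sample values are illustrative assumptions, not part of any particular monitoring library; the baselines simply echo the 5% null rate and 30% approval rate used as examples above.

```python
from typing import Optional, Sequence


def null_rate(values: Sequence[Optional[float]]) -> float:
    """Fraction of missing (None) values in one feature column."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def positive_rate(scores: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of predictions at or above the decision threshold."""
    if not scores:
        return 0.0
    return sum(s >= threshold for s in scores) / len(scores)


def shifted(baseline: float, live: float, max_abs_change: float) -> bool:
    """True when a live rate moved more than the allowed absolute amount."""
    return abs(live - baseline) > max_abs_change


# Hypothetical live window; the numbers are made up for illustration.
income_live = [None, 52000.0, None, None, 61000.0]
scores_live = [0.9, 0.8, 0.7, 0.2, 0.95, 0.6, 0.1, 0.85, 0.75, 0.3]

alerts = {
    # Layer 2: input distribution (baseline null rate assumed to be 5%).
    "income_null_rate": shifted(0.05, null_rate(income_live), max_abs_change=0.10),
    # Layer 3: prediction distribution (baseline approval rate assumed to be 30%).
    "approval_rate": shifted(0.30, positive_rate(scores_live), max_abs_change=0.15),
}
print(alerts)
```

Both checks run the moment requests arrive and need no labels, which is exactly why they can only hint at a problem rather than prove one.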
Data drift vs concept drift
"The world changed" is too vague to act on. Drift comes in two genuinely different flavours, and they call for different responses. Data drift (also called covariate shift, a kind of distribution shift) means the inputs changed: you are now seeing kinds of examples that were rare or absent in training. The relationship between input and answer is unchanged — you are just being asked questions outside the region your model learned well.
Concept drift is deeper and nastier: the relationship itself changed. The same input now maps to a different correct answer. Picture a spam filter. If spammers simply start writing in a new language, that is data drift — new inputs, same rule ("unsolicited bulk mail is spam"). But if the very definition of what users consider spam shifts — say, marketing emails they once tolerated now get reported — that is concept drift. The ground truth moved, so a model that perfectly memorised the old rule is now confidently applying an outdated one.
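Data drift, being a change in the inputs alone, can be measured without any labels. One common way to do so is the Population Stability Index (PSI), sketched below for a single numeric feature. This is one technique among several, chosen here as an assumption; the bin count and the small floor for empty bins are illustrative choices. A frequently quoted rule of thumb treats values above roughly 0.25 as a significant shift, but that cut-off should be tuned per feature. Note that PSI says nothing about concept drift: the inputs can look perfectly stable while the correct answers change.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index between a training sample and a live sample."""
    lo, hi = min(expected), max(expected)
    # Equal-width bins over the training range; fall back to 1.0 if the range is zero.
    width = (hi - lo) / bins or 1.0

    def fractions(sample: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for x in sample:
            idx = int((x - lo) / width)
            # Clamp live values outside the training range into the edge bins.
            idx = min(max(idx, 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) and division by zero for empty bins.
        return [max(c / len(sample), 1e-6) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical feature values: the live sample is the training sample shifted upward.
train = [float(x) for x in range(100)]
live = [x + 50.0 for x in train]

print(psi(train, train))  # identical samples give no drift signal
print(psi(train, live))   # a shifted sample gives a large score
```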
The distinction matters because the cure differs. Pure data drift leaves the underlying rule intact, so it can often be fixed by widening coverage: collect and label examples from the new region and retrain on them. Concept drift almost always demands fresh labels reflecting the new reality, because no amount of old data teaches the new rule. And both must be told apart from a third culprit that looks identical on a dashboard: a plain data-quality bug, such as a renamed column, a unit change from dollars to cents, or a recalibrated sensor. Always rule out the broken pipe before you blame the changing world.
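A rough sketch of ruling out the broken pipe first: the check below compares a live batch with a reference batch, flagging renamed or missing columns and any column whose median moved by a suspiciously large factor, which is the signature of a unit change such as dollars to cents. The function name, the column names, and the factor-of-ten threshold are hypothetical choices for illustration.

```python
import statistics
from typing import Dict, List, Sequence


def quality_issues(
    reference: Dict[str, Sequence[float]],
    live: Dict[str, Sequence[float]],
    max_scale_ratio: float = 10.0,
) -> List[str]:
    """List likely pipeline bugs: schema changes and large scale jumps."""
    issues = []
    for name in sorted(set(reference) - set(live)):
        issues.append(f"missing column: {name}")
    for name in sorted(set(live) - set(reference)):
        issues.append(f"unexpected column: {name}")
    for name in sorted(set(reference) & set(live)):
        ref_med = statistics.median(reference[name])
        live_med = statistics.median(live[name])
        # Only positive medians are compared; other cases need a different check.
        if ref_med > 0 and live_med > 0:
            ratio = max(live_med / ref_med, ref_med / live_med)
            if ratio >= max_scale_ratio:
                issues.append(
                    f"possible unit change in {name}: median moved {ratio:.0f}x"
                )
    return issues


# Hypothetical batches: "age" was renamed and "amount" switched to cents.
reference = {"amount": [10.0, 20.0, 30.0], "age": [30.0, 40.0, 50.0]}
live = {"amount": [1000.0, 2000.0, 3000.0], "customer_age": [31.0, 42.0, 49.0]}

for issue in quality_issues(reference, live):
    print(issue)
```

Running a check like this before any drift analysis keeps a simple upstream bug from being misread as the world changing.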
When (and whether) to retrain
There is a seductive fantasy that the right answer is "just retrain constantly on the freshest data." Resist it. Retraining is not free and not safe: every new model is a new artifact that can regress, can learn a new spurious pattern, can break in production. Continuous, unattended retraining also opens a quiet door to feedback loops — your model influences user behaviour, that behaviour becomes tomorrow's training labels, and the model starts learning from its own past decisions until it spirals away from reality.
So the real question is not "how often" but "on what trigger." Teams generally pick one of three retraining policies, and mature teams blend them: scheduled (retrain every week or month — simple, predictable, but blind to sudden shifts), performance-triggered (retrain when a monitored metric crosses a threshold — efficient, but needs ground-truth labels to fire), and drift-triggered (retrain when input or prediction drift exceeds a bound — works without labels, but can fire on noise). The right choice depends on how fast your world moves and how quickly your labels arrive.
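The three policies can be blended in a few lines. The sketch below returns every reason that currently argues for retraining; the class name, field names, and default thresholds are assumptions made for illustration, and the drift score is assumed to come from whatever drift metric the team already computes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RetrainPolicy:
    max_age_days: int = 30         # scheduled trigger
    min_accuracy: float = 0.90     # performance trigger (needs labels)
    max_drift_score: float = 0.25  # drift trigger (works without labels)


def retrain_reasons(
    policy: RetrainPolicy,
    model_age_days: int,
    live_accuracy: Optional[float],
    drift_score: float,
) -> List[str]:
    """Return the triggers that fired; an empty list means keep the model."""
    reasons = []
    if model_age_days >= policy.max_age_days:
        reasons.append("scheduled")
    # live_accuracy is None while ground-truth labels have not arrived yet.
    if live_accuracy is not None and live_accuracy < policy.min_accuracy:
        reasons.append("performance")
    if drift_score > policy.max_drift_score:
        reasons.append("drift")
    return reasons


policy = RetrainPolicy()
print(retrain_reasons(policy, model_age_days=12, live_accuracy=None, drift_score=0.40))
print(retrain_reasons(policy, model_age_days=45, live_accuracy=0.85, drift_score=0.05))
```

Returning the reasons rather than a bare boolean lets a human see why a retrain was proposed, which helps when a drift trigger fires on noise.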
Retraining itself usually means re-running your training pipeline on a window of recent data. Two sub-choices matter: whether to retrain from scratch or warm-start from the current weights (incremental updates are cheaper, but repeated small updates can overfit to recent data and forget older patterns), and how wide a data window to use: recent enough to capture the new reality, long enough not to forget rare-but-important cases. Crucially, a retrained model is a candidate, not a replacement: it must beat the incumbent on a fresh held-out set before it earns the right to ship.
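The candidate-versus-incumbent rule can be expressed as a small promotion gate. In this sketch a model is just a callable from a feature to a label, accuracy stands in for whatever metric matters, and `min_gain` is a hypothetical margin that keeps a candidate from shipping on a statistical tie. All names and data are illustrative.

```python
from typing import Callable, Sequence, Tuple

Model = Callable[[float], int]
Example = Tuple[float, int]


def accuracy(model: Model, examples: Sequence[Example]) -> float:
    """Share of held-out examples the model labels correctly."""
    return sum(model(x) == y for x, y in examples) / len(examples)


def should_promote(
    candidate: Model,
    incumbent: Model,
    holdout: Sequence[Example],
    min_gain: float = 0.01,
) -> bool:
    """Promote only if the candidate beats the incumbent by at least min_gain."""
    return accuracy(candidate, holdout) >= accuracy(incumbent, holdout) + min_gain


# Hypothetical models: simple thresholds on a single feature.
def incumbent(x: float) -> int:
    return int(x > 0.5)


def candidate(x: float) -> int:
    return int(x > 0.3)


# Fresh held-out examples that neither model was trained on (made-up data).
holdout = [(0.2, 0), (0.4, 1), (0.6, 1), (0.8, 1)]

print(should_promote(candidate, incumbent, holdout))
```

The important property is that both models are scored on the same fresh data, so the comparison reflects today's world rather than the world either model was trained on.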
Closing the loop
Everything above only works if it forms a loop rather than a one-shot launch. A healthy production ML system runs a continuous cycle: monitor the live model, detect and diagnose drift, gather fresh labelled data, retrain a candidate, validate it offline, then roll it out carefully — and start monitoring again. This is the operational heart of the CI/CD-for-ML discipline introduced earlier in this rung. The loop is the product.
monitor ──▶ detect drift ──▶ collect & label
   ▲                                │
   │                                ▼
roll out ◀── validate offline ◀── retrain candidate
   │
   └─ shadow / canary / A-B (never a blind swap)
Notice the last step is never a blind swap. You roll a new model out in stages: shadow mode runs it silently alongside the old one so you can compare without risk; a canary sends it a small slice of real traffic; and an A/B test measures whether it genuinely beats the incumbent on the metric you care about. If it underperforms, you roll back instantly. This is exactly why the earlier guides insisted on a model registry and experiment tracking — you cannot safely roll back to, or audit, a model you cannot name and reproduce.
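A canary can be sketched as deterministic traffic splitting: hash a stable request or user identifier into a bucket and send only a small share of buckets to the new model. The function name, the identifier format, and the 5% fraction below are assumptions for illustration. Setting the fraction to 0.0 acts as an instant rollback, and a shadow deployment would instead call both models on every request while returning only the incumbent's answer.

```python
import hashlib


def route(request_id: str, canary_fraction: float) -> str:
    """Deterministically assign a request to the incumbent or the candidate."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the hash onto 10,000 buckets so the same id always lands in the same one.
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    cutoff = round(canary_fraction * 10_000)
    return "candidate" if bucket < cutoff else "incumbent"


# Hypothetical request ids; about 5% of them should reach the candidate.
ids = [f"req-{i}" for i in range(10_000)]
share = sum(route(r, 0.05) == "candidate" for r in ids) / len(ids)
print(f"candidate share: {share:.3f}")

# Rolling back is a configuration change, not a redeploy.
assert all(route(r, 0.0) == "incumbent" for r in ids)
```

Hashing the identifier, rather than picking at random per request, keeps each user on one model for the duration of the canary, which makes the later A/B comparison cleaner.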
One honest closing note. Not every problem is solved by closing this loop tighter. Sometimes the real fix is a human in the loop for the hard cases, a simpler and more robust baseline, or even retiring a model whose task has changed beyond what any retraining can chase. The mark of a mature ML practitioner is not building the most aggressive auto-retraining machine — it is knowing that a model lives in a changing world, watching it with clear eyes, and intervening with judgement rather than reflex.