Loss Functions: Measuring Wrongness

The number a model is trying to shrink

You already know that a model takes inputs and produces a prediction, and that during training it adjusts its parameters to do better. But "do better" is hopelessly vague to a machine. A computer cannot chase a feeling of improvement; it can only chase a number. The loss function is that number: a single scalar that scores how wrong the model's prediction is, for one example, right now. Lower is better; zero would mean a perfect hit.

Three words get thrown around almost interchangeably, and it helps to pin them down. The loss usually means the error on a single example. The cost is the average loss over a batch or the whole dataset — the thing we actually try to minimize. The objective is whatever final expression the optimizer works on, which is often the cost plus extra terms (like a penalty for complex models). In casual conversation people say all three to mean "the thing we make small," and that is fine — just know that the optimizer always lands on one concrete number to push downhill.
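A minimal sketch of the three terms side by side may help. The predictions, targets, weights, and penalty strength below are made-up values for illustration only, and the squared-error loss and L2 penalty are just one possible choice for each role.

```python
# Hypothetical numbers, chosen only to illustrate the vocabulary.
predictions = [2.5, 0.0, 2.0]
targets = [3.0, -0.5, 2.0]
weights = [0.5, -1.5]        # pretend model parameters
penalty_strength = 0.1       # assumed regularization strength

# Loss: the error on each single example (squared error here).
losses = [(p - t) ** 2 for p, t in zip(predictions, targets)]

# Cost: the average loss over the batch.
cost = sum(losses) / len(losses)

# Objective: the cost plus an extra term, here a penalty on large weights.
penalty = penalty_strength * sum(w ** 2 for w in weights)
objective = cost + penalty

print("losses:", losses)
print("cost:", cost)
print("objective:", objective)
```

Whatever the extra terms are, the optimizer only ever sees the final `objective` value.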

Regression: squaring the gap with MSE

When the model predicts a number — a house price, tomorrow's temperature, a person's age — we measure wrongness as the gap between the prediction and the true label. The classic choice is the Mean Squared Error (MSE): take each gap, square it, and average over all examples. Squaring does two jobs at once. It makes every error positive (so a prediction that is 3 too high doesn't cancel one that is 3 too low), and it punishes big misses far more harshly than small ones — an error of 10 costs a hundred, an error of 1 costs only one.
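As a small sketch of that definition, the function below computes MSE in plain Python. The name `mean_squared_error` and the sample numbers are illustrative choices, not taken from any particular library.

```python
def mean_squared_error(predictions, targets):
    """Average of the squared gaps between predictions and true labels."""
    gaps = [p - t for p, t in zip(predictions, targets)]
    return sum(g ** 2 for g in gaps) / len(gaps)

# One prediction is 3 too high and one is 3 too low.
# The raw gaps would cancel to zero; the squared gaps do not.
print(mean_squared_error([13.0, 7.0], [10.0, 10.0]))  # 9.0

# An error of 10 costs 100, an error of 1 costs 1.
print(mean_squared_error([10.0], [0.0]))  # 100.0
print(mean_squared_error([1.0], [0.0]))   # 1.0
```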

That harshness is a double-edged sword, and being honest about it matters. Because a single huge error dominates the average, MSE is very sensitive to outliers: one mislabeled or freak data point can drag the whole model toward it. When that worries you, Mean Absolute Error (which averages the unsquared distances) penalizes each error only in proportion to its size, so it is far less swayed by outliers. There is no universally best answer here. In the spirit of the no free lunch theorem, the right loss depends on the data and the cost of mistakes in your world.
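One way to see the difference is to ask which single constant guess each loss prefers. It is a standard fact that the mean minimizes MSE and the median minimizes MAE. The data below is invented for illustration, with one deliberate outlier.

```python
import statistics

def mse(guess, targets):
    return sum((guess - t) ** 2 for t in targets) / len(targets)

def mae(guess, targets):
    return sum(abs(guess - t) for t in targets) / len(targets)

# Hypothetical data: four ordinary values and one freak point.
targets = [9.0, 10.0, 10.0, 11.0, 100.0]

best_for_mse = statistics.mean(targets)    # dragged toward the outlier
best_for_mae = statistics.median(targets)  # stays with the bulk of the data

print("MSE prefers:", best_for_mse)  # 28.0
print("MAE prefers:", best_for_mae)  # 10.0
```

The single value of 100 pulls the MSE-preferred guess far away from where most of the data sits, while the MAE-preferred guess does not move.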

MSE has one more quiet virtue that makes it the default for linear regression and a thousand other models: it is smooth, and for a linear model it is bowl-shaped (convex) in the parameters, with a single lowest point. Its slope changes gently and predictably, which is exactly what the next guide's gradient descent needs to find its way downhill. A loss the optimizer can navigate easily is worth a great deal.
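To make "gentle and predictable" concrete, here is a small sketch of the slope of the squared error for a single example, taken with respect to the prediction. The function names and the target value are illustrative.

```python
def squared_error(prediction, target):
    return (prediction - target) ** 2

def slope(prediction, target):
    # Derivative of (prediction - target)^2 with respect to the prediction.
    return 2.0 * (prediction - target)

target = 5.0
for prediction in [0.0, 4.0, 5.0, 6.0, 10.0]:
    print(prediction, squared_error(prediction, target), slope(prediction, target))
```

The slope is negative below the target, positive above it, exactly zero at the target, and grows in proportion to the size of the miss, so it always indicates both which way to move and roughly how far.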

Classification: why we don't just count mistakes

Now suppose the model must pick a category: spam or not, cat or dog, which of ten digits. The honest scorecard you care about is accuracy — what fraction did it get right? But accuracy makes a terrible loss for learning. It is flat almost everywhere: nudge a parameter a tiny bit and the count of correct answers usually doesn't change at all, so there is no slope to tell the optimizer which way to step. We need a loss that responds smoothly to how confident the model was, not just whether it was ultimately right.
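The flatness is easy to demonstrate. The sketch below uses a hypothetical one-parameter classifier (a sigmoid applied to `w * x`) on four invented data points, and compares accuracy with a confidence-aware score: the negative log of the probability given to the true label.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical toy data: negative x means class 0, positive x means class 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

def accuracy(w):
    hits = sum(1 for x, y in zip(xs, ys) if (sigmoid(w * x) >= 0.5) == (y == 1))
    return hits / len(xs)

def confidence_loss(w):
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x)
        p_true = p if y == 1 else 1.0 - p  # probability given to the true label
        total += -math.log(p_true)
    return total / len(xs)

# Nudge the parameter slightly and watch which score reacts.
for w in [1.00, 1.01, 1.02]:
    print(w, accuracy(w), confidence_loss(w))
```

Accuracy stays the same across all three nudges, while the confidence-aware score changes with each one, which gives an optimizer something to follow.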

The fix is to have the model output probabilities (say, 0.9 that this email is spam), usually via a sigmoid for two classes or a softmax for many. Then we score it with cross-entropy, also called log loss. The idea is beautifully simple: look only at the probability the model assigned to the true answer, and the loss is the negative logarithm of that probability. Confidently correct (probability near 1) costs almost nothing. Confidently wrong (it gave the true class a probability near 0) costs a huge amount, because the negative log of a tiny number climbs toward infinity.

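import math

# cross-entropy for one example
model_probs = [0.05, 0.9, 0.05]         # illustrative model output, sums to 1
correct_class = 1                       # index of the true label
p_true = model_probs[correct_class]     # e.g. 0.9
loss = -math.log(p_true)                # 0.9 -> 0.105 (cheap)
                                        # 0.01 -> 4.6  (painful)
# average this over all examples to get the cost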

Cross-entropy looks only at the probability you gave the right answer, then takes its negative log — confident mistakes are punished steeply.
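The step from raw model scores to probabilities, and from per-example losses to a cost, can be sketched as follows. The scores and labels are invented, and the helper names are illustrative rather than any library's API.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtracting the max avoids overflow
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, correct_class):
    """Negative log of the probability assigned to the true class."""
    return -math.log(probs[correct_class])

# Hypothetical raw scores for two examples over three classes.
batch_scores = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
batch_labels = [0, 2]

losses = [cross_entropy(softmax(s), y) for s, y in zip(batch_scores, batch_labels)]
cost = sum(losses) / len(losses)

print("per-example losses:", losses)
print("cost:", cost)
```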

This is not an arbitrary formula. Minimizing cross-entropy is mathematically the same as maximum likelihood estimation, that is, making the observed data as probable as possible under the model. It is also tied to the KL divergence between the true labels and the model's guesses: the two differ only by the entropy of the labels, which does not depend on the model, so minimizing one minimizes the other. That deep link is why cross-entropy, not raw accuracy, trains nearly every classifier from logistic regression to giant language models.
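For readers who want the link spelled out, here is a short derivation sketch. The notation is an assumption of this sketch rather than the guide's own: p is the true label distribution (all of its mass on the correct class c), q is the model's predicted distribution, and terms with p_k = 0 are taken to contribute zero.

```latex
% Cross-entropy splits into label entropy plus KL divergence.
H(p, q) = -\sum_k p_k \log q_k
        = \underbrace{-\sum_k p_k \log p_k}_{H(p)}
        + \underbrace{\sum_k p_k \log \frac{p_k}{q_k}}_{D_{\mathrm{KL}}(p \,\|\, q)}

% With one-hot labels, H(p) = 0, so the two quantities coincide:
H(p, q) = D_{\mathrm{KL}}(p \,\|\, q) = -\log q_c

% Negative log-likelihood of N independent examples is the summed cross-entropy:
-\log \prod_{i=1}^{N} q^{(i)}_{c_i} = \sum_{i=1}^{N} -\log q^{(i)}_{c_i}
```

Dividing the last line by N gives the average cross-entropy, so maximizing the likelihood and minimizing the cross-entropy cost pick out the same parameters.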

From one example to a learning rule: empirical risk

A loss scores one example, but we want a model that does well across the board. In an ideal world we would minimize the model's average loss over the entire universe of possible inputs — its true risk. We can't: we never see that whole universe, only the dataset we collected. So we settle for the average loss over the data we actually have. That average is the empirical risk, and minimizing it is the whole game, a recipe with a grand name: empirical risk minimization.

1. Pick a loss function that matches the task (MSE for numbers, cross-entropy for categories).
2. Run the training data through the model and compute the loss for each example.
3. Average those losses to get the empirical risk, one number summarizing total wrongness.
4. Adjust the parameters to make that number smaller, then repeat.
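The recipe above can be sketched end to end on a toy problem. Everything here is an assumption made for illustration: the data, the one-parameter model `w * x`, the step size, and the update rule, which borrows the slope-following idea from gradient descent ahead of the next guide.

```python
# Hypothetical toy data where y is roughly 2 * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

def empirical_risk(w):
    # Second and third steps: per-example squared-error losses, then their average.
    losses = [(w * x - y) ** 2 for x, y in zip(xs, ys)]
    return sum(losses) / len(losses)

def risk_slope(w):
    # Derivative of the empirical risk with respect to w.
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.0            # starting parameter
step_size = 0.01   # assumed value, small enough for this toy problem
history = []

for _ in range(200):
    history.append(empirical_risk(w))
    w -= step_size * risk_slope(w)  # fourth step: adjust, then repeat

print("learned w:", w)
print("first and last risk:", history[0], history[-1])
```

The recorded `history` is exactly the falling loss curve that training dashboards plot.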

What the loss does and doesn't tell you

A falling loss curve feels like progress, and usually it is — but read it with a skeptic's eye. The loss is the quantity you optimize; it is rarely the quantity a human ultimately cares about. A spam filter might post a gorgeous cross-entropy while quietly letting through the one scam that matters, because the loss treated every email as equally important. When the classes are lopsided — a class imbalance like 99% legitimate mail — a model can score a low loss by ignoring the rare class entirely.
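A quick sketch shows how this can happen. The 1,000 emails and the "lazy" model below are hypothetical: the model gives every email the same 1% chance of being a scam, matching the 99% legitimate split described above.

```python
import math

# Hypothetical inbox: 10 scams (label 1) among 1,000 emails.
labels = [1] * 10 + [0] * 990

def mean_cross_entropy(scam_probs, labels):
    total = 0.0
    for p, y in zip(scam_probs, labels):
        p_true = p if y == 1 else 1.0 - p
        total += -math.log(p_true)
    return total / len(labels)

# A lazy model that never looks at the email at all.
lazy_probs = [0.01] * len(labels)
lazy_loss = mean_cross_entropy(lazy_probs, labels)

# For comparison, a coin-flip model that always says 50%.
coin_loss = mean_cross_entropy([0.5] * len(labels), labels)

# Recall: what fraction of the real scams would be flagged at a 0.5 threshold?
caught = sum(1 for p, y in zip(lazy_probs, labels) if y == 1 and p >= 0.5)
recall = caught / sum(labels)

print("lazy loss:", lazy_loss)
print("coin-flip loss:", coin_loss)
print("scam recall of lazy model:", recall)
```

The lazy model's loss is far below the coin-flip model's, yet it catches none of the scams.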

That gap is why loss and evaluation metrics live separate lives. You train against a smooth, differentiable loss because the optimizer needs slopes; you judge with the metric you truly care about — accuracy, precision, recall, dollars saved — even when it is bumpy and unoptimizable. A sensible workflow always sets a baseline first (what loss does a trivial guesser get?) so you can tell whether a fancy model is actually earning its keep or merely reciting the obvious.
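Setting a baseline takes only a few lines. The sketch below assumes invented house prices and model outputs for the regression case, and uses the ten-digit task mentioned earlier for the classification case. The trivial guessers chosen here (always predict the mean, or spread probability evenly over the classes) are common choices, not the only ones.

```python
import math
import statistics

def mse(predictions, targets):
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

# Hypothetical house prices (in thousands) and a hypothetical model's outputs.
targets = [200.0, 250.0, 300.0, 350.0]
model_predictions = [210.0, 240.0, 310.0, 330.0]

# Regression baseline: always guess the average price.
baseline_guess = statistics.mean(targets)
baseline_mse = mse([baseline_guess] * len(targets), targets)
model_mse = mse(model_predictions, targets)

print("baseline MSE:", baseline_mse)  # 3125.0
print("model MSE:", model_mse)        # 175.0

# Classification baseline: give each of 10 digits probability 1/10.
num_classes = 10
uniform_baseline = -math.log(1.0 / num_classes)
print("uniform-guess cross-entropy:", uniform_baseline)
```

A model whose loss is not clearly below these numbers has not yet learned anything a trivial guesser could not supply.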

So hold the loss in proper regard: it is the indispensable compass that turns vague "do better" into a concrete direction a machine can follow, and you have now met the two that power most of modern learning. But a compass is not the destination. With the loss chosen, the only question left is how to actually move the parameters to shrink it — which is exactly the journey down the slope that the next guide takes up.