Regression: Explaining One Thing With Others

A line through the mess

So far this rung has taught you to estimate a single number and to test a single claim. But the questions an actuary actually gets paid to answer almost never come one variable at a time. They sound like this: as a driver gets older, how does the cost of insuring them change? Does smoking really push up medical cost, and by how much? Regression is the tool for exactly these questions — the discipline of explaining one thing (the outcome) with one or more others (the predictors).

Picture a scatterplot. Each policyholder is a single dot: their age along the bottom, their annual cost up the side. The cloud of dots clearly drifts upward (older, costlier), but it is a cloud, not a curve. No single line passes through every dot. Simple linear regression makes a humble, powerful bet: that beneath the scatter there is a straight-line tendency, in which cost is roughly a starting amount plus a steady rate per year of age, and the rest is just the noise of individual lives. We write it cost = a + b·age + error, and the whole game is to choose the line, meaning the numbers a and b, that fits the cloud best.

How do we decide which line is best? For any candidate line, each dot sits a little above or below it. That vertical gap — what actually happened minus what the line predicted — is the line's error for that point. Least squares is the rule that picks the one line making the total of all those gaps, each squared first, as small as possible. We square the gaps for the same honest reasons we squared deviations when building the variance: a miss above and a miss below should both count as error rather than cancelling, and a large miss should hurt far more than a small one. The line that wins this contest is the regression line.
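To make the contest concrete, here is a minimal sketch in Python. The five (age, cost) pairs are invented for illustration, and the brute-force grid of candidate lines is a teaching device only; real software finds the winner with a formula rather than by trying lines one at a time.

```python
# Toy data, invented for illustration: policyholder ages and annual costs.
ages = [25, 35, 45, 55, 65]
costs = [2100, 2900, 3500, 4300, 4900]


def sum_squared_gaps(a, b):
    """Add up the squared vertical gaps between each dot and the line a + b*age."""
    return sum((cost - (a + b * age)) ** 2 for age, cost in zip(ages, costs))


# Every candidate line on a coarse grid: intercepts 0..1000 in steps of 10,
# slopes 0..150 in steps of 1. The grid itself is an arbitrary choice.
candidates = [(a, b) for a in range(0, 1001, 10) for b in range(0, 151)]

# The winner of the contest is the candidate with the smallest total.
best_a, best_b = min(candidates, key=lambda line: sum_squared_gaps(*line))

print("best line: cost =", best_a, "+", best_b, "* age")
print("its total squared gap:", sum_squared_gaps(best_a, best_b))
print("a nearby rival (400, 70):", sum_squared_gaps(400, 70))
```

On this toy data the winner is cost = 390 + 70·age with a total squared gap of 12,000. The plausible-looking rival 400 + 70·age scores 12,500, so it loses the contest even though the two lines are hard to tell apart by eye.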

Reading the slope and the intercept

Once least squares hands you the winning line, the two numbers it found tell a story you can say out loud. The slope b is the rate of change: how much the outcome moves for each one-unit step in the predictor. The intercept a is the value the line predicts when the predictor is zero — where the line crosses the cost axis. Suppose our fitted line for medical cost comes out as cost = 400 + 70·age. The slope of 70 says: each additional year of age is associated with about 70 more in annual cost. The intercept of 400 is the line's prediction at age zero.

There is a beautiful link back to the previous track here. The slope of the regression line is intimately tied to the covariance between predictor and outcome: it is the covariance of the two divided by the variance of the predictor. So regression is not a brand-new idea bolted on from outside — it is the co-movement you already met, reshaped into a usable prediction. A positive correlation tilts the line upward; a stronger correlation packs the cloud more tightly around it. Regression simply turns "these two tend to move together" into "give me an age and I will give you a predicted cost."
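The link to covariance can be sketched in a few lines of Python. The data are the same kind of invented toy figures, and the helper names are illustrative. The intercept formula used here (the fitted line passes through the point of means) is standard for least squares but is not stated in the text above.

```python
def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Sample covariance of two equal-length lists (divides by n - 1)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def variance(xs):
    """Variance is just the covariance of a variable with itself."""
    return covariance(xs, xs)


def fit_line(xs, ys):
    """Least-squares line: slope = cov(x, y) / var(x); line passes through the means."""
    slope = covariance(xs, ys) / variance(xs)
    intercept = mean(ys) - slope * mean(xs)
    return intercept, slope


def correlation(xs, ys):
    return covariance(xs, ys) / (variance(xs) * variance(ys)) ** 0.5


# Toy data, invented for illustration.
ages = [25, 35, 45, 55, 65]
costs = [2100, 2900, 3500, 4300, 4900]

intercept, slope = fit_line(ages, costs)
r = correlation(ages, costs)
# The same slope, written as correlation times the ratio of the two spreads.
slope_from_r = r * (variance(costs) / variance(ages)) ** 0.5

print("intercept:", intercept, "slope:", slope)
print("correlation:", round(r, 4), "slope rebuilt from it:", round(slope_from_r, 4))
```

The last two lines show the second half of the link: the slope is the correlation rescaled by how spread out cost is relative to age, which is why the sign of the correlation decides which way the line tilts.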

The residual: where the model confesses

We minimised the squared gaps to find the line — but those gaps do not vanish once the line is drawn. Each leftover gap, computed against the final fitted line, has a name: the residual. For every data point, residual = the value that actually happened minus the value the model predicted. A 52-year-old who cost 4,200 against a line predicting 400 + 70·52 = 4,040 has a residual of +160: the model under-predicted them by 160. Residuals are not failures to apologise for. They are the model's honest confession of everything it could not explain.

And that confession is where the real diagnostic power lives. If your line genuinely captured the pattern, the residuals left behind should look like featureless random noise — scattered evenly above and below zero, with no shape. So you plot the residuals and you go hunting for shape, because any shape is the model telling you what it missed. A residual cloud that curves like a smile means the true relationship was never a straight line. A residual fan that widens as predictions grow means the spread is not constant. A handful of residuals stranded far from the rest are outliers — perhaps a few catastrophic claims quietly dragging the whole line toward themselves.
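Here is a rough sketch of that hunt for shape, in Python. The first residual is the 52-year-old from the example above. The curved dataset is invented on purpose: cost rises with the square of age, so a straight line is the wrong model and the residuals should confess it.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Residual = what actually happened minus what the line predicted.
example_residual = 4200 - (400 + 70 * 52)
print("52-year-old residual:", example_residual)

# Invented data with a deliberately curved truth: cost = 1000 + (age - 20)^2.
ages = [20, 30, 40, 50, 60, 70, 80]
costs = [1000 + (age - 20) ** 2 for age in ages]

intercept, slope = fit_line(ages, costs)
residuals = [cost - (intercept + slope * age) for age, cost in zip(ages, costs)]

# A crude text "plot": the sign of each residual, in age order.
for age, r in zip(ages, residuals):
    marker = "+" if r > 0 else "-" if r < 0 else "0"
    print(age, round(r), marker)
```

The signs run positive at the youngest and oldest ages and negative in the middle. That is the smile: the residuals still sum to zero, as they do for any least-squares line with an intercept, but their pattern says the straight line was the wrong shape.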

When one predictor is not enough

Real outcomes rarely lean on a single cause. Medical cost rises with age, yes — but also with smoking, with region, with chronic conditions. Multiple regression lets every predictor act at once inside one equation: cost = a + b1·age + b2·smoker + b3·region + error. Least squares still does the work, now choosing several coefficients together so the squared residuals are as small as possible across all of them at the same time. The line becomes a tilted plane in many dimensions, but the idea is unchanged: find the surface that the data hug most closely.

This unlocks the question that makes the technique so prized. Each coefficient in multiple regression is the change in the outcome associated with a one-unit step in its predictor while all the others are held fixed. Read cost = 400 + 70·age + 1,200·smoker carefully: the 70 is now the extra cost that goes with a year of age after accounting for smoking status, and the 1,200 is the extra cost that goes with being a smoker after accounting for age. Without this, age and smoking would be tangled: if older people in your sample happened to smoke more, a single-variable model would blame age for damage smoking actually did. Multiple regression is the machine that untangles overlapping influences.
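The untangling can be watched in a small Python sketch. Everything here is an illustrative assumption: eight invented policyholders whose cost follows 400 + 70·age + 1,200·smoker exactly, with the older ones more likely to smoke, and a bare-bones solver for the normal equations in place of a statistics library.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def least_squares(rows, ys):
    """Fit y = b0 + b1*x1 + ... by solving the normal equations (X'X) b = X'y."""
    X = [[1.0] + list(row) for row in rows]
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(X, ys)) for i in range(k)]
    return solve(xtx, xty)


# Invented sample: smokers are concentrated among the older policyholders.
ages = [25, 30, 35, 40, 45, 50, 55, 60]
smoker = [0, 0, 0, 1, 0, 1, 1, 1]
costs = [400 + 70 * a + 1200 * s for a, s in zip(ages, smoker)]

age_only = least_squares([(a,) for a in ages], costs)
both = least_squares(list(zip(ages, smoker)), costs)

print("age alone:       slope on age =", round(age_only[1], 2))
print("age and smoking: slope on age =", round(both[1], 2),
      " smoker =", round(both[2], 2))
```

With age alone, the slope comes out near 110 rather than 70: the single-variable model has charged age for part of what smoking did. With both predictors in the equation, the fit recovers 70 per year of age and 1,200 for smoking.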

Two honest cautions sharpen as predictors multiply. First, when two predictors are themselves strongly correlated, a problem called multicollinearity, the model struggles to credit the effect to one or the other, and their individual coefficients turn unstable and hard to read, even when the overall predictions stay fine. Second, piling in more variables never lowers the in-sample fit and almost always nudges it upward, which tempts overfitting: a model so eager that it starts memorising the random quirks of this dataset rather than the real signal. That is why honest measures like adjusted R-squared, AIC, and BIC reward a model for fitting well only after charging a penalty for every extra knob it adds.
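The penalties can be written down directly. The sketch below uses the standard textbook formulas for the three measures; the R-squared values, log-likelihoods, and sample size fed into them are hypothetical numbers chosen only to show the penalty at work.

```python
import math


def adjusted_r_squared(r_squared, n, predictors):
    """R-squared shrunk by a charge for each predictor (intercept not counted)."""
    return 1 - (1 - r_squared) * (n - 1) / (n - predictors - 1)


def aic(log_likelihood, parameters):
    """Akaike information criterion: lower is better."""
    return 2 * parameters - 2 * log_likelihood


def bic(log_likelihood, parameters, n):
    """Bayesian information criterion: a penalty per parameter that grows with n."""
    return parameters * math.log(n) - 2 * log_likelihood


n = 50  # hypothetical sample size

# Hypothetical fits: four extra predictors lift raw R-squared from 0.800 to 0.805.
small = adjusted_r_squared(0.800, n, predictors=2)
large = adjusted_r_squared(0.805, n, predictors=6)
print("adjusted R-squared:", round(small, 4), "vs", round(large, 4))

# Hypothetical log-likelihoods: the bigger model fits a shade better.
print("AIC:", aic(-120.0, 3), "vs", aic(-119.5, 7))
print("BIC:", round(bic(-120.0, 3, n), 2), "vs", round(bic(-119.5, 7, n), 2))
```

In each comparison the larger model has the better raw fit and the worse penalised score, which is the behaviour these measures exist to provide.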

From the straight line to the actuary's workhorse

Ordinary linear regression makes assumptions that insurance data cheerfully violates. It expects an outcome that can take any value, positive or negative, scattered around a straight line with constant spread. Yet claim counts are non-negative whole numbers, claim sizes are positive and badly right-skewed, and a probability must stay between 0 and 1. The generalized linear model, or GLM, is the elegant repair. It keeps the familiar machinery of adding up weighted predictors, but lets the outcome follow a better-suited distribution (Poisson for counts, gamma for skewed amounts, binomial for yes-or-no outcomes) and connects predictors to outcome through a link function, in pricing work most often a log link, so the factors multiply rather than add.

This is exactly how a modern insurer prices personal cover. A Poisson GLM models how often you claim; a gamma GLM models how much each claim costs; multiply the expected frequency by the expected severity and you have rebuilt the pure premium from the foundations track. Rating variables (age, vehicle, territory, prior claims) each enter as a multiplicative relativity that scales a base rate up or down. The coefficients are no longer fitted by least squares but by maximum likelihood, the estimation principle from earlier in this rung: choose the parameters that make the data you actually observed as probable as possible. One reason GLMs sit well with regulators is that every coefficient has a plain, defensible meaning.
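Maximum likelihood can be seen in miniature with the simplest possible frequency model: a Poisson GLM with no rating variables at all, only a base claim rate. The ten claim counts are invented, and the grid search is a teaching stand-in for the iterative fitting that GLM software actually performs.

```python
import math

# Invented claim counts for ten policyholders over one year.
claim_counts = [0, 1, 0, 2, 0, 0, 1, 3, 0, 1]


def poisson_log_likelihood(rate, counts):
    """Log of the probability of seeing these counts if the claim rate were `rate`."""
    return sum(k * math.log(rate) - rate - math.lgamma(k + 1) for k in counts)


# Try every rate from 0.01 to 3.00 and keep the one that makes the data most probable.
candidate_rates = [i / 100 for i in range(1, 301)]
best_rate = max(candidate_rates,
                key=lambda rate: poisson_log_likelihood(rate, claim_counts))

sample_mean = sum(claim_counts) / len(claim_counts)
# Under a log link, the fitted intercept is the log of that rate.
log_scale_coefficient = math.log(best_rate)

print("most likely rate:", best_rate, " sample mean:", sample_mean)
print("intercept on the log scale:", round(log_scale_coefficient, 4))
```

The most likely rate lands on the sample mean of 0.8 claims per year, which is the known maximum-likelihood answer for a Poisson rate. Adding rating variables changes the arithmetic but not the principle.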

Fitted multiple regression:  cost = 400 + 70*age + 1200*smoker

 A 52-year-old smoker:
   predicted = 400 + 70*52 + 1200*1 = 5240
   if actual cost was 5500 -> residual = 5500 - 5240 = +260

GLM (auto pricing), log link makes factors MULTIPLY:
   premium = base * f(age) * f(territory) * f(vehicle)
   e.g.    = 300  * 1.40   * 0.90        * 1.15  = 434.70 (about 435)

Top: a fitted multiple-regression prediction and the residual it leaves behind. Bottom: how a log-link GLM turns each rating variable into a relativity that multiplies a base rate — the everyday shape of insurance pricing.
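Why a log link makes factors multiply can be checked in a few lines of Python. The base rate and relativities are the ones in the figure; the frequency and severity at the end are hypothetical numbers added only to show the pure-premium product.

```python
import math

base_rate = 300.0
relativities = {"age": 1.40, "territory": 0.90, "vehicle": 1.15}

# On the model's own scale (the log scale) the coefficients simply ADD.
log_coefficients = {name: math.log(value) for name, value in relativities.items()}
linear_predictor = math.log(base_rate) + sum(log_coefficients.values())

# Undoing the log link with exp() turns that sum into a product.
premium_from_log_scale = math.exp(linear_predictor)

premium_from_product = base_rate
for value in relativities.values():
    premium_from_product *= value

print(round(premium_from_log_scale, 2), round(premium_from_product, 2))

# Hypothetical frequency and severity, to show how the two GLMs combine.
expected_claims_per_year = 0.08
expected_cost_per_claim = 2500.0
pure_premium = expected_claims_per_year * expected_cost_per_claim
print("pure premium:", round(pure_premium, 2))
```

Both routes give about 434.70, in line with the bottom half of the figure. Adding on the log scale and multiplying on the premium scale are the same calculation seen from two sides.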

The warning the field repeats most

Now the discipline's favourite warning, made concrete: correlation is not causation. A regression coefficient measures association, the fact that two things move together; it is never proof that one causes the other. Ice-cream sales and drowning deaths rise together across the year, with a tidy positive slope, but neither causes the other; summer heat drives both. Add heat to the model as a second predictor and the spurious link between cones and tragedy collapses. A regression cannot smell a lurking third variable; it will faithfully fit whatever co-movement you feed it and report it with a confident-looking coefficient.
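The collapse can be staged with a small Python sketch. The eight data points are invented and deliberately constructed so that, once heat is accounted for, the leftover wiggles in ice-cream sales and in drownings have nothing to do with each other. Instead of a full multiple regression, the sketch uses partialling out: strip heat's contribution from both series, then regress what is left. This gives the same ice-cream coefficient that a regression on ice cream and heat together would (the Frisch-Waugh-Lovell result).

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def residuals(xs, ys):
    """What is left of ys after removing its straight-line dependence on xs."""
    intercept, slope = fit_line(xs, ys)
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Invented data: both series are driven by heat plus unrelated quirks.
heat = [10, 13, 16, 19, 22, 25, 28, 31]
ice_cream = [330, 330, 390, 510, 570, 570, 630, 750]
drownings = [6, 6, 10, 10, 12, 16, 16, 20]

# Naive model: drownings explained by ice cream alone.
_, raw_slope = fit_line(ice_cream, drownings)

# Account for heat: remove it from both series, then regress the leftovers.
_, partial_slope = fit_line(residuals(heat, ice_cream), residuals(heat, drownings))

print("slope ignoring heat:      ", round(raw_slope, 4))
print("slope after heat accounted:", round(partial_slope, 4))
```

Ignoring heat, the slope is clearly positive. With heat accounted for, it is zero to within rounding. Real data will rarely be this clean, but the direction of the lesson is the same.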

Its twin is just as important: a good fit is not a true model. A model that hugs your historical data beautifully may have learned the noise rather than the world, and the model that fits the past best is often not the one that predicts the future best. That is the lesson of overfitting. The only honest verdict comes from data the model has never seen: a hold-out sample or cross-validation, fitting on part and measuring error on the rest. Every model also quietly bakes in its assumptions (the right distribution, the right variables, a relationship that is genuinely linear on its chosen scale), and it captures an interaction between predictors only if you put that interaction in by hand.
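A hold-out test takes only a few lines. In this sketch the training and hold-out samples are invented, and the "memoriser" is a hypothetical stand-in for an overfitted model: it simply repeats the cost of the nearest age it saw in training.

```python
def fit_line(data):
    """Least-squares line through (age, cost) pairs; returns a prediction function."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda age: intercept + slope * age


def memoriser(data):
    """Overfitted stand-in: predict the cost of the nearest training age."""
    return lambda age: min(data, key=lambda pair: abs(pair[0] - age))[1]


def mean_squared_error(predict, data):
    return sum((cost - predict(age)) ** 2 for age, cost in data) / len(data)


# Invented samples: fit on `train`, judge on `holdout`, which the models never see.
train = [(20, 2100), (30, 2200), (40, 3500), (50, 3600), (60, 4900)]
holdout = [(24, 1780), (34, 3080), (44, 3180), (54, 4480)]

for name, build in [("straight line", fit_line), ("memoriser", memoriser)]:
    model = build(train)
    print(name,
          "| in-sample error:", mean_squared_error(model, train),
          "| hold-out error:", mean_squared_error(model, holdout))
```

The memoriser has a perfect in-sample error of zero and by far the worse hold-out error, while the straight line scores about the same on both. Judged on the past alone, you would have picked the wrong model.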

None of this makes regression less powerful; it makes a regression user trustworthy. The same caution carries straight into machine learning, where forests and boosted trees and neural networks predict more sharply yet still only find correlation, never causation, and can quietly learn to use a postal code as a proxy for something an insurer must never price on. The actuary's job is shifting from hand-building every model to validating, governing, and standing behind it. Whatever the tool, the discipline is the same one this whole rung has been teaching: estimate honestly, test honestly, and never mistake a number that fits for a truth about the world.