Why ordinary regression breaks on insurance data
In the previous guide you met linear regression and its big sister multiple regression: draw the best straight line (or flat plane) through a cloud of points, and read off how each driver nudges the outcome. It is a beautiful tool — but it carries hidden assumptions in its luggage. Ordinary regression quietly believes the outcome can be any number, that the scatter around the line is the same size everywhere, and that the errors pile up in a symmetric bell curve. For a great deal of the world, that is close enough. For insurance data, it is wrong on all three counts.
Think about what an actuary actually models. The number of claims a policy files in a year is a count: 0, 1, 2 — never 1.7, and never negative. The size of a claim is a positive, skewed amount: most are modest, a few are enormous, and none can be below zero. A straight-line model fit to these will cheerfully predict negative claim counts for safe drivers and assume the variability of a $200 windscreen claim is the same as that of a $2,000,000 fire — both nonsense. The data is not misbehaving; the tool is simply the wrong shape.
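To see the problem concretely, here is a minimal sketch that fits an ordinary least-squares line to a handful of claim counts. The data are made up for illustration, and the names (`years`, `claims`, `fit_line`, `predict`) are hypothetical, not taken from any real portfolio or library.

```python
# Hypothetical toy data: years of driving experience vs. claims filed in a year.
years = [1, 2, 3, 4, 5, 6, 7, 8]
claims = [3, 2, 2, 1, 1, 0, 0, 0]


def fit_line(xs, ys):
    """Ordinary least squares for one driver: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(years, claims)


def predict(x):
    """Straight-line prediction; nothing stops it from going below zero."""
    return intercept + slope * x


for x in (1, 8, 10):
    print(f"{x} years of experience -> predicted claims {predict(x):.2f}")
```

For the most experienced drivers in this toy set the fitted line dips below zero, which is exactly the kind of impossible claim count described above.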
This is exactly the frequency–severity split you have seen before, now seen through a statistician's eyes. Frequency (how often) looks like a Poisson count; severity (how big) looks like a long-tailed, strictly-positive gamma or lognormal amount. Ordinary regression assumes a normal bell curve for both. We need a way to keep the elegant idea of regression — combining many drivers into one prediction — while swapping in a distribution that matches reality.
The two dials a GLM lets you turn
The generalized linear model, or GLM, is the answer, and it is a remarkably small twist on what you already know. A GLM keeps the familiar engine of regression — a weighted sum of your drivers, like rate = b0 + b1·age + b2·region — but adds two adjustable dials so the model can fit data that is not a bell curve. Almost the whole of modern non-life pricing runs on this one idea.
The first dial is the distribution (statisticians say the response family). Instead of forcing a normal bell curve, you tell the model what shape the outcome really has: choose Poisson for claim counts, gamma for claim sizes, binomial for yes/no events like whether a policy lapses. The model then judges how well it fits using that honest shape rather than pretending everything is symmetric scatter.
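One way to make "shape" concrete is to look at how each family lets the scatter grow with the expected value. The sketch below uses the standard textbook variance functions (constant for normal, proportional to the mean for Poisson, proportional to the mean squared for gamma) and ignores the overall scale constant. The helper name `relative_spread` is hypothetical, and the two claim sizes are borrowed from the windscreen and fire example purely for illustration.

```python
import math

# Variance as a function of the mean, up to a constant scale factor.
VARIANCE_FUNCTIONS = {
    "normal": lambda mean: 1.0,        # same scatter everywhere
    "poisson": lambda mean: mean,      # variance equals the expected count
    "gamma": lambda mean: mean ** 2,   # standard deviation proportional to the mean
}


def relative_spread(family, small_mean, large_mean):
    """How many times wider the scatter is at large_mean than at small_mean."""
    variance = VARIANCE_FUNCTIONS[family]
    return math.sqrt(variance(large_mean) / variance(small_mean))


windscreen, fire = 200.0, 2_000_000.0
spread_by_family = {
    family: relative_spread(family, windscreen, fire)
    for family in VARIANCE_FUNCTIONS
}

for family, ratio in spread_by_family.items():
    print(f"{family:8s} scatter at the fire claim is {ratio:,.0f}x the windscreen claim")
```

Under the normal family the two claims are assumed to be equally noisy. Under the gamma family the noise scales with the size of the claim, which is much closer to how severity data behaves.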
The second dial is the link function, and it is the cleverer of the two. The link decides how the weighted sum of drivers connects to the prediction. A plain regression links them directly (add the pieces, that's your answer — which can go negative). A GLM can instead use a log link, which says: add the pieces, then take e-to-that. Because e-to-anything is always positive, the prediction can never dip below zero — perfect for claim counts and costs. Even better, a log link turns adding into multiplying: each driver becomes a multiplicative factor on the base rate, which is exactly how insurance tariffs have always been built.
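Written out as a formula, the log link looks like this. The symbols x_1 and x_2 are placeholders for any two drivers (age and region, say), and the b's are the same weights as in the weighted sum above. This is a sketch of the general pattern, not a fitted model.

```latex
\begin{aligned}
\log(\text{rate}) &= b_0 + b_1 x_1 + b_2 x_2 \\
\text{rate} &= e^{\,b_0 + b_1 x_1 + b_2 x_2}
             = e^{b_0} \cdot e^{b_1 x_1} \cdot e^{b_2 x_2}
\end{aligned}
```

Here e^{b_0} plays the role of the base rate and each remaining term is a multiplicative factor. Every exponential is positive, so the product is positive too.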
Base rate                       500
Male, under-25               x 1.40
Urban territory              x 1.25
No prior claims              x 0.80
-------------------------------------
Premium = 500 x 1.40 x 1.25 x 0.80 = 700

(A log-link GLM learns those factors: log(rate) = log(500) + 0.336 + 0.223 - 0.223)
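The tariff arithmetic can be checked in a few lines of Python. This is a sketch that takes the base rate and factors from the example as given. The dictionary keys and variable names are hypothetical labels, and nothing here is fitted from data.

```python
import math

base_rate = 500.0
factors = {
    "male_under_25": 1.40,
    "urban_territory": 1.25,
    "no_prior_claims": 0.80,
}

# On the log scale each factor becomes an additive coefficient.
intercept = math.log(base_rate)
coefficients = {name: math.log(factor) for name, factor in factors.items()}

# Add the pieces, then take e-to-that (the log link run in reverse).
linear_predictor = intercept + sum(coefficients.values())
premium_from_glm = math.exp(linear_predictor)

# The traditional tariff: multiply the base rate by every factor.
premium_from_tariff = base_rate * math.prod(factors.values())

for name, coef in coefficients.items():
    print(f"{name:16s} coefficient {coef:+.3f}")
print(f"premium (log scale, then exp): {premium_from_glm:.2f}")
print(f"premium (multiplied factors):  {premium_from_tariff:.2f}")
```

Both routes land on the same premium, because adding logarithms and multiplying factors are the same calculation.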
How a GLM learns its numbers, honestly
How does the model pick its coefficients? Ordinary regression minimizes squared errors, which is the statistically best recipe only when the scatter is a normal bell curve of constant width. A GLM instead uses maximum likelihood, the method you met two guides ago. In plain words: among all possible sets of coefficients, choose the ones that make the data you actually observed the least surprising. Because you have already told the model which distribution to assume, the fit accounts for the fact that big claims are rare and counts cannot be negative.
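For the curious, here is a minimal sketch of maximum likelihood at work on a Poisson GLM with a log link and a single yes/no driver. It uses Newton's method, written out by hand for a two-coefficient model. The data are a made-up toy sample, and every name (`x`, `y`, `fit_poisson_glm`, and so on) is hypothetical. Production software uses more general and more careful versions of the same idea.

```python
import math

# Hypothetical toy data: x = 1 for urban policies, 0 otherwise; y = claims filed.
x = [0] * 8 + [1] * 8
y = [0, 0, 1, 0, 0, 1, 0, 0,
     1, 0, 2, 0, 1, 0, 0, 0]


def poisson_log_likelihood(b0, b1):
    """Poisson log-likelihood for log(mean) = b0 + b1*x (constant term dropped)."""
    return sum(yi * (b0 + b1 * xi) - math.exp(b0 + b1 * xi)
               for xi, yi in zip(x, y))


def fit_poisson_glm(iterations=25):
    """Climb the log-likelihood with Newton steps; returns (b0, b1)."""
    b0 = math.log(sum(y) / len(y))  # start from the overall average frequency
    b1 = 0.0
    for _ in range(iterations):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Gradient of the log-likelihood: observed minus expected claims.
        u0 = sum(yi - mi for yi, mi in zip(y, mu))
        u1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Curvature (information matrix) entries for the 2x2 system.
        i00 = sum(mu)
        i01 = sum(mi * xi for xi, mi in zip(x, mu))
        i11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = i00 * i11 - i01 * i01
        # Newton update: solve the 2x2 system with the explicit inverse.
        b0 += (i11 * u0 - i01 * u1) / det
        b1 += (-i01 * u0 + i00 * u1) / det
    return b0, b1


b0_hat, b1_hat = fit_poisson_glm()
print(f"base frequency exp(b0) = {math.exp(b0_hat):.3f}")
print(f"urban factor   exp(b1) = {math.exp(b1_hat):.3f}")
```

With one yes/no driver the answer can be checked by hand: the fitted base frequency is the average claim count of the non-urban group, and the urban factor is the ratio of the two group averages.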
The payoff is that a fitted GLM hands you genuine actuarial quantities, not just abstract slopes. Run a Poisson GLM on frequency and you get an expected claim count per policy; run a gamma GLM on severity and you get an expected claim size; multiply them and you have a pure premium for each cell of risk. This is the heart of how a modern personal-lines insurer prices millions of policies, each with its own blend of rating variables. The output is not a single average rate but a tailored price built from the drivers that genuinely move risk.
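The final multiplication step is simple enough to sketch directly. The frequencies and severities below are invented numbers standing in for the outputs of a fitted Poisson GLM and a fitted gamma GLM, and the cell labels are hypothetical.

```python
# Hypothetical model outputs per risk cell:
# expected claims per policy-year, and expected cost per claim.
cells = [
    {"cell": "young, urban", "frequency": 0.12, "severity": 3000.0},
    {"cell": "young, rural", "frequency": 0.09, "severity": 3200.0},
    {"cell": "mature, urban", "frequency": 0.06, "severity": 2500.0},
]

# Pure premium = expected frequency x expected severity, cell by cell.
pure_premium = {c["cell"]: c["frequency"] * c["severity"] for c in cells}

for cell, premium in pure_premium.items():
    print(f"{cell:14s} pure premium {premium:8.2f}")
```

Each cell gets its own expected cost, which is the starting point for a tailored price before expenses and margins are layered on.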
Machine learning: the honest promise and the limits
Beyond the GLM lies the wider world of predictive analytics and machine learning: gradient-boosted trees, random forests, neural networks. The honest case for them is real. They can sniff out interactions and curved patterns a human would never think to write down — that the effect of car power depends on the driver's age in a wiggly, non-multiplicative way, say — and they often predict pure premium more accurately than a hand-built GLM. On a leaderboard measured purely by predictive error, they frequently win.
But raw predictive accuracy is not the only thing an actuary is paid to deliver, and here the limits bite hard. Insurance is a regulated business. In many jurisdictions an insurer must file its rates, or at least be able to explain, factor by factor, why one customer pays more than another. A regulator will not accept "the neural network said so." The price must be justifiable, must not use prohibited or proxy-discriminatory variables, and must be stable enough that two near-identical customers are not quoted wildly different premiums. A black box that cannot answer "why this price?" is, in this setting, unusable however accurate it is.
There is a second, deeper trap: a flexible model will memorize the noise in your data if you let it. With enough trees or layers it can fit the past almost perfectly — including the random quirks that will never repeat — and then predict the future badly. This is overfitting, and the only honest defence is to test the model on data it has never seen. A model is not reality; it is a map fitted to one stretch of road, and the only question that matters is whether it still works on the road ahead.
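A hold-out test can be sketched in a few lines. The example below is deliberately extreme and entirely synthetic: the true expected cost is flat, so any pattern a model finds is noise. The "flexible" model simply memorizes the nearest training point, standing in for an overgrown tree or network. All names and numbers are hypothetical.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Synthetic data: the true expected cost is 100 for everyone; the rest is noise.
data = [(i, 100.0 + random.gauss(0.0, 20.0)) for i in range(200)]
train = data[0::2]  # the model is allowed to see these rows
test = data[1::2]   # these rows are held back for an honest check


def mean_model(train_rows):
    """Simple model: always predict the training average."""
    average = sum(y for _, y in train_rows) / len(train_rows)
    return lambda x: average


def memorizing_model(train_rows):
    """Flexible model: repeat the outcome of the nearest training row."""
    def predict(x):
        _, y = min(train_rows, key=lambda row: abs(row[0] - x))
        return y
    return predict


def mse(model, rows):
    """Mean squared prediction error of a model on a set of rows."""
    return sum((model(x) - y) ** 2 for x, y in rows) / len(rows)


simple = mean_model(train)
flexible = memorizing_model(train)

for name, model in (("simple", simple), ("flexible", flexible)):
    print(f"{name:8s} train error {mse(model, train):8.1f}   "
          f"test error {mse(model, test):8.1f}")
```

The memorizing model scores a perfect zero on the data it has seen, which says nothing about its value. Only the error on the held-back rows shows how it would fare on the road ahead.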
The actuary's responsibility behind the model
Whichever model you reach for, the deepest limit is the same one that has run through this whole rung: a model is only as honest as the data and the assumptions you feed it. Garbage in, garbage out is not a cliché here — it is a professional risk. If your historical data already encodes a human bias, a model will faithfully learn and amplify that bias while looking perfectly objective. Data quality and ethics are not a footnote to predictive analytics; they are the load-bearing wall.
So the modern actuary's job is not to be beaten by the algorithm but to govern it. That means knowing the GLM well enough to read what it is saying, knowing machine learning well enough to use it where it genuinely helps, and having the professional spine to say "we will not file this" when a model is accurate but unexplainable, unstable, or unfair. The signature at the bottom of a rate filing is a promise about more than predictive error.
That closes the Statistics & Data rung. You began by taking the model as given and reading uncertainty off it; you end able to learn the model from messy reality, test it honestly, and judge when a clever model has outrun what you can responsibly defend. The regression machinery you now hold, the GLM above all, is the bridge from textbook probability into the working world. Next the ladder turns to interest theory, where these same disciplined instincts get pointed at the time value of money.