Hypothesis Testing & Goodness of Fit

Putting a claim on trial

In the previous guides you learned to *estimate*: take messy data and produce a number, like fitting a Poisson mean to last year's claim counts. But estimation alone never tells you whether a *claim about the world* is believable. Suppose a pricing colleague insists, "our new safe-driver program cut the average claim frequency below 0.10 per policy." The data wiggle around; how do you decide whether that drop is real, or just luck? Hypothesis testing is the disciplined courtroom where such claims stand trial.

The trial has a deliberate asymmetry, just like a criminal court presuming innocence. We write down a null hypothesis — the boring, sceptical default, usually "nothing has changed" (the true frequency is still 0.10). Against it stands the alternative hypothesis, the interesting claim (frequency is now below 0.10). We do not try to *prove* the alternative directly. Instead we ask: *if the null were true*, how surprising is the data we actually saw? Only data that would be genuinely strange under the null earns the right to reject it.

The p-value, and what it is not

To measure "how surprising," we compute a test statistic from the data and then its p-value: the probability of seeing a result *at least as extreme* as ours, assuming the null is true. A small p-value means the observed data would be a rare fluke if nothing had changed, so the null looks shaky. What the p-value is *not* is the probability that the null is true; it only describes how the data behave when the null is assumed. Notice the central limit theorem doing quiet work here: for large samples it tells us, approximately, what the test statistic's distribution looks like under the null, which is the whole reference against which "extreme" is judged.
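The p-value for the safe-driver claim can be sketched in a few lines. This is a minimal illustration, not a production test: the portfolio size and claim count are hypothetical, and it assumes claims on each policy are independent Poisson counts, so that the portfolio total is Poisson with mean equal to policies times frequency. The helper `poisson_cdf` is written out by hand so the block needs only the standard library.

```python
import math

def poisson_cdf(k, mu):
    """P(X <= k) for X ~ Poisson(mu), summed term by term."""
    term = math.exp(-mu)       # P(X = 0)
    total = term
    for i in range(1, k + 1):
        term *= mu / i         # P(X = i) obtained from P(X = i - 1)
        total += term
    return total

# Hypothetical portfolio, for illustration only.
n_policies = 1000
observed_claims = 82
null_frequency = 0.10          # the null: frequency is still 0.10

# Under the null, the total claim count is Poisson with mean n * 0.10.
mu_null = n_policies * null_frequency

# The alternative is "frequency is below 0.10", so "at least as extreme"
# means seeing this few claims or fewer.
p_value = poisson_cdf(observed_claims, mu_null)
print(f"p-value = {p_value:.4f}")
```

The test is one-sided because the colleague's claim points in one direction only. A two-sided alternative ("frequency has changed") would count surprising results in both tails.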

We set a threshold called the significance level, written α, before looking at the data; 0.05 is a common choice. If the p-value falls below α we reject the null; otherwise we do not. That α is the chance of a Type I error: rejecting a true null, a false alarm. Its mirror image is a Type II error: failing to reject a false null, a missed signal. The two trade off. Shrink α to avoid false alarms and, for the same amount of data, you make the test less likely to notice a real change. The power of a test, its chance of catching a genuine effect, is one minus the Type II error probability.
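The trade-off between the two error types can be made concrete with a small sketch. It assumes the same Poisson set-up as the running example (a hypothetical 1000 policies, null frequency 0.10) and an assumed true frequency of 0.08; none of these figures come from real data. Because claim counts are whole numbers, the attained Type I rate lands at or just below α rather than exactly on it.

```python
import math

def poisson_cdf(k, mu):
    """P(X <= k) for X ~ Poisson(mu); returns 0 for k < 0."""
    if k < 0:
        return 0.0
    term = math.exp(-mu)
    total = term
    for i in range(1, k + 1):
        term *= mu / i
        total += term
    return total

def critical_count(mu_null, alpha):
    """Largest c with P(X <= c | null) <= alpha; we reject when claims <= c."""
    c = -1
    while poisson_cdf(c + 1, mu_null) <= alpha:
        c += 1
    return c

# Hypothetical setting: expected claims under the null and under the truth.
mu_null = 1000 * 0.10
mu_true = 1000 * 0.08

results = {}
for alpha in (0.05, 0.01):
    c = critical_count(mu_null, alpha)
    type_i = poisson_cdf(c, mu_null)   # chance of a false alarm
    power = poisson_cdf(c, mu_true)    # chance of catching the real drop
    results[alpha] = (c, type_i, power)
    print(f"alpha={alpha}: reject if claims <= {c}, "
          f"Type I = {type_i:.4f}, power = {power:.4f}, "
          f"Type II = {1 - power:.4f}")
```

Tightening α from 0.05 to 0.01 pushes the rejection threshold lower, which cuts false alarms and also lowers the power against the same true frequency.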

From testing a number to testing a whole shape

The trial above tested one number, a mean. But an actuary's deeper question is usually about *shape*: when fitting a loss distribution, you do not just want the right average — you want to know whether claim sizes really follow a Pareto, a lognormal, or something else entirely. Choosing wrongly quietly poisons every premium and reserve downstream, because the wrong shape mis-states exactly the rare large losses that matter most. So we need a test whose null hypothesis is an entire distribution: *the data came from this model*.

This family is called goodness-of-fit testing. The logic is identical to before — null, test statistic, p-value — but now the statistic measures the *gap between the data and a candidate distribution*. There is an honest subtlety worth flagging early: usually we first estimate the distribution's parameters from the very same data (via maximum likelihood, from an earlier guide). That makes the fit look better than it deserves, so the reference distributions must be adjusted — a detail the textbook tests handle by spending degrees of freedom or by simulating the critical values.

The chi-square test: counting in buckets

The chi-square goodness-of-fit test is the workhorse. The idea is homely: sort your data into a handful of buckets (say, claim sizes in ranges 0–1k, 1k–5k, 5k–20k, 20k+), count how many actually landed in each, and compare those *observed* counts with the *expected* counts your candidate distribution predicts. If the model is right, observed and expected should be close; big discrepancies are evidence against it.
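Where the expected counts come from can be sketched as follows: each bucket's probability is the candidate CDF at the upper edge minus the CDF at the lower edge, multiplied by the number of claims. The lognormal parameters and the sample size below are hypothetical, chosen only for illustration; they are not meant to reproduce any figures in this guide.

```python
import math

def lognormal_cdf(x, mu, sigma):
    """CDF of a lognormal with log-mean mu and log-standard-deviation sigma."""
    if x <= 0:
        return 0.0
    z = (math.log(x) - mu) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical fitted parameters and sample size (assumptions).
mu, sigma = 7.5, 1.6
n_claims = 100

# Bucket edges matching the ranges in the text; the last bucket is open-ended.
edges = [0.0, 1_000.0, 5_000.0, 20_000.0, math.inf]

expected = []
for lo, hi in zip(edges[:-1], edges[1:]):
    upper = 1.0 if math.isinf(hi) else lognormal_cdf(hi, mu, sigma)
    prob = upper - lognormal_cdf(lo, mu, sigma)   # P(lo < claim <= hi)
    expected.append(n_claims * prob)

for lo, hi, e in zip(edges[:-1], edges[1:], expected):
    print(f"{lo:>8.0f} to {hi:>8.0f}: expected {e:6.2f}")
```

Since the buckets cover the whole positive line, the expected counts add up to the sample size, which is a quick sanity check on any bucketing.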

chi-square = sum over buckets of (Observed - Expected)^2 / Expected

Bucket       Observed   Expected   (O-E)^2/E
0 - 1k          42         40        0.10
1k - 5k         28         33        0.76
5k - 20k        18         15        0.60
20k +           12         12        0.00
                                    -----
                          total =    1.46   -> small, fit looks fine

Each bucket contributes (Observed minus Expected) squared, divided by Expected; summing gives the chi-square statistic. A small total means observed and expected counts agree well.

Dividing by the expected count is the clever bit: it scales each gap by how big a deviation we should expect there by pure chance, so a busy bucket and a sparse one are judged fairly. The summed statistic is then compared against a chi-square reference distribution whose degrees of freedom are the number of buckets minus one, less one more for each parameter estimated from the data; a large value yields a small p-value and we reject the candidate model. Two honest cautions: the test needs each expected bucket count to be reasonably large (a common rule of thumb is at least 5), and the *choice of buckets* is yours. Slice the same data differently and the verdict can shift, which is precisely why you fix the buckets before peeking at the answer.
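The bucket table can be checked with a short sketch. The observed and expected counts are the ones in the table; the assumption that no parameters were estimated from the data (so the degrees of freedom are buckets minus one) is ours, since the text does not say how the expected counts were obtained. The helper `chi2_sf` is a hand-rolled upper-tail probability for integer degrees of freedom, included only to keep the block free of third-party libraries.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square with integer df.

    Starts from the closed form for df = 1 or df = 2, then steps up two
    degrees of freedom at a time with the incomplete-gamma recurrence.
    """
    half = x / 2.0
    k = 1 if df % 2 else 2
    q = math.erfc(math.sqrt(half)) if k == 1 else math.exp(-half)
    while k < df:
        q += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q

# Counts from the bucket table above.
observed = [42, 28, 18, 12]
expected = [40, 33, 15, 12]

# Sum over buckets of (Observed - Expected)^2 / Expected.
statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Assumption: the candidate model was fully specified in advance.
n_estimated_params = 0
df = len(observed) - 1 - n_estimated_params

p_value = chi2_sf(statistic, df)
print(f"chi-square = {statistic:.2f}, df = {df}, p-value = {p_value:.2f}")
```

Under that assumption the p-value is far above 0.05, which agrees with the table's "fit looks fine". If parameters had been fitted from the same data, `n_estimated_params` would be raised accordingly and the p-value would shrink. In everyday work a library routine such as `scipy.stats.chisquare` would normally replace the hand-written helper.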

The Kolmogorov-Smirnov test: no buckets needed

The bucket habit feels arbitrary, and for continuous data like claim amounts it throws away detail. The Kolmogorov-Smirnov (K-S) test avoids buckets entirely. Recall the cumulative distribution function from the probability rung — the running total of probability up to each value. The K-S test builds an *empirical* CDF straight from the data (a staircase that steps up by 1/n at each observation) and lays it over the *theoretical* CDF of the candidate distribution. Its statistic is simply the single largest vertical gap between the two curves anywhere along the line.
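The largest-gap calculation can be sketched directly. Because the empirical CDF is a staircase, the biggest gap at each observation can occur just before or just after the step, so both sides are checked. The claim amounts below are simulated from an exponential distribution with a hypothetical mean of 5000, purely for illustration, and the two candidate models are assumptions, not fitted results.

```python
import math
import random

def ks_statistic(data, cdf):
    """Largest vertical gap between the empirical CDF of data and cdf."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        # The staircase jumps from (i - 1)/n to i/n at x; check both sides.
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

def exponential_cdf(mean):
    """Return the CDF of an exponential distribution with the given mean."""
    def cdf(x):
        return 1.0 - math.exp(-x / mean) if x > 0 else 0.0
    return cdf

# Simulated claim amounts (hypothetical), with a fixed seed for repeatability.
rng = random.Random(1)
claims = [rng.expovariate(1.0 / 5000.0) for _ in range(200)]

d_good = ks_statistic(claims, exponential_cdf(5000.0))    # the right model
d_bad = ks_statistic(claims, exponential_cdf(20000.0))    # a wrong model
print(f"gap vs mean 5000:  {d_good:.3f}")
print(f"gap vs mean 20000: {d_bad:.3f}")
```

The wrong model produces a much larger gap than the right one. Turning the gap into a p-value needs a reference distribution, which a routine such as `scipy.stats.kstest` supplies when the candidate's parameters are fixed in advance.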

That single biggest gap is intuitive — it is the place where the data most disagree with the model. A small gap means the proposed curve hugs the data all the way along; a large one means somewhere the model badly mis-tracks reality, and a small p-value tells you to reject it. Compared with chi-square, K-S is bucket-free and sensitive across the whole range, which suits continuous severity data well. But be candid about its blind spot: K-S is most alert near the middle of the distribution and relatively *weak in the tails* — exactly where an actuary cares most, since the far tail holds the catastrophic losses. A model can pass K-S yet still understate the once-in-a-century claim.

No goodness-of-fit test ever *proves* a distribution is correct; the best it can do is fail to reject it. Real practice never leans on one test alone. You pair these statistics with eyeball checks — plotting the empirical against the theoretical curve, studying the tails directly — and with judgement about whether the model makes sense for the risk. The test is a smoke detector, not a verdict of truth.

Choosing well, and staying humble

Put the pieces together and a working recipe emerges. Propose a candidate distribution; estimate its parameters by maximum likelihood; then judge the fit with a goodness-of-fit test, always backed by plots: chi-square when your data fall naturally into counts or categories, K-S (or a tail-sharper cousin such as the Anderson-Darling test) for continuous losses. When several distributions all survive, you lean on the model-selection ideas from the wider toolkit and, above all, on which one behaves sensibly in the tail.
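The recipe can be sketched end to end, including the adjustment for parameters estimated from the same data. The approach here is a parametric bootstrap: simulate samples from the fitted model, refit each one, and see how often the simulated K-S gap is at least as large as the observed one. The exponential candidate is chosen only because its maximum likelihood estimate is a one-liner (the sample mean); the claim data are simulated from a hypothetical lognormal, and the function names are illustrative.

```python
import math
import random

def ks_statistic(data, cdf):
    """Largest vertical gap between the empirical CDF of data and cdf."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        f = cdf(x)
        d = max(d, i / n - f, f - (i - 1) / n)
    return d

def fit_exponential(data):
    """Maximum likelihood estimate of the exponential mean: the sample mean."""
    return sum(data) / len(data)

def ks_for_fitted_exponential(data):
    """Fit an exponential to data, then measure the K-S gap against that fit."""
    mean = fit_exponential(data)
    return ks_statistic(data, lambda x: 1.0 - math.exp(-x / mean))

def bootstrap_p_value(data, n_sims=500, seed=0):
    """Simulated p-value for the K-S gap when the mean is estimated."""
    rng = random.Random(seed)
    n = len(data)
    fitted_mean = fit_exponential(data)
    d_obs = ks_for_fitted_exponential(data)
    exceed = 0
    for _ in range(n_sims):
        sim = [rng.expovariate(1.0 / fitted_mean) for _ in range(n)]
        # Refit on every simulated sample so the reference distribution
        # reflects the same "estimate, then test" procedure.
        if ks_for_fitted_exponential(sim) >= d_obs:
            exceed += 1
    return d_obs, (exceed + 1) / (n_sims + 1)

# Hypothetical heavy-tailed claims, simulated for illustration only.
data_rng = random.Random(42)
claims = [data_rng.lognormvariate(8.0, 1.5) for _ in range(200)]

d_obs, p_value = bootstrap_p_value(claims)
print(f"K-S gap = {d_obs:.3f}, bootstrap p-value = {p_value:.3f}")
```

Here the exponential candidate is rejected, as it should be for data this heavy-tailed. Swapping in a lognormal or Pareto candidate means replacing the fitting function, the CDF, and the simulator, while the bootstrap loop stays the same.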

Tie this back to where the rung is heading. A goodness-of-fit verdict is a yes/no judgement on a single proposed distribution; the confidence intervals you met earlier put honest error bars around the parameters you estimated; and the next guides on regression generalise all of this — testing not just *which distribution* fits, but *which drivers* (age, region, vehicle type) genuinely move the losses. Testing and fitting are the same disciplined habit, scaled up.

A final word on humility. Every test here assumes the candidate distribution is a *fixed, known shape* and the data are clean and independent. Those assumptions bend under real claims data, which often clump, drift over time, and arrive with errors. Passing a test means "not yet contradicted by this data," never "true." The model is a map, never the territory; the responsible actuary keeps testing it as fresh losses arrive and stays especially wary in the tail, where confident models have failed most expensively.