Severity Models & Heavy Tails

The other half of the machine

You have already built one half of the loss machine: in the frequency–severity decomposition you split the cost of a book of business into *how many* claims arrive and *how big* each one is. The earlier guides in this rung pinned down the frequency side — the Poisson and negative binomial counting models. This guide turns to the second half, the severity: given that a claim has happened, how many dollars does it cost? Modelling that is the job of a claim severity distribution, a continuous distribution living on the positive numbers.

Why model severity at all? Why not just average the past claims and be done? Because the average alone hides what kills insurers. Two books can share the same average claim of about $4,000, yet one is a steady stream of $4,000 fender-benders while the other is mostly $500 scratches with a $2,000,000 lawsuit roughly once in every 570 claims. The averages match; the *shapes* could not be more different, and only one of them threatens solvency. A severity distribution captures that whole shape: where the mass piles up, how far the right arm reaches, and how much of next year's total a single claim might take for itself.
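To make the two-book comparison concrete, here is a small sketch. The claim counts are assumptions chosen for illustration: roughly 570 scratches per lawsuit is the mix that brings the lumpy book's average out near $4,000.

```python
# Two hypothetical books with nearly the same average claim but very different shapes.
steady = [4_000] * 571                 # book A: every claim is a $4,000 fender-bender
lumpy = [500] * 570 + [2_000_000]      # book B: 570 scratches plus one lawsuit (assumed mix)

def summarize(claims):
    """Return the mean, the largest claim, and that claim's share of the total."""
    total = sum(claims)
    return {
        "mean": total / len(claims),
        "largest": max(claims),
        "largest_share": max(claims) / total,
    }

book_a = summarize(steady)
book_b = summarize(lumpy)
print(book_a)
print(book_b)
```

The means agree to within a couple of dollars, but in the lumpy book a single claim accounts for most of the total cost, which is exactly the information an average throws away.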

The cast of severity distributions

You already met these families when you first toured the actuary's toolkit; here we put them to work as competitors for the same job. The exponential distribution is the simplest honest sketch — one parameter, a constant decay, the *memoryless* property that the chance a loss grows by another $1,000 never changes no matter how large it already is. It is a fine first guess, but real claim data almost always wants more flexibility than its single dial allows.
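The memoryless property can be checked numerically. The sketch below assumes an exponential severity with a mean of $4,000 (an illustrative figure, not fitted to anything) and computes the conditional chance that a loss already above some level goes on to exceed it by another $1,000.

```python
import math

def exp_survival(x, mean):
    """P(loss > x) for an exponential severity with the given mean."""
    return math.exp(-x / mean)

mean_claim = 4_000   # assumed average claim size, for illustration only
extra = 1_000

# P(loss > x + extra | loss > x) for three very different starting levels x.
conditional = [
    exp_survival(x + extra, mean_claim) / exp_survival(x, mean_claim)
    for x in (0, 5_000, 50_000)
]
print(conditional)   # the three values agree up to floating-point rounding
```

Each ratio equals exp(-extra / mean_claim), whatever the starting level. That constancy is convenient, but it is also a strong claim about large losses that real data often contradicts.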

The gamma distribution adds a shape parameter, so the curve can rise to a hump before falling — good for moderate claims that cluster around a typical size. The Weibull distribution adds its own shape dial that controls how the tail behaves: tune it one way and the tail is light and well-behaved; tune it the other and it stretches out, which is why engineers love it for failure and wear-out times. The lognormal distribution tells a *multiplicative* story — a loss built from many random factors multiplied together (a repair that depends on parts × labour × delay × severity of impact) has a logarithm that is normal, and a heavily right-skewed shape that fits a surprising amount of real-world claim data.
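The multiplicative story behind the lognormal can be illustrated with a quick simulation. Everything here is invented for the sketch: the base cost, the four factors, and their uniform ranges. With only four factors the logarithm is only approximately normal, so treat this as intuition rather than proof.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

def simulated_loss():
    # Hypothetical build-up: a base cost times four independent random factors
    # (standing in for parts, labour, delay, and severity of impact).
    base = 2_000
    factors = [random.uniform(0.5, 2.0) for _ in range(4)]
    return base * math.prod(factors)

losses = [simulated_loss() for _ in range(20_000)]
logs = [math.log(x) for x in losses]

mean_loss, median_loss = statistics.mean(losses), statistics.median(losses)
mean_log, median_log = statistics.mean(logs), statistics.median(logs)

print(mean_loss, median_loss)   # mean sits well above the median: right skew
print(mean_log, median_log)     # on the log scale the two are close: near symmetry
```

The raw losses are right-skewed, while their logarithms are close to symmetric, which is the signature the lognormal family is built to capture.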

And then there is the Pareto distribution, which belongs in a category of its own and gets the whole final section. For now, line all five up by tail weight, lightest to heaviest: exponential and gamma decay fast, the lognormal trails further, and the Pareto refuses to fade, while the Weibull can land on either side of the exponential depending on its shape parameter. Choosing among them is not a beauty contest; it is a bet about how the *largest* losses behave, the very losses you have the least data on.
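A rough way to see the tail ordering is to give several families the same mean and compare their exceedance probabilities far from it. The sketch below does this for the exponential, lognormal, and Pareto; the common mean of $4,000, the lognormal sigma of 1, and the Pareto tail index of 2.5 are all assumptions picked for illustration.

```python
import math

MEAN = 4_000  # assumed common mean for all three candidates

def exponential_sf(x):
    """P(loss > x) for an exponential with mean MEAN."""
    return math.exp(-x / MEAN)

# Lognormal with sigma = 1 (assumed); mu is chosen so the mean equals MEAN.
SIGMA = 1.0
MU = math.log(MEAN) - SIGMA**2 / 2

def lognormal_sf(x):
    z = (math.log(x) - MU) / SIGMA
    return 0.5 * math.erfc(z / math.sqrt(2))

# Pareto with tail index 2.5 (assumed); the minimum loss is chosen so the mean equals MEAN.
ALPHA = 2.5
X_MIN = MEAN * (ALPHA - 1) / ALPHA

def pareto_sf(x):
    return (X_MIN / x) ** ALPHA if x > X_MIN else 1.0

tails = {}
for multiple in (5, 20, 100):
    x = multiple * MEAN
    tails[multiple] = (exponential_sf(x), lognormal_sf(x), pareto_sf(x))
    print(multiple, tails[multiple])
```

At moderate multiples of the mean the ranking depends on the chosen parameters; it is only far out, here at a hundred times the mean, that the power-law tail clearly dominates the other two.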

Fitting a severity distribution

Picking a family is only half the work; you must also choose its parameters so the curve actually matches your claims. This is loss distribution fitting, and it follows the estimation ideas you learned earlier, now aimed at claim amounts. The quickest route is the method of moments: compute the sample mean and variance of your past claims, then pick parameters that reproduce them. It is fast and gives a sensible starting point, but it leans on low moments and so pays little attention to the tail — which, for severity, is exactly where the money is.
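As a minimal sketch of the method of moments, the code below matches a gamma and a lognormal to the sample mean and variance of a small claim sample. The claim amounts are made up for the example; the formulas are the standard moment relations for each family.

```python
import math
import statistics

claims = [1_200, 2_500, 3_100, 4_000, 4_800, 6_500, 9_000, 15_000]  # made-up sample

m = statistics.mean(claims)
v = statistics.variance(claims)   # sample variance (n - 1 denominator)

# Gamma: mean = shape * scale and variance = shape * scale**2.
gamma_shape = m**2 / v
gamma_scale = v / m

# Lognormal: mean = exp(mu + sigma2 / 2) and variance = (exp(sigma2) - 1) * mean**2.
sigma2 = math.log(1 + v / m**2)
mu = math.log(m) - sigma2 / 2

print(gamma_shape, gamma_scale)
print(mu, sigma2)
```

Both fitted curves reproduce the same two moments exactly, yet they can still disagree sharply about the far tail, which is why this is a starting point rather than a final answer.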

The workhorse instead is maximum likelihood estimation: choose the parameters under which the claims you actually saw were most probable. It uses every data point, comes with standard errors so you know how shaky each estimate is, and — crucially for insurance — it can be written to honour *censored* and *truncated* data. That matters because raw claim data is rarely clean: a policy limit caps what you observe (a $3,000,000 loss on a $1,000,000 policy is recorded as exactly $1,000,000), and a deductible truncates the small claims that never get reported at all. A fit that ignores these distortions will badly misjudge the tail.
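To show how a likelihood can honour a deductible and a policy limit, here is a sketch for the one family where the answer has a closed form: the exponential. Each uncapped claim contributes its density conditional on exceeding the deductible, and each capped claim contributes the probability of exceeding the limit; maximising gives the estimator in the code. The data, the $500 deductible, and the $1,000,000 limit are invented, and a heavier-tailed family would need a numerical optimiser instead.

```python
deductible = 500
limit = 1_000_000

# Recorded ground-up losses above the deductible (made-up data).
# A value equal to the limit means the true loss was at least that large.
recorded = [1_200, 3_400, 800, 15_000, 250_000, 1_000_000, 6_000, 1_000_000]

def exponential_mle_mean(losses, deductible, limit):
    """MLE of the exponential mean under left truncation and right censoring.

    Log-likelihood: k * log(rate) - rate * sum(y - deductible), where k counts
    the uncapped claims, so the fitted mean is sum(y - deductible) / k.
    """
    uncensored = sum(1 for y in losses if y < limit)
    if uncensored == 0:
        raise ValueError("need at least one uncapped claim to estimate the mean")
    exposure = sum(y - deductible for y in losses)
    return exposure / uncensored

naive_mean = sum(recorded) / len(recorded)          # ignores both distortions
adjusted_mean = exponential_mle_mean(recorded, deductible, limit)
print(naive_mean, adjusted_mean)
```

In this sample the adjusted estimate is noticeably larger than the naive average, because the two capped claims are treated as "at least $1,000,000" rather than "exactly $1,000,000".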

Choosing among the candidates

Suppose you have fitted three candidates — say a gamma, a lognormal, and a Pareto — to the same claims. Which do you trust? You judge them on three counts, in roughly this order.

1. Does the story fit? Before any arithmetic, ask whether the distribution's underlying story matches the risk. Multiplicative damage suggests lognormal; a few catastrophic claims among many small ones suggests Pareto. A model whose story is wrong will mislead even when the numbers look fine.
2. How good is the fit, with an honest penalty for complexity? Run goodness-of-fit tests and compare the candidates. A distribution with more parameters can always hug the data more tightly, so reward fit but penalise extra parameters, for instance with an information criterion such as AIC or BIC. Otherwise you reward fitting the noise rather than capturing the real shape.
3. How well does the tail match, and how stable is it? Look hardest at the largest claims, because that is where candidates that agree on the body disagree most. Then refit on slightly different data (or drop the single biggest claim) and see how much the answer moves. If it lurches, your tail estimate is fragile, and you should lean conservative.
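The second and third checks can be sketched in a few lines. The example below uses the Akaike information criterion (AIC) as one common way to penalise extra parameters; it is a stand-in, not necessarily the test a given team would use. It then refits a Pareto tail index with and without the largest claim. The claim sample and the $1,000 threshold are made up.

```python
import math

claims = [1_100, 1_400, 1_900, 2_300, 3_000, 4_200, 6_500, 11_000, 38_000, 420_000]

def aic(log_likelihood, n_params):
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood

def exponential_loglik(xs):
    mean = sum(xs) / len(xs)                              # MLE of the mean
    return sum(-math.log(mean) - x / mean for x in xs)

def lognormal_loglik(xs):
    logs = [math.log(x) for x in xs]
    mu = sum(logs) / len(logs)                            # MLE of mu
    var = sum((l - mu) ** 2 for l in logs) / len(logs)    # MLE of sigma squared
    return sum(
        -l - 0.5 * math.log(2 * math.pi * var) - (l - mu) ** 2 / (2 * var)
        for l in logs
    )

def pareto_alpha(xs, threshold):
    """MLE of the Pareto tail index for claims above a known threshold."""
    return len(xs) / sum(math.log(x / threshold) for x in xs)

aic_exponential = aic(exponential_loglik(claims), 1)
aic_lognormal = aic(lognormal_loglik(claims), 2)
print(aic_exponential, aic_lognormal)

threshold = 1_000  # assumed: all claims are reported above this level
alpha_all = pareto_alpha(claims, threshold)
alpha_trimmed = pareto_alpha(sorted(claims)[:-1], threshold)   # drop the largest claim
print(alpha_all, alpha_trimmed)
```

On this sample the lognormal earns the lower AIC despite its extra parameter, and the tail index rises when the largest claim is removed. How far it moves is a direct reading of how much the tail estimate leans on one observation.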

Notice what is *not* on the list: "whichever has the smallest error in the middle." Two distributions can match almost perfectly across the bulk of ordinary claims and yet imply tail probabilities that differ by a factor of ten. For pricing a high policy limit or a reinsurance layer, those tail probabilities are the entire question. A severity model is chosen for the part of the picture you can barely see — which is exactly why honesty about its uncertainty matters more here than almost anywhere else in actuarial work.

Heavy tails: when one claim is the year

Now the crux. A heavy tail means the probability of an enormous loss fades *slowly* — as a power of the loss size rather than exponentially. The cleanest example is Pareto severity: double the loss threshold and the probability of exceeding it falls only by a fixed fraction, no matter how high you have already climbed. The practical consequence is startling. Under a light tail, your hundred ordinary claims and your one big claim all sit in roughly the same size range. Under a heavy tail, the single largest claim of the year can exceed the sum of all the others combined.

This is the world of catastrophe and liability insurance — earthquakes, hurricanes, mass-tort lawsuits, pandemic claims. A heavy tail breaks the comfortable intuitions you have leaned on. The sample average converges *painfully* slowly, because it keeps waiting for the next freak loss that will yank it upward, so ten quiet years tell you far less than they seem to. In the most extreme heavy-tailed cases the mathematical variance, or even the mean itself, is *infinite* — meaning no amount of past data ever pins the average down, and the comforting central-limit picture you met earlier simply does not apply.
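A simulation makes the "one claim is the year" effect tangible. The sketch below draws 100 claims per simulated year and counts how often the largest claim exceeds all the others combined, once under a Pareto with tail index 1.1 and once under an exponential. The tail index, claim count, and number of years are assumptions chosen only to illustrate the contrast.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

N_CLAIMS = 100
N_YEARS = 5_000
TAIL_INDEX = 1.1   # assumed heavy-tail index, just above the infinite-mean boundary

def dominated_fraction(draw):
    """Fraction of simulated years in which the largest claim exceeds the rest combined."""
    hits = 0
    for _ in range(N_YEARS):
        claims = [draw() for _ in range(N_CLAIMS)]
        largest = max(claims)
        if largest > sum(claims) - largest:
            hits += 1
    return hits / N_YEARS

heavy_frac = dominated_fraction(lambda: random.paretovariate(TAIL_INDEX))
light_frac = dominated_fraction(lambda: random.expovariate(1.0))
print(heavy_frac, light_frac)
```

Under the exponential this essentially never happens with a hundred claims; under the heavy tail it happens in a meaningful share of years, which is why a run of quiet years says so little about the true average.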

Pareto tail: P(loss > x) ~ (b / x)^a      (a = tail index)

  P(loss > $1,000,000) = 0.0100
  P(loss > $2,000,000) = 0.0050   (halve again per doubling, a=1)
  P(loss > $4,000,000) = 0.0025

Light (exponential) tail for contrast:
  P(loss > $1,000,000) = 0.0100
  P(loss > $2,000,000) = 0.0001   (squares per doubling; vanishes far faster)
  P(loss > $4,000,000) = 0.00000001

The same 1% chance of a $1M loss, two very different worlds. Under the power-law Pareto a $4M loss is a quarter as likely as a $1M loss; under the exponential its probability has dropped to about one in a hundred million. When you price a high layer, that gap is everything.
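The figures in the comparison can be reproduced directly. This sketch calibrates both tails to the same anchor, a 1% chance of exceeding $1,000,000, and then evaluates them at higher thresholds. The function and variable names are illustrative.

```python
import math

anchor_loss, anchor_prob = 1_000_000, 0.01

# Pareto tail with index a = 1: P(loss > x) = (b / x) ** a, with b fixed by the anchor.
a = 1.0
b = anchor_loss * anchor_prob ** (1 / a)

# Exponential tail: P(loss > x) = exp(-rate * x), with rate fixed by the same anchor.
rate = -math.log(anchor_prob) / anchor_loss

def pareto_tail(x):
    return (b / x) ** a

def exponential_tail(x):
    return math.exp(-rate * x)

table = {
    x: (pareto_tail(x), exponential_tail(x))
    for x in (1_000_000, 2_000_000, 4_000_000)
}
for x, (p_heavy, p_light) in table.items():
    print(f"{x:>10,}  pareto={p_heavy:.4f}  exponential={p_light:.2e}")
```

Each doubling halves the Pareto probability but squares the exponential one, so the two models part company within a couple of doublings of the anchor.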

So treat the tail with humility. Choosing a light tail because it fits the everyday claims and looks tidy is one of the classic roads to ruin: the insurer books healthy profits through the calm years, then a single tail event it never priced for arrives and wipes out a decade of earnings. This is also why tail-aware risk measures matter — a measure like tail value at risk looks at the *average* size of losses beyond a threshold, capturing the depth of the tail that a simple value-at-risk cutoff steps right over. The model is not the risk; the tail is where that gap does its damage.
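The difference between value at risk and tail value at risk shows up clearly when two models share the same VaR. The sketch below uses the closed-form results for an exponential and for a Pareto, both calibrated to a 99% VaR of $1,000,000. The Pareto tail index of 1.5 is an assumption: TVaR needs an index above 1 so that the mean beyond the threshold is finite.

```python
import math

level = 0.99
var_99 = 1_000_000        # both models are calibrated to this 99% VaR (assumed figure)

# Exponential with mean theta: VaR_p = -theta * log(1 - p), TVaR_p = VaR_p + theta.
theta = var_99 / -math.log(1 - level)
exponential_tvar = var_99 + theta

# Pareto with minimum b and tail index alpha > 1:
# VaR_p = b * (1 - p) ** (-1 / alpha), TVaR_p = VaR_p * alpha / (alpha - 1).
alpha = 1.5               # assumed tail index
b = var_99 * (1 - level) ** (1 / alpha)
pareto_tvar = var_99 * alpha / (alpha - 1)

print(exponential_tvar, pareto_tvar)
```

Both models report the same VaR, yet the average loss beyond it is roughly 1.2 times the VaR for the exponential and three times the VaR for this Pareto. A VaR cutoff alone cannot tell the two apart.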