Estimating From Data: MLE & Method of Moments

The question probability never asked

Everything in the probability rung began with a model already chosen and its dials already set: a Poisson distribution with mean exactly 3 claims a year, a payout with a known expected value. From there you could compute anything. But notice the quiet assumption — somebody handed you that 3. Where did it come from? In the real world nobody hands it to you. You inherit a spreadsheet of last year's claims, and the model's dials are unknown. Statistics is the discipline of turning the question around: given the data, what were the dials probably set to?

Recall the distinction from the start of this rung between the population and the sample. The population is the whole, usually unknowable, truth — every claim that could ever arise from this kind of policy, governed by some true parameter we will call θ (theta). The sample is the modest pile of data we actually observed. Point estimation is the act of producing a single best number for θ from that sample — a point estimate, like 'the mean claim is probably about 1,840.' This guide builds two honest ways to manufacture that number, and then asks the harder question: how do we know the number is any good?

Method of moments: match what you can see

The first method is so natural it feels like common sense, which is exactly its charm. You learned in the probability rung that a distribution has theoretical moments — its mean, its variance — written as formulas in terms of the unknown parameters. You also have a sample, from which you can compute the corresponding sample quantities: the plain average of your data, the spread of your data. The method of moments simply sets them equal and solves. If the theory says the mean equals θ, and your data averages 1,840, then declare θ-hat = 1,840 and move on.

Suppose you model annual claim counts as Poisson, whose single parameter λ equals its mean. You observed counts averaging 2.7 over many years. The method of moments shrugs and says: λ-hat = 2.7. Done. When a distribution has two unknown parameters, you need two equations, so you match the first two moments: set the theoretical mean equal to the sample mean and the theoretical variance equal to the sample variance, then solve the pair together. The method's great virtue is that it usually gives you an answer with grade-school algebra, even when the fancier method below gets stuck.
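The two-parameter case can be sketched in a few lines. This is only an illustration under stated assumptions: the Gamma distribution is chosen here as an example of a two-parameter loss model (the guide does not name one), it is parameterised by a shape and a scale so that the mean is shape × scale and the variance is shape × scale², and the claim sizes are invented.

```python
from statistics import mean, variance

def gamma_method_of_moments(data):
    """Match the Gamma mean (shape * scale) and variance (shape * scale**2)
    to the sample mean and sample variance, then solve the pair."""
    m = mean(data)
    v = variance(data)        # sample variance (n - 1 divisor)
    scale_hat = v / m         # from variance / mean = scale
    shape_hat = m * m / v     # from mean**2 / variance = shape
    return shape_hat, scale_hat

# Hypothetical claim sizes, invented for illustration only.
claims = [1200, 950, 2100, 1800, 3150, 1840]
shape_hat, scale_hat = gamma_method_of_moments(claims)
print(shape_hat, scale_hat)
```

By construction, multiplying the two estimates back together returns the sample mean of the data, which is the whole idea of matching moments.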

The method's weakness is the flip side of its simplicity. It only listens to a couple of summary numbers and ignores the detailed shape of the data, so it can throw away information that a heavy-tailed insurance loss is busy trying to tell you. It can even hand back nonsense — a negative variance estimate, or a parameter outside its legal range — because it never checked whether its answer was plausible. It is the quick, blunt tool: reach for it first, especially as a starting guess, but do not trust it to squeeze every drop of insight from the data.
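A small sketch shows how the method can hand back a parameter outside its legal range. The assumptions are all illustrative: the negative binomial is picked here as a two-parameter count model (it is not named in the guide), parameterised by r > 0 and 0 < p < 1 with mean r(1 − p)/p and variance r(1 − p)/p², and the four yearly counts are hypothetical. Because these counts vary less than their mean, the algebra still produces numbers, but not legal ones.

```python
from statistics import mean, variance

def negbin_method_of_moments(counts):
    """Solve mean = r(1-p)/p and variance = r(1-p)/p**2 for r and p.
    No plausibility check is made, which is exactly the weakness shown.
    (Breaks with a division by zero if variance == mean.)"""
    m = mean(counts)
    v = variance(counts)      # sample variance (n - 1 divisor)
    p_hat = m / v
    r_hat = m * m / (v - m)
    return r_hat, p_hat

counts = [2, 4, 3, 3]         # hypothetical yearly claim counts
r_hat, p_hat = negbin_method_of_moments(counts)
is_legal = r_hat > 0 and 0 < p_hat < 1
print(r_hat, p_hat, is_legal)
```

Here the sample variance (about 0.67) is smaller than the sample mean (3), so p-hat comes out above 1 and r-hat comes out negative. The sensible reading is that this model does not suit these data, not that the estimates mean anything.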

Maximum likelihood: which dial setting makes the data least surprising?

The second method is deeper, and once it clicks it never leaves you. Imagine you could try every possible value of θ in turn. For each candidate, ask: if θ really were this, how probable would it be to see exactly the data I saw? That number, the probability of the observed data (or the probability density, when the data are continuous) viewed as a function of the parameter, is called the likelihood. Most candidate values make your particular data look like a freak coincidence; a few make it look entirely ordinary. Maximum likelihood estimation picks the value of θ that makes your actual data the least surprising it could possibly be.

An everyday picture: you find a coin on the ground, flip it ten times, and get seven heads. Which bias would best explain that? A coin that lands heads 10% of the time would make seven-out-of-ten a near-miracle; a 70%-heads coin makes it the single most likely outcome. So maximum likelihood declares the estimated probability of heads to be 0.7 — the value under which what you saw was most expected. The beautiful thing is that this reasoning works for any model: write down the probability of your data as a function of θ, then climb to its peak.
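The coin reasoning can be checked by brute force. The sketch below assumes the flips are independent, so the chance of seven heads in ten is binomial; the function name and the grid of candidate biases are illustrative choices, not part of the guide.

```python
from math import comb

def binomial_likelihood(p, heads=7, flips=10):
    """Probability of seeing `heads` heads in `flips` flips if the bias is p."""
    return comb(flips, heads) * p**heads * (1 - p)**(flips - heads)

# Try every candidate bias from 0.01 to 0.99 and keep the best explainer.
candidates = [i / 100 for i in range(1, 100)]
p_hat = max(candidates, key=binomial_likelihood)

print(p_hat)
print(binomial_likelihood(0.1), binomial_likelihood(0.7))
```

The 10%-heads coin gives the observed result a probability of under one in a hundred thousand, while the 70%-heads coin gives it roughly 27%, and no other candidate on the grid does better.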

In practice the likelihood is a product of one factor per data point (assuming the observations are independent), and products of many small probabilities are numerically nasty, so we maximise its logarithm instead, turning the product into a friendly sum (the log-likelihood). Because the logarithm is an increasing function, both peak at the same value of θ. The peak is then found with calculus or, far more often in real work, by letting a computer climb the hill. The reward for this extra effort is real: maximum likelihood listens to the whole dataset, not just a couple of moments, and under standard regularity conditions it is asymptotically efficient, meaning that as samples grow it attains the smallest spread a well-behaved estimator can have. It is the workhorse behind fitting loss distributions and the regression models waiting later in this rung.
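To see why the product is numerically nasty, here is a minimal sketch using a Poisson model with λ fixed at 3. The data are hypothetical (four counts repeated to make 2,000 observations) and the helper names are illustrative.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k events under a Poisson with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def poisson_log_pmf(k, lam):
    """Logarithm of the same probability, computed without ever forming it."""
    return -lam + k * math.log(lam) - math.lgamma(k + 1)

counts = [2, 4, 3, 3] * 500   # hypothetical: 2,000 yearly counts
lam = 3.0

likelihood = math.prod(poisson_pmf(k, lam) for k in counts)
log_likelihood = sum(poisson_log_pmf(k, lam) for k in counts)

print(likelihood)        # underflows to 0.0
print(log_likelihood)    # a finite, large negative number
```

The raw product is too small for floating-point arithmetic to represent, so every candidate λ would look equally impossible. The sum of logarithms stays finite and can still be compared across candidates.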

A tiny worked estimate side by side

Let us make it concrete with the Poisson count model, where the two methods happen to agree — a reassuring place to start. Say four years of data show 2, 4, 3, and 3 claims. The sample mean is (2+4+3+3)/4 = 3. The method of moments matches the Poisson mean λ to this and reports λ-hat = 3. Maximum likelihood, after writing the log-likelihood and finding its peak, lands on exactly the same answer: for the Poisson, the most likely λ is precisely the sample average. Two very different philosophies, one number.
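For readers who want the calculus step spelled out, the derivation runs roughly as follows. The notation is an assumption made here: x_1, ..., x_n are the observed counts, treated as independent, x-bar is their average, and ℓ is the log-likelihood.

```latex
\begin{align*}
\ell(\lambda) &= \sum_{i=1}^{n} \bigl( x_i \log \lambda - \lambda - \log x_i! \bigr) \\
\frac{d\ell}{d\lambda} &= \frac{1}{\lambda} \sum_{i=1}^{n} x_i - n = 0
  \quad\Longrightarrow\quad
  \hat{\lambda} = \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x} \\
\frac{d^2\ell}{d\lambda^2} &= -\frac{1}{\lambda^2} \sum_{i=1}^{n} x_i < 0
  \quad \text{(so the stationary point is a maximum)}
\end{align*}
```

With the four counts above, the sum is 12 and n = 4, so the first-order condition reads 12/λ − 4 = 0, giving λ-hat = 3.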

Data (claims per year): 2, 4, 3, 3     n = 4
Method of moments:  set lambda = sample mean
   lambda_hat = (2+4+3+3)/4 = 3
Maximum likelihood (Poisson):
   peak of log-likelihood also occurs at the sample mean
   lambda_hat = 3   <- same answer here, NOT a coincidence for Poisson

Use it: P(0 claims next year) = e^-3 = 0.0498  (about a 1-in-20 quiet year)

For the Poisson the two methods coincide; once you have lambda-hat you can price next year — but everything downstream now rests on an estimate, not a known truth.
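The same result can be confirmed by letting a computer search for the peak. This is a sketch, not a production fitting routine: it assumes independent Poisson counts, uses a plain grid search rather than a proper optimiser, and the function name and grid are illustrative.

```python
import math

def poisson_log_likelihood(lam, counts):
    """Sum of log Poisson probabilities of the observed counts at rate lam."""
    return sum(-lam + k * math.log(lam) - math.lgamma(k + 1) for k in counts)

counts = [2, 4, 3, 3]

# Candidate rates 0.001, 0.002, ..., 10.000; keep the one with the highest
# log-likelihood.
grid = [i / 1000 for i in range(1, 10001)]
lam_hat = max(grid, key=lambda lam: poisson_log_likelihood(lam, counts))

p_zero = math.exp(-lam_hat)   # estimated chance of a claim-free year
print(lam_hat, round(p_zero, 4))
```

The search settles on the sample mean of 3, and plugging that into the Poisson formula reproduces the 0.0498 chance of a quiet year.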

Two warnings ride along with that clean answer. First, the methods agree for the Poisson but routinely disagree for skewed loss distributions, where maximum likelihood usually wins by respecting the tail. Second, and more important: λ-hat = 3 is built on a mere four years. Plug it into next year's pricing as if it were carved in stone and you have committed the cardinal sin of forgetting the estimate is itself uncertain. How uncertain? That is the very next question.

What makes an estimate good?

We now have two machines for producing a number. But a machine can produce a bad number confidently, so we need standards. Because an estimator is itself a random variable, with its own little distribution over all the samples you might have drawn, we can judge it the way we judge any random variable, with the moments from the probability rung. Three properties matter, and an actuary should be able to recite them.

Unbiased: on average, right. If you repeated the whole study endlessly, the estimates would centre on the true θ, with no systematic lean. Bias is a systematic tilt that repeating the study does not average away, like a scale that always reads two kilos heavy. Some biases persist however much data you collect; others fade as the sample grows.
Consistent: it homes in. As the sample grows toward infinity, the estimate closes in on the true θ and stays there. This is the law of large numbers wearing a statistician's hat: more data, sharper aim. An estimator can be slightly biased yet still consistent, provided the bias fades as the sample grows, which is often a fine trade.
Efficient: it wastes nothing. Among unbiased estimators, the efficient one has the smallest spread around the truth, so a single sample tends to land closer. Maximum likelihood's claim to fame is that, for large samples, it is essentially the most efficient there is.
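A small simulation can make "slightly biased yet still consistent" concrete. The example is an assumption chosen for illustration and is not discussed in the guide: the variance estimator that divides by n instead of n − 1, applied to samples from a normal distribution whose true variance is 1. The seed, sample sizes, and repeat count are arbitrary.

```python
import random
from statistics import fmean

def variance_divide_by_n(sample):
    """Variance estimate that divides by n rather than n - 1."""
    m = fmean(sample)
    return sum((x - m) ** 2 for x in sample) / len(sample)

def average_estimate(n, repeats, rng):
    """Average the estimate over many redrawn samples of size n."""
    return fmean(
        variance_divide_by_n([rng.gauss(0.0, 1.0) for _ in range(n)])
        for _ in range(repeats)
    )

rng = random.Random(0)        # fixed seed so the run is repeatable
results = {}
for n in (4, 16, 64, 256):
    results[n] = average_estimate(n, 4000, rng)
    print(n, round(results[n], 3))
```

In theory this estimator averages (n − 1)/n of the true variance, so the simulated averages should sit near 0.75 for n = 4 and creep toward 1 as n grows. The lean is real at small samples and fades with more data. The simulation only tracks the average; the narrowing spread that consistency also requires is not shown.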

The spread these yardsticks keep referring to is reported as one honest summary number: the standard error, which is simply the standard deviation of your estimator, or how much θ-hat would wobble if you redrew the sample. A small standard error means your number is precise; a large one is the estimate confessing that it is little more than a rumour. Precision is not the whole story, though: a biased estimator can wobble very little around the wrong value. The standard error typically shrinks in proportion to one over the square root of the sample size, which is why quadrupling your data only halves your uncertainty, a humbling exchange rate that recurs throughout actuarial work, from credibility to reserving.
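For the Poisson example the standard error can be sketched directly. The assumption is that the yearly counts are independent Poisson draws, so the variance of their average is λ/n; since the true λ is unknown, the estimate λ-hat is plugged in. The function name is illustrative.

```python
import math

def poisson_mean_standard_error(lam_hat, n):
    """Plug-in standard error of the sample mean of n independent Poisson
    counts: sqrt(lam_hat / n)."""
    return math.sqrt(lam_hat / n)

se_4 = poisson_mean_standard_error(3.0, 4)     # the four years in the example
se_16 = poisson_mean_standard_error(3.0, 16)   # hypothetical: four times the data
print(round(se_4, 3), round(se_16, 3))
```

With four years of data the standard error is about 0.87 claims against an estimate of 3. Sixteen years would bring it down to about 0.43, half as large for four times the data.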

Honest cautions before you trust a number

A caution to carry forward: a single point estimate, however good its pedigree, hides its own uncertainty by design. Reporting 'λ-hat = 3' with no standard error is like quoting a premium to the cent while privately knowing it could plausibly be anywhere from 2 to 4. This is why a serious actuary almost never reports a bare point estimate; the next guide pairs it with a confidence interval, an honest range, so that the reader sees both the best guess and how firmly to hold it.

So you leave this guide with two reliable ways to wring a parameter out of data, the quick method of moments and the sharper maximum likelihood, and, just as crucial, the three yardsticks (unbiased, consistent, efficient) and the standard error that tell you whether to believe the answer. The pattern from here on never changes: estimate a parameter, attach its uncertainty, then let an honest model carry it into pricing or reserving. Estimation is where statistics finally touches the messy world that probability never had to face.