Bayesian Inference: Priors, Likelihoods, Posteriors

From a theorem about events to a way of learning

You first met Bayes' theorem back in the conditioning rung as a way to flip a conditional probability: P(A given B) = P(B given A) P(A) / P(B). That was a statement about two events. The leap that powers much of modern data analysis is to apply the very same algebra not to an event but to an unknown number (a coin's true bias, a drug's true effect, a galaxy's true distance) and to treat our uncertainty about that number as a full probability distribution. This move is Bayesian inference, and it turns the flipping of a conditional into a rule for learning from data.

Write the unknown as theta. Before seeing any data we encode our belief about theta as a prior distribution, P(theta). The data D arrive, and the likelihood P(D given theta) measures how well each candidate value of theta would have predicted exactly the data we saw. Bayes' theorem combines them into the posterior, P(theta given D) — our belief about theta after the data are in. In words: posterior is proportional to likelihood times prior. The whole engine of this rung sits on that one sentence.

                 P(D given theta) * P(theta)
P(theta given D) = ---------------------------
                          P(D)

  posterior   =   likelihood  x  prior   /   evidence

  P(D) = integral over theta of  P(D given theta) * P(theta)
         (the normalizing constant; just makes the posterior integrate to 1)

The Bayesian update on one card. The denominator P(D) does not depend on theta, so the posterior's SHAPE comes entirely from likelihood times prior.
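
To make the card concrete, here is a minimal sketch of the same update over a finite set of candidate values, where the integral for P(D) becomes a plain sum. The function name bayes_update, the three candidate biases, and the single-flip data are all assumptions chosen for illustration, not part of the guide's running example.

```python
def bayes_update(prior, likelihood):
    """Return the posterior over a finite set of candidate theta values.

    prior      -- dict mapping each candidate theta to P(theta)
    likelihood -- dict mapping each candidate theta to P(D given theta)
    """
    # Numerator of Bayes' theorem for every candidate: likelihood * prior.
    unnormalized = {theta: likelihood[theta] * prior[theta] for theta in prior}
    # Evidence P(D): a sum over candidates, standing in for the integral.
    evidence = sum(unnormalized.values())
    return {theta: weight / evidence for theta, weight in unnormalized.items()}


# Hypothetical setup: three candidate coin biases, equally likely beforehand.
prior = {0.2: 1 / 3, 0.5: 1 / 3, 0.8: 1 / 3}
# Data D: a single flip that landed heads, so P(D given theta) = theta.
likelihood = {theta: theta for theta in prior}

posterior = bayes_update(prior, likelihood)
for theta, prob in posterior.items():
    print(f"P(theta = {theta} given D) = {prob:.4f}")
```

Dividing by the evidence changes every entry by the same factor, so the ranking of candidates is already fixed by likelihood times prior.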

The likelihood is not a probability over theta

The single most common confusion here is worth pinning down before we compute anything. The likelihood P(D given theta) is read as a function of theta, with the data D held fixed at what you actually observed. As a function of theta it is NOT a probability distribution: it need not integrate to 1, and the area under it means nothing. It only tells you the relative plausibility of each theta in light of the data. The prior supplies the probability statement; the likelihood only re-weights it.
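
A quick numerical check can make this tangible. The sketch below assumes the data from the coin example worked later in this guide (8 heads in 10 flips) and uses a simple midpoint rule; the helper names likelihood and integrate are illustrative choices. It compares the area under the likelihood as a function of theta with the sum over all possible data at one fixed theta.

```python
from math import comb


def likelihood(theta, heads=8, flips=10):
    """Binomial probability of the data, viewed as a function of theta."""
    return comb(flips, heads) * theta**heads * (1 - theta) ** (flips - heads)


def integrate(f, lo=0.0, hi=1.0, steps=100_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width


# Area under the likelihood over theta, with the data held fixed.
area_over_theta = integrate(likelihood)
print(f"area under the likelihood over theta: {area_over_theta:.4f}")

# Contrast: for a FIXED theta, summing over all possible data does give 1.
total_over_data = sum(likelihood(0.8, heads=k) for k in range(11))
print(f"sum over possible head counts at theta = 0.8: {total_over_data:.4f}")
```

The area over theta comes out to 1/11, not 1, while the sum over the eleven possible head counts is 1: the binomial formula is a probability distribution over data, not over theta.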

This also clears up a frequentist reflex. The classic maximum likelihood estimate picks the single theta that maximizes the likelihood — the value that best explains the data and nothing more. The Bayesian keeps the whole curve and multiplies it by the prior. If your prior is flat (uniform) over theta, the posterior is just the likelihood renormalized, and its peak sits exactly at the maximum likelihood estimate. So maximum likelihood is the special case of a Bayesian who happened to start with a flat prior and then reports only the peak. The Bayesian who reports the whole posterior is carrying strictly more information.

A worked update: is this coin fair?

Let theta be a coin's probability of heads, somewhere in [0, 1]. We flip it 10 times and see 8 heads. The likelihood for any theta is the binomial probability of that outcome: P(8 heads in 10 given theta) is proportional to theta^8 * (1 - theta)^2, with the binomial coefficient absorbed into the constant we will normalize away. Notice this likelihood peaks at theta = 0.8 — the maximum likelihood estimate — but it is a broad hill, not a spike, because 10 flips is not much evidence.
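
The shape of that hill can be checked by evaluating the likelihood on a grid. This is only a sketch: the grid spacing of 0.001 and the comparison points are arbitrary choices made for illustration.

```python
def likelihood(theta):
    """Unnormalized likelihood of 8 heads in 10 flips."""
    return theta**8 * (1 - theta) ** 2


# Candidate values 0.000, 0.001, ..., 1.000 (the resolution is arbitrary).
grid = [i / 1000 for i in range(1001)]

# The grid point with the largest likelihood approximates the MLE.
mle = max(grid, key=likelihood)
print(f"likelihood peaks at theta = {mle}")

# How broad is the hill? Compare other candidates with the peak.
for theta in (0.5, 0.6, 0.7, 0.9):
    ratio = likelihood(theta) / likelihood(mle)
    print(f"theta = {theta}: {ratio:.2f} times as plausible as the peak")
```

A value as far from the peak as theta = 0.6 still comes out at roughly 40% of the peak's likelihood, which is what a broad hill looks like in numbers.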

Now we need a prior. Suppose we genuinely know nothing and use the flat prior P(theta) = 1 on [0, 1]. Then the posterior is proportional to theta^8 * (1 - theta)^2 — the likelihood, renormalized to integrate to 1. The posterior mean works out to (8 + 1) / (10 + 2) = 9/12 = 0.75, slightly pulled back from the raw 0.8 toward the center, because the flat prior quietly contributes the weight of 'one imagined head and one imagined tail'. That gentle shrinkage is a feature: with little data, the prior keeps us from over-committing to 0.8.

1. Name the unknown and its prior: theta in [0, 1] is the heads-probability; take the flat prior P(theta) = 1.
2. Write the likelihood for the data you actually saw: 8 heads in 10 gives likelihood proportional to theta^8 * (1 - theta)^2.
3. Multiply prior by likelihood: posterior proportional to 1 * theta^8 * (1 - theta)^2.
4. Normalize so the posterior density integrates to 1, then summarize it: here the posterior mean is (8+1)/(10+2) = 0.75.
5. Answer the real question with the posterior: e.g. compute P(theta > 0.5 given D) to gauge how sure we are the coin favors heads.
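
The steps above can be sketched numerically on a fine grid over [0, 1]. This is a grid approximation under stated assumptions (100,000 cells and a midpoint rule are arbitrary choices), not the only way to carry out the normalization.

```python
N = 100_000  # number of grid cells; an arbitrary resolution choice
width = 1.0 / N
thetas = [(i + 0.5) * width for i in range(N)]  # cell midpoints in [0, 1]

# Name the unknown and its prior: flat density on [0, 1].
prior = [1.0 for _ in thetas]
# Likelihood of 8 heads in 10 flips, up to a constant factor.
likelihood = [t**8 * (1 - t) ** 2 for t in thetas]
# Multiply prior by likelihood.
unnormalized = [p * lik for p, lik in zip(prior, likelihood)]
# Normalize so the density integrates to 1, then summarize with the mean.
evidence = sum(unnormalized) * width
posterior = [u / evidence for u in unnormalized]
posterior_mean = sum(t * d for t, d in zip(thetas, posterior)) * width
# Answer the real question: how much posterior mass lies above 0.5?
prob_favors_heads = sum(d for t, d in zip(thetas, posterior) if t > 0.5) * width

print(f"posterior mean     = {posterior_mean:.4f}")
print(f"P(theta > 0.5 | D) = {prob_favors_heads:.4f}")
```

The mean reproduces 0.75, and the posterior puts about 97% of its mass above 0.5, so the data favor a heads-biased coin fairly strongly but not conclusively.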

Conjugate priors: when the algebra disappears

Computing that normalizing integral by hand is painful in general, and in higher dimensions it is the central difficulty that the next guides on Monte Carlo and MCMC exist to defeat. But for the coin there is a magic shortcut. Replace the flat prior with a Beta distribution, Beta(a, b), whose density is proportional to theta^(a-1) * (1 - theta)^(b-1). Multiply it by the binomial likelihood theta^h * (1 - theta)^t (with h heads and t tails) and the exponents simply add: the posterior is proportional to theta^(a + h - 1) * (1 - theta)^(b + t - 1), which is exactly Beta(a + h, b + t).

That is the idea of a conjugate prior: a prior family that, paired with a given likelihood, hands back a posterior in the same family. The update becomes pure bookkeeping — add the observed heads to a, add the observed tails to b, and you are done; no integral required. The flat prior we used earlier is just Beta(1, 1) in disguise, which is why our posterior mean came out as (8+1)/(10+2): the Beta(a, b) mean is a/(a+b), so after the update it is (a+h)/(a+b+h+t) = (1+8)/(1+1+8+2). The prior parameters a and b read as 'pseudo-counts' of heads and tails you are pretending to have seen before the real data.
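
The bookkeeping is short enough to write out. In this sketch the helper names beta_update and beta_mean are illustrative, and the particular order of flips in the sequential loop is made up; only the totals (8 heads, 2 tails) come from the example.

```python
def beta_update(a, b, heads, tails):
    """Conjugate update: Beta(a, b) prior plus coin data gives Beta(a + heads, b + tails)."""
    return a + heads, b + tails


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Flat prior Beta(1, 1), then 8 heads and 2 tails in one batch.
a, b = beta_update(1, 1, heads=8, tails=2)
print(f"posterior: Beta({a}, {b}), mean = {beta_mean(a, b):.2f}")

# Updating one flip at a time lands in the same place.
# The flip order below is hypothetical; it contains 8 H and 2 T.
a_seq, b_seq = 1, 1
for flip in "HHTHHHHTHH":
    a_seq, b_seq = beta_update(a_seq, b_seq, heads=int(flip == "H"), tails=int(flip == "T"))
print(f"flip by flip: Beta({a_seq}, {b_seq})")
```

Because the update only adds counts, yesterday's posterior can serve as today's prior, and the result does not depend on how the data are batched.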

Reading the posterior, and being honest about the prior

The posterior is the answer, but you usually need to summarize it. A point summary like the posterior mean is one number; a credible interval is far more informative. A 95% credible interval is any interval that holds 95% of the posterior probability (the equal-tailed interval and the highest-density interval are two common choices), so you can say plainly 'given the data and prior, there is a 95% probability that theta lies in here'. That direct probability statement about the parameter is exactly what the frequentist confidence interval does NOT give you, a subtle but real distinction we will sharpen in the statistics guide.
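
As a sketch of how such an interval might be computed for the coin, the function below scans a grid approximation of the Beta(9, 3) posterior and cuts 2.5% of the probability from each tail. The equal-tailed convention, the function name beta_credible_interval, and the grid resolution are assumptions made here for illustration.

```python
def beta_credible_interval(a, b, mass=0.95, cells=100_000):
    """Equal-tailed credible interval for Beta(a, b), found by scanning a grid CDF."""
    width = 1.0 / cells
    thetas = [(i + 0.5) * width for i in range(cells)]
    # Unnormalized Beta density at each cell midpoint.
    density = [t ** (a - 1) * (1 - t) ** (b - 1) for t in thetas]
    total = sum(density)
    tail = (1 - mass) / 2  # probability left outside on each side
    lower = upper = None
    cumulative = 0.0
    for theta, d in zip(thetas, density):
        cumulative += d / total  # running approximation of the CDF
        if lower is None and cumulative >= tail:
            lower = theta
        if upper is None and cumulative >= 1 - tail:
            upper = theta
            break
    return lower, upper


# Posterior for 8 heads in 10 flips with a flat prior is Beta(9, 3).
low, high = beta_credible_interval(9, 3)
print(f"95% credible interval for theta: ({low:.3f}, {high:.3f})")
```

The interval runs from roughly 0.48 to 0.94: wide, as expected after only 10 flips, and it just barely includes a fair coin.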

Often what you really want is a prediction, not the parameter. The posterior predictive distribution answers 'what will the NEXT flip do?' by averaging the likelihood of a new outcome over the whole posterior, not over a single best-guess theta. For our coin after 8-of-10 heads with a flat prior, the predictive probability that the next flip is heads is the posterior mean, 0.75. Crucially, that figure accounts for our remaining uncertainty about theta, rather than pretending we know theta = 0.8 exactly. Plugging in the point estimate 0.8 would overstate the chance of heads on the next flip, and for predictions about several future flips it would understate how uncertain the outcome really is.
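
The averaging can be sketched on a grid. The code below assumes the Beta(9, 3) posterior from the coin example and, as an illustrative extension not in the text above, also looks at runs of five future flips to show where a plug-in estimate and the posterior predictive disagree most. The helper names predictive and plug_in are hypothetical.

```python
from math import comb

CELLS = 100_000  # grid resolution; an arbitrary choice
WIDTH = 1.0 / CELLS
THETAS = [(i + 0.5) * WIDTH for i in range(CELLS)]

# Posterior after 8 heads in 10 flips with a flat prior, as mass per grid cell.
_weights = [t**8 * (1 - t) ** 2 for t in THETAS]
_total = sum(_weights)
POSTERIOR = [w / _total for w in _weights]


def predictive(k, m):
    """P(k heads in the next m flips), averaged over the whole posterior."""
    return sum(
        comb(m, k) * t**k * (1 - t) ** (m - k) * p
        for t, p in zip(THETAS, POSTERIOR)
    )


def plug_in(k, m, theta):
    """The same probability if we pretend theta is known exactly."""
    return comb(m, k) * theta**k * (1 - theta) ** (m - k)


print(f"P(next flip is heads), predictive    = {predictive(1, 1):.4f}")
print(f"P(next 5 all tails), predictive      = {predictive(0, 5):.5f}")
print(f"P(next 5 all tails), theta = 0.8     = {plug_in(0, 5, 0.8):.5f}")
```

The predictive probability of five tails in a row is many times larger than the plug-in value, because the posterior still gives some weight to values of theta well below 0.8.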

Finally, the honest caveat that defines the whole method. The posterior depends on the prior, so a Bayesian conclusion is only as defensible as the prior behind it: choose a prior badly and you can bias the answer, and there is no fully 'objective' choice that everyone agrees on. The practical defenses are real: with plenty of data the likelihood dominates and reasonable priors lead to nearly the same posterior; you can run a sensitivity check across several priors; and stating the prior openly makes your assumptions auditable rather than hidden. The Bayesian-versus-frequentist debate is, at heart, an argument about exactly this: whether putting a probability on an unknown constant is a strength or an overreach. Both camps compute the same likelihood. They part ways on whether theta deserves a distribution of its own.
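
A sensitivity check of this kind takes only a few lines for the coin. In the sketch below, the three Beta priors and the two larger data sets are made up for illustration; only the 8-heads-in-10 data set comes from the example above.

```python
def posterior_mean(a, b, heads, tails):
    """Posterior mean of theta for a Beta(a, b) prior and coin data."""
    return (a + heads) / (a + b + heads + tails)


# Hypothetical priors, written as pseudo-counts (heads, tails).
priors = {
    "flat Beta(1, 1)": (1, 1),
    "leans fair Beta(10, 10)": (10, 10),
    "leans tails Beta(2, 8)": (2, 8),
}

# The same 80% heads rate at three sample sizes (the larger two are invented).
datasets = [(8, 2), (80, 20), (8000, 2000)]

spreads = []
for heads, tails in datasets:
    means = {
        name: posterior_mean(a, b, heads, tails) for name, (a, b) in priors.items()
    }
    # Gap between the most and least optimistic posterior means.
    spread = max(means.values()) - min(means.values())
    spreads.append(spread)
    print(f"{heads + tails} flips: spread across priors = {spread:.4f}")
```

With 10 flips the three priors disagree by a quarter of the whole range; with 10,000 flips the disagreement shrinks to well under a hundredth, which is the sense in which the likelihood dominates.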