Bayes' Theorem & Updating Beliefs

Probability that depends on what you know

You already met a random variable and a probability distribution earlier in this rung. Bayes' theorem is what happens when you take that machinery and ask a very human question: *now that I have seen something, how should I update what I believe?* That single move — turning fresh evidence into a revised belief — is the heartbeat of an enormous amount of statistics and machine learning.

The first idea you need is conditional probability: the chance of one thing *given that* another thing is true. We write it P(A | B), read "the probability of A given B." The vertical bar means "in the world where B has already happened." For example, the chance a random person is carrying an umbrella is one number; the chance they are carrying one *given that it is raining outside* is usually higher. Conditioning is just narrowing the world down to the cases that match what you observed, then asking your question inside that smaller world.
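The narrowing-down idea can be sketched in a few lines of Python. The ten-person population below is invented purely for illustration (the tuples, the counts, and the helper names are all assumptions). The only point is that P(A | B) is an ordinary fraction, computed inside the smaller world where B holds.

```python
# Hypothetical toy population: (has_umbrella, is_raining) for 10 people.
# These values are made up for illustration only.
people = [
    (True, True), (True, True), (True, True), (False, True),
    (True, False), (True, False), (False, False),
    (False, False), (False, False), (False, False),
]


def probability(event, world):
    """Fraction of cases in `world` for which `event` is true."""
    return sum(1 for case in world if event(case)) / len(world)


def has_umbrella(case):
    return case[0]


def is_raining(case):
    return case[1]


# Unconditional: ask the question over everyone.
p_umbrella = probability(has_umbrella, people)

# Conditional: first narrow the world to the rainy cases, then ask again.
rainy_world = [case for case in people if is_raining(case)]
p_umbrella_given_rain = probability(has_umbrella, rainy_world)

print(f"P(umbrella)        = {p_umbrella:.2f}")
print(f"P(umbrella | rain) = {p_umbrella_given_rain:.2f}")
```

In this made-up population the conditional probability comes out higher than the unconditional one, because umbrellas are concentrated in the rainy cases.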

Three words: prior, likelihood, posterior

Bayesian thinking splits any updating problem into three named pieces. The prior is what you believed *before* the new evidence: a probability you assign to a hypothesis up front. The likelihood is how well that hypothesis explains the evidence you actually saw: if the hypothesis were true, how probable was this observation? The posterior is your *revised* belief after combining the two. These three together, prior, likelihood, and posterior, are the whole vocabulary of the method.

A useful mental image: the prior is your starting position, the likelihood is the tug that the evidence applies, and the posterior is where you end up after being tugged. Strong evidence (a very lopsided likelihood) pulls hard and can overturn a modest prior. Weak or ambiguous evidence barely moves you, so the posterior stays close to the prior. Unless a hypothesis is ruled out entirely, you end up with a *distribution* of belief, not a verdict carved in stone.

When this loop runs again and again — today's posterior becoming tomorrow's prior as more data arrives — you have the engine of Bayesian inference. It is a disciplined version of how careful people already reason: hold a tentative view, watch what happens, and shift in proportion to how surprising the evidence was.
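As a rough sketch of that loop, the snippet below updates a belief one observation at a time. The setup is an assumption chosen for illustration: a coin that is either fair or biased toward heads with P(heads) = 0.8, a 50/50 starting prior, and a made-up sequence of flips. The function name `update` is likewise hypothetical.

```python
def update(prior, likelihoods):
    """One Bayes step. Both arguments are dicts keyed by hypothesis."""
    unnormalized = {h: prior[h] * likelihoods[h] for h in prior}
    total = sum(unnormalized.values())  # overall chance of the observation
    return {h: value / total for h, value in unnormalized.items()}


# Hypothetical setup: the coin is either fair or biased toward heads.
p_heads = {"fair": 0.5, "biased": 0.8}
belief = {"fair": 0.5, "biased": 0.5}  # starting prior

for flip in "HHTHH":  # made-up observations
    likelihoods = {
        h: p_heads[h] if flip == "H" else 1 - p_heads[h] for h in p_heads
    }
    # Today's posterior becomes the prior for the next flip.
    belief = update(belief, likelihoods)
    print(flip, {h: round(p, 3) for h, p in belief.items()})
```

Each head nudges belief toward "biased" and the single tail nudges it back. Updating flip by flip ends at the same posterior as processing all five flips in one step.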

The formula, and a worked example

Here is Bayes' theorem itself, with the names attached so it never looks like a magic spell:

P(H | E) = P(E | H) * P(H) / P(E)

  P(H | E)  posterior  -> belief AFTER seeing evidence
  P(E | H)  likelihood -> how well H explains E
  P(H)      prior      -> belief BEFORE the evidence
  P(E)      evidence   -> total chance of seeing E at all

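Posterior = likelihood times prior, divided by the overall chance of the evidence. The denominator is a normalizer: it makes the posteriors of all the competing hypotheses add up to 1. With just two hypotheses, H and not-H, it expands to P(E) = P(E | H) * P(H) + P(E | not H) * P(not H), the total chance of seeing E under either one.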

Let's make it concrete with the classic medical-test puzzle, because the answer surprises almost everyone. A disease affects 1 in 1000 people. A test catches 99% of true cases (the likelihood of a positive result when you are sick), but it also has a 5% false-positive rate (it wrongly fires for 5% of healthy people). You test positive. What is the chance you actually have the disease?

Prior: before the test, P(sick) = 0.001, so P(healthy) = 0.999.
Likelihood of a positive if sick = 0.99; likelihood of a positive if healthy = 0.05.
Evidence (total chance of a positive) = 0.99*0.001 + 0.05*0.999 = 0.00099 + 0.04995 = 0.05094.
Posterior: P(sick | positive) = 0.00099 / 0.05094 ≈ 0.019 — under 2%.
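The same arithmetic can be checked in code. This is a minimal sketch for a yes/no hypothesis; the function name `posterior` and its parameter names are illustrative choices, not part of any library.

```python
def posterior(prior, p_evidence_if_true, p_evidence_if_false):
    """P(H | E) for a yes/no hypothesis H, via Bayes' theorem."""
    numerator = p_evidence_if_true * prior
    # Total chance of the evidence: it can arise with H true or with H false.
    evidence = numerator + p_evidence_if_false * (1 - prior)
    return numerator / evidence


p_sick_given_positive = posterior(
    prior=0.001,               # 1 in 1000 people are sick
    p_evidence_if_true=0.99,   # the test catches 99% of true cases
    p_evidence_if_false=0.05,  # 5% false-positive rate
)
print(f"P(sick | positive) = {p_sick_given_positive:.4f}")  # about 0.0194
```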

Read that again: a positive result on a test that catches *99%* of true cases still leaves you about 98% likely to be fine. The reason is the prior. The disease is so rare that false positives from the huge healthy crowd outnumber true positives by about 50 to 1 (0.04995 versus 0.00099). Bayes' theorem is what forces you to keep the base rate in view: ignore the prior and you will wildly overreact. (One positive is not nothing, though: your belief jumped from 0.1% to about 2%, a roughly twentyfold rise, which is exactly why a doctor orders a second, independent test.)
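One way to see where the false alarms come from is to count heads in an imagined crowd. The sketch below uses a population of 100,000, an arbitrary size picked so the counts come out as whole numbers. The final step, a second positive result, rests on a loudly stated assumption: the retest has the same error rates as the first, and its errors are independent of the first test's given the patient's true status.

```python
population = 100_000  # arbitrary illustrative crowd size

sick = population // 1000         # base rate of 1 in 1000: 100 people
healthy = population - sick       # 99,900 people

true_positives = sick * 0.99      # sick people who test positive
false_positives = healthy * 0.05  # healthy people who test positive

share_sick = true_positives / (true_positives + false_positives)
print(f"true positives:  {true_positives:.0f}")
print(f"false positives: {false_positives:.0f}")
print(f"share of positives who are sick: {share_sick:.3f}")

# Second positive, ASSUMING an independent retest with the same error rates.
# The posterior from the first test serves as the prior for the second.
prior = share_sick
after_second = 0.99 * prior / (0.99 * prior + 0.05 * (1 - prior))
print(f"after a second positive: {after_second:.3f}")
```

Under those assumptions a second positive lifts the belief from about 2% to roughly 28%: far more worrying, but still short of certainty.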

Maximum likelihood: the prior's quiet cousin

Bayes asks for a posterior, but doing that honestly means committing to a prior — and sometimes you would rather not. Maximum likelihood takes a leaner route. Instead of asking "what should I believe given the data," it asks "which setting of the unknowns would have made the data I saw most probable?" You pick the hypothesis that maximizes the likelihood, and you stop there. This is maximum likelihood estimation, and it is everywhere.

A tiny example: you flip a bent coin 10 times and get 7 heads. What is the coin's true probability of heads? Maximum likelihood answers 0.7 — the value under which "7 of 10" was most probable. It is exactly the answer your gut already gave. The point is that the *gut feeling* turns out to be a precise mathematical procedure, and that procedure generalizes to models with millions of unknowns where intuition fails.
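To see that 0.7 really is the maximizer, here is a brute-force sketch that scores every candidate value on a grid and keeps the best. The grid resolution and the function name `binomial_likelihood` are illustrative assumptions. Real estimators usually solve this analytically or with an optimizer rather than by grid search.

```python
from math import comb


def binomial_likelihood(p, heads=7, flips=10):
    """Probability of exactly `heads` heads in `flips` flips if P(heads) = p."""
    return comb(flips, heads) * p**heads * (1 - p) ** (flips - heads)


# Try every candidate value of p on a fine grid and keep the best one.
candidates = [i / 1000 for i in range(1001)]
p_mle = max(candidates, key=binomial_likelihood)

print("maximum-likelihood estimate:", p_mle)
print("likelihood at p = 0.7:", round(binomial_likelihood(0.7), 4))
print("likelihood at p = 0.5:", round(binomial_likelihood(0.5), 4))
```

A fair coin (p = 0.5) could also have produced 7 heads in 10 flips, only with a lower probability than a coin with p = 0.7.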

Here is the bridge back to Bayes that ties this rung together. Maximum likelihood is Bayes with the prior switched off: assume every hypothesis is equally plausible up front, and the posterior is proportional to the likelihood, so the most probable hypothesis under the posterior is exactly the maximum-likelihood one. The difference is that maximum likelihood reports only that single best value, not a distribution of belief. That is honest when you truly have no prior knowledge, and risky when you do: with very little data, ignoring a sensible prior lets a small fluke (like a nearly fair coin landing 7 heads in 10 by luck) be taken too literally.
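The effect of switching the prior on and off can be sketched on the same 7-heads-in-10 data. Both priors below are assumptions for illustration: `flat_prior` treats every value of p as equally plausible, and `near_fair_prior` is a hypothetical prior that peaks at p = 0.5. The code picks the single most probable value in each case, using a grid as a stand-in for proper optimization.

```python
def likelihood(p, heads=7, flips=10):
    """Likelihood of the observed flips, up to a constant factor."""
    return p**heads * (1 - p) ** (flips - heads)


def flat_prior(p):
    return 1.0  # every value of p equally plausible


def near_fair_prior(p):
    # Hypothetical prior favoring roughly fair coins (peaks at p = 0.5).
    return p**4 * (1 - p) ** 4


candidates = [i / 1000 for i in range(1001)]


def most_probable(prior):
    """Candidate with the highest unnormalized posterior: likelihood * prior."""
    return max(candidates, key=lambda p: likelihood(p) * prior(p))


p_flat = most_probable(flat_prior)
p_near_fair = most_probable(near_fair_prior)
print("flat prior:     ", p_flat)
print("near-fair prior:", p_near_fair)
```

With the flat prior the answer matches the maximum-likelihood estimate of 0.7. With the near-fair prior the estimate is pulled back toward 0.5, which is the prior damping a possible fluke in a small sample.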

Where this shows up in machine learning

This is not a detour; it is the spine of how models learn. When you train a classifier or regressor, the loss function you minimize is very often just the negative log-likelihood in disguise: cross-entropy loss is the negative log-likelihood of the labels, and squared error is the negative log-likelihood under an assumption of Gaussian noise (up to constants). "Find parameters that make the training data most probable" and "find parameters that minimize the loss" are frequently the *same sentence* written two ways. So the maximum-likelihood idea you just met quietly powers a huge share of training.
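A small numeric check makes the "same sentence" claim tangible for binary classification. The labels and predicted probabilities below are invented for illustration. The sketch computes the likelihood of the labels directly, then computes the binary cross-entropy loss (summed over examples), and compares the two.

```python
from math import isclose, log, prod

# Hypothetical labels (1 = positive class) and a model's predicted P(label = 1).
labels = [1, 0, 1, 1, 0]
predicted = [0.9, 0.2, 0.7, 0.6, 0.4]

# Likelihood: the probability the model assigns to the labels actually observed.
per_example = [p if y == 1 else 1 - p for y, p in zip(labels, predicted)]
likelihood = prod(per_example)
negative_log_likelihood = -log(likelihood)

# Binary cross-entropy loss, summed over the examples.
cross_entropy = -sum(
    y * log(p) + (1 - y) * log(1 - p) for y, p in zip(labels, predicted)
)

print(negative_log_likelihood, cross_entropy)
print(isclose(negative_log_likelihood, cross_entropy))
```

The two numbers agree up to floating-point rounding, so lowering this loss and raising the likelihood of the training labels are the same goal.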

Bayes' theorem also stars directly. The cheerful little classifier called naive Bayes applies the formula straight to text and works surprisingly well for spam filtering given how simple it is: it treats the words as independent given the class (the "naive" part), multiplies their likelihoods, and lets a prior over spam-vs-not tip the balance. And the spam filter, the medical test, and a fraud detector all share the same trap: when the thing you are hunting for is rare, even an accurate detector drowns in false alarms unless you respect the base rate.
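Here is a deliberately tiny sketch of the naive Bayes idea. Every number in it is hypothetical: the prior `p_spam`, the three-word vocabulary, and the per-word likelihoods are invented for illustration, whereas a real filter would estimate them from labeled mail. It is also simplified in that it only scores the words present in the message.

```python
from math import exp, log

# Hypothetical numbers, invented for illustration only.
p_spam = 0.4  # prior P(spam)
word_given_spam = {"free": 0.30, "meeting": 0.02, "winner": 0.20}
word_given_ham = {"free": 0.03, "meeting": 0.25, "winner": 0.01}


def spam_probability(words):
    """Posterior P(spam | words), treating words as independent given the class."""
    # Work in log space so multiplying many small numbers does not underflow.
    log_spam = log(p_spam)
    log_ham = log(1 - p_spam)
    for word in words:
        if word in word_given_spam:  # skip words outside the toy vocabulary
            log_spam += log(word_given_spam[word])
            log_ham += log(word_given_ham[word])
    # Normalize the two scores into a probability.
    return 1 / (1 + exp(log_ham - log_spam))


print(round(spam_probability(["free", "winner"]), 3))
print(round(spam_probability(["meeting"]), 3))
print(round(spam_probability([]), 3))  # no evidence: falls back to the prior
```

With no recognized words the function returns the prior unchanged, and each word then pushes the belief toward whichever class explains it better.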

One honest caveat to carry forward. A posterior is only as trustworthy as its inputs: a biased prior or a mis-specified likelihood gives a confidently wrong answer, dressed up in the authority of a number. Bayes does not manufacture certainty out of thin air — it bookkeeps the certainty you fed in. The modern interest in uncertainty quantification exists because real systems must say not just *what* they predict but *how sure* they are, and a probability that has not been checked against reality can be dangerously overconfident.
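To make the "only as trustworthy as its inputs" point concrete, the sketch below reruns the medical-test calculation with different priors. The 0.001 base rate and the test's error rates come from the worked example. The other two priors are hypothetical "what if the base rate were misjudged" values, and the function and parameter names are illustrative.

```python
def posterior(prior, sensitivity=0.99, false_positive_rate=0.05):
    """P(sick | positive) for a given prior P(sick)."""
    numerator = sensitivity * prior
    return numerator / (numerator + false_positive_rate * (1 - prior))


# 0.001 is the base rate from the worked example. The other priors are
# hypothetical values showing what a misjudged base rate would do.
results = {prior: posterior(prior) for prior in (0.001, 0.01, 0.1)}
for prior, value in results.items():
    print(f"prior {prior:<5} -> posterior {value:.3f}")
```

The test and the positive result are identical in all three runs, yet the conclusion ranges from about 2% to about 69%. The formula passes along whatever prior it was given.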