From a single formula to a way of thinking
You already met Bayes’ theorem as a line of algebra: flip a conditional probability around. Here we take it seriously as the engine of learning. The whole of Bayesian inference rests on one move — start with what you believed before seeing data, weigh it against how well each possibility explains the data you actually got, and end with what you should believe after. That before-and-after is the heart of reasoning under uncertainty.
Three nouns carry the whole story, so let us name them once and keep them. The prior is your belief about an unknown quantity before this batch of data. The likelihood is how probable the observed data is, for each candidate value of that unknown. The posterior is the updated belief that results. Bayes’ rule simply says: posterior is proportional to likelihood times prior.
A worked example: is this coin fair?
Take a concrete unknown: a coin's bias, the long-run probability θ of heads. Before flipping, what do you believe about θ? A reasonable prior might be "probably near 0.5, but I'm not certain" — a gentle hump centered on a half. That belief is itself a probability distribution over θ, not a single number, because you are honestly uncertain.
Now you flip ten times and see 7 heads, 3 tails. The likelihood asks: for each possible θ, how probable was *exactly this outcome*? A coin with θ near 0.7 makes 7-out-of-10 quite likely; a coin with θ near 0.1 makes it almost impossible. Multiply that likelihood curve by your prior hump, renormalize so it sums to one (strictly, integrates to one, since θ is continuous), and you have the posterior: your sharpened belief about θ after the data.
posterior(θ) ∝ likelihood(data | θ) × prior(θ)
prior: broad hump around 0.5
likelihood: peaks near 0.7 (because 7/10 heads)
posterior: a hump pulled toward 0.7, narrower than the prior (sharper, not certain)
Notice what just happened. The posterior did not jump all the way to 0.7. It landed somewhere between your prior's 0.5 and the data's 0.7, because ten flips is weak evidence. Flip a thousand times and the data would overwhelm the prior, dragging the posterior almost entirely onto whatever the coin truly is. More data shrinks the prior's influence, and that is not a bug; it is the whole point.
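The multiply-and-renormalize move can be sketched on a grid of candidate θ values. This is a minimal illustration, not a production method: the Beta(2, 2)-shaped hump used as the prior, the grid size, and the function names `grid_posterior` and `posterior_mean` are all assumptions made for the example.

```python
import math

def grid_posterior(heads, tails, prior_a=2.0, prior_b=2.0, n_grid=1000):
    """Approximate the posterior over theta on a grid (prior shape is an assumption)."""
    # Midpoints of n_grid equal bins on (0, 1); this avoids log(0) at the endpoints.
    thetas = [(i + 0.5) / n_grid for i in range(n_grid)]
    log_unnorm = []
    for t in thetas:
        # Gentle hump centered on 0.5 (unnormalized Beta(prior_a, prior_b) shape).
        log_prior = (prior_a - 1) * math.log(t) + (prior_b - 1) * math.log(1 - t)
        # Probability of this exact sequence of flips, given theta.
        log_lik = heads * math.log(t) + tails * math.log(1 - t)
        log_unnorm.append(log_prior + log_lik)
    # Work in log space and subtract the peak so large counts do not underflow.
    peak = max(log_unnorm)
    unnorm = [math.exp(v - peak) for v in log_unnorm]
    total = sum(unnorm)
    posterior = [u / total for u in unnorm]  # grid weights now sum to one
    return thetas, posterior

def posterior_mean(thetas, posterior):
    return sum(t * w for t, w in zip(thetas, posterior))

thetas, post = grid_posterior(heads=7, tails=3)
print(f"{posterior_mean(thetas, post):.3f}")    # about 0.643: between 0.5 and 0.7

thetas, post = grid_posterior(heads=700, tails=300)
print(f"{posterior_mean(thetas, post):.3f}")    # about 0.699: the data dominates
```

With ten flips the posterior mean sits between the prior's center and the observed frequency; with a thousand flips at the same ratio it is almost indistinguishable from 0.7.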
Conjugate priors: when the math stays tidy
There is a practical headache lurking in "renormalize so it sums to one." For most priors that normalizing constant is a hard integral with no closed-form answer, which is why later guides reach for sampling and approximate inference. But for certain lucky pairings the algebra closes up neatly. These are conjugate priors: a prior whose shape is *preserved* by the likelihood, so the posterior comes out as the same family of distribution, just with updated numbers.
Our coin is the classic case. A Beta distribution is the conjugate prior for coin-flip (Bernoulli) data. Write the prior as Beta(a, b), where you can read a and b as "imagined heads" and "imagined tails" you pretend to have seen before any real flips. After observing h heads and t tails, the posterior is simply Beta(a + h, b + t). No integral — you just add your counts to the prior's counts.
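To see why the counts simply add, it helps to write out the product with normalizing constants dropped. The derivation below is a sketch using the same symbols as the text (a, b for the prior; h, t for the observed heads and tails); the underbrace labels are added for illustration.

```latex
\begin{aligned}
p(\theta \mid \text{data})
  &\propto \underbrace{\theta^{h}(1-\theta)^{t}}_{\text{likelihood}}
     \times \underbrace{\theta^{a-1}(1-\theta)^{b-1}}_{\text{Beta}(a,\,b)\ \text{prior}} \\
  &= \theta^{(a+h)-1}\,(1-\theta)^{(b+t)-1}
\end{aligned}
```

The last line has exactly the shape of a Beta(a + h, b + t) density, so the normalizing constant is already known and no new integral is needed.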
prior = Beta(a, b)                # a heads, b tails imagined
data = h heads, t tails           # actually observed
posterior = Beta(a + h, b + t)    # just add the counts!
# Beta(2,2) + 7 heads, 3 tails -> Beta(9, 5)
# posterior mean = 9 / (9+5) ≈ 0.64
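As runnable code, the conjugate update is little more than two additions. The sketch below represents a Beta belief as a plain pair of numbers; the helper names `beta_update` and `beta_mean` are hypothetical, chosen for this example.

```python
def beta_update(a, b, heads, tails):
    """Conjugate update: Beta(a, b) prior plus coin flips gives Beta(a + heads, b + tails)."""
    return a + heads, b + tails

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# Beta(2, 2) prior, then 7 heads and 3 tails observed.
post_a, post_b = beta_update(2, 2, heads=7, tails=3)
print(post_a, post_b)                          # 9 5
print(round(beta_mean(post_a, post_b), 2))     # 0.64
```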
MAP: collapsing a belief into one answer
A full posterior is a distribution, but sometimes you must hand over a single best guess — to display a number, or to plug into the next stage of a pipeline. The natural choice is the peak of the posterior: the value the data and your prior, together, find most plausible. That is maximum a posteriori estimation, MAP for short.
MAP has a beautiful relationship to something you already know well: maximum likelihood estimation, MLE. MLE picks the value that makes the data most probable — it listens *only* to the data. MAP picks the value that maximizes likelihood times prior — it listens to the data *and* your prior knowledge. So MAP is MLE plus a prior; and as data grows, the prior's voice fades and the two estimates converge.
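For the coin, both estimates have closed forms, so the comparison can be checked directly. This is a sketch under the Beta-prior setup from the coin example; the function names are hypothetical, and the MAP formula used is the mode of a Beta distribution, which only applies when both posterior parameters exceed 1.

```python
def mle(heads, tails):
    """Maximum likelihood estimate: the observed frequency of heads."""
    return heads / (heads + tails)

def beta_map(a, b, heads, tails):
    """MAP estimate: the mode of the Beta(a + heads, b + tails) posterior."""
    post_a, post_b = a + heads, b + tails
    if post_a <= 1 or post_b <= 1:
        raise ValueError("the mode formula needs both posterior parameters > 1")
    return (post_a - 1) / (post_a + post_b - 2)

# Ten flips: the Beta(2, 2) prior still pulls the MAP estimate toward 0.5.
print(mle(7, 3), round(beta_map(2, 2, 7, 3), 3))          # 0.7 0.667

# A thousand flips at the same ratio: the two estimates nearly coincide.
print(mle(700, 300), round(beta_map(2, 2, 700, 300), 4))  # 0.7 0.6996
```

With a flat Beta(1, 1) prior the two formulas return the same number, which is one way to see MLE as MAP with no prior preference.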
This connection is not just elegant trivia; it quietly explains a tool from earlier rungs. Adding an L2 weight penalty to a model (ridge regularization) is *exactly* MAP estimation with a zero-mean Gaussian prior on the weights, the prior saying "weights are probably small." An L1 penalty (lasso) is MAP with a Laplace prior, which favors sparse weights. Regularization, which you met as a trick against overfitting, turns out to be a Bayesian prior wearing a disguise.
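The ridge case can be written out in a few lines. The derivation below is a sketch under stated assumptions that do not appear in the text: a linear model with Gaussian noise of variance σ², and an independent zero-mean Gaussian prior of variance τ² on each weight. Constants that do not depend on w are dropped.

```latex
\begin{aligned}
\hat{w}_{\text{MAP}}
  &= \arg\max_{w}\; \Big[ \log p(y \mid X, w) + \log p(w) \Big] \\
  &= \arg\max_{w}\; \Big[ -\frac{1}{2\sigma^{2}} \sum_{i} \big(y_i - w^{\top} x_i\big)^{2}
       - \frac{1}{2\tau^{2}} \lVert w \rVert_2^{2} \Big] \\
  &= \arg\min_{w}\; \Big[ \sum_{i} \big(y_i - w^{\top} x_i\big)^{2}
       + \lambda \lVert w \rVert_2^{2} \Big],
  \qquad \lambda = \frac{\sigma^{2}}{\tau^{2}}
\end{aligned}
```

Under these assumptions the penalty strength λ is the ratio of noise variance to prior variance: a tighter prior (small τ²) means a heavier penalty. Swapping the Gaussian prior for a Laplace prior turns the squared norm into a sum of absolute values, which is the L1 penalty.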
Updating as a habit, and what it buys you
The deepest practical virtue of Bayes is that updating is *recursive*: today's posterior becomes tomorrow's prior. You never need to relive the whole history of data. Believe Beta(9, 5) after the first ten flips, then see four more heads — update to Beta(13, 5). This is how a system can learn continuously from a stream, folding each new observation into a single running belief.
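The recursion can be checked in a few lines: updating in two batches lands on the same belief as updating once with all the data. This is a minimal sketch; the `update` helper and the particular ordering of the first ten flips are made up for illustration (only the 7 heads, 3 tails totals come from the text).

```python
def update(belief, flips):
    """Fold a batch of flips (1 = heads, 0 = tails) into a Beta(a, b) belief."""
    a, b = belief
    heads = sum(flips)
    tails = len(flips) - heads
    return a + heads, b + tails

prior = (2, 2)
first_ten = [1, 1, 1, 0, 1, 1, 0, 1, 0, 1]   # 7 heads, 3 tails (ordering is illustrative)
four_more = [1, 1, 1, 1]                      # four more heads

belief = update(prior, first_ten)
print(belief)                                 # (9, 5)
belief = update(belief, four_more)            # yesterday's posterior is today's prior
print(belief)                                 # (13, 5)

# Same answer as processing the whole history at once.
print(update(prior, first_ten + four_more))   # (13, 5)
```

Only the two running counts need to be stored; the raw flips can be discarded after each update.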
1. State your prior. Write down what you believe about the unknown before this data, and be honest about how uncertain you are. A vague prior is a wide distribution.
2. Write the likelihood. Choose a model that says how probable the observed data is for each possible value of the unknown.
3. Multiply and normalize. Combine prior and likelihood, then rescale to get the posterior: exactly, if conjugate; approximately, otherwise.
4. Summarize honestly. Report a credible interval, not just a single point, so the uncertainty travels with the answer.
5. Loop. When new data arrives, today's posterior is tomorrow's prior. Go back to step two.
That fourth step is the real prize. Because the posterior is a full distribution, a Bayesian model can report a credible interval instead of a bare point estimate. For the Beta(9, 5) posterior from our ten flips, the equal-tailed version reads: "I'm 95% sure θ is between about 0.39 and 0.86." After seeing only ten flips that interval is wide, which is the model telling you the truth: it does not yet know much. This honest accounting of doubt is what the rest of this rung builds on.
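A credible interval can be read off the posterior by accumulating probability mass until the tails are cut away. The sketch below uses only the standard library and a simple midpoint-rule grid; the function name, the grid size, and the choice of an equal-tailed interval (2.5% in each tail) are assumptions. The endpoints depend on the prior and on the interval convention, so other setups will give different numbers.

```python
import math

def beta_credible_interval(a, b, mass=0.95, n_grid=100_000):
    """Equal-tailed credible interval for Beta(a, b), from a grid-based CDF."""
    tail = (1 - mass) / 2
    # Bin midpoints on (0, 1); this avoids log(0) at the endpoints.
    thetas = [(i + 0.5) / n_grid for i in range(n_grid)]
    log_dens = [(a - 1) * math.log(t) + (b - 1) * math.log(1 - t) for t in thetas]
    peak = max(log_dens)
    dens = [math.exp(v - peak) for v in log_dens]   # unnormalized density
    total = sum(dens)

    lower = upper = None
    running = 0.0
    for t, d in zip(thetas, dens):
        running += d / total                        # cumulative posterior mass
        if lower is None and running >= tail:
            lower = t
        if running >= 1 - tail:
            upper = t
            break
    return lower, upper

lo, hi = beta_credible_interval(9, 5)        # posterior after 7 heads, 3 tails
print(round(lo, 2), round(hi, 2))            # 0.39 0.86

lo, hi = beta_credible_interval(702, 302)    # same Beta(2, 2) prior, 700 heads, 300 tails
print(round(hi - lo, 2))                     # much narrower interval
```

The width of the interval is the useful part: it shrinks as flips accumulate, without any change to the code.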
Honest cautions before you go
Bayes is principled, not magic. Two cautions keep it honest. First, the prior is a genuine modeling choice, and a confident wrong prior with too little data can mislead you — Bayes does not invent information you did not put in. Second, "95% credible" is only as trustworthy as your model; if the likelihood is misspecified, the posterior is precisely confident about the wrong thing.
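The first caution is easy to see with numbers. The scenario below is hypothetical: a coin that lands heads about 90% of the time, observed by one analyst with a vague Beta(1, 1) prior and another with a confident but wrong Beta(100, 100) prior. The helper name is made up for this sketch.

```python
def posterior_mean(a, b, heads, tails):
    """Mean of the Beta(a + heads, b + tails) posterior."""
    return (a + heads) / (a + b + heads + tails)

# Ten flips, 9 heads: the confident prior barely moves off 0.5.
print(round(posterior_mean(1, 1, 9, 1), 3))          # 0.833
print(round(posterior_mean(100, 100, 9, 1), 3))      # 0.519

# A thousand flips, 900 heads: the data starts to win, but the stubborn prior still lags.
print(round(posterior_mean(1, 1, 900, 100), 3))      # 0.899
print(round(posterior_mean(100, 100, 900, 100), 3))  # 0.833
```

A prior worth 200 imagined flips takes far more than ten real ones to overturn, which is why stating the prior's strength honestly matters.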
Still, the framework you now hold is genuinely powerful. Prior, likelihood, posterior; conjugacy for clean updates; MAP as the bridge to the maximum likelihood and regularization you already know. From here the rung opens up: when the integrals get hard, approximate methods take over, and when many unknowns interact, graphical models give Bayes a structure to climb. You have the grammar; the rest of the rung is vocabulary.