Why bolt uncertainty onto a model at all?
Up the ladder so far, a model has mostly been a function that eats an input and spits out one answer — a class, a number, the next token. That works astonishingly well, but it hides something. Ask a classifier to label a smudged photo and it will still announce "cat, 0.97" with the same swagger it uses on a crisp one. The number looks like confidence, yet it is often just a normalized score, not an honest statement about how likely the answer really is.
The probabilistic view starts from a humbler premise: the world is noisy, your data is partial, and any belief you hold should come with a degree of doubt attached. Instead of predicting a single value, you predict a probability distribution, a whole landscape of possibilities with more weight piled on the plausible ones. The quantity you are predicting is treated as a random variable: something whose value you are uncertain about, so you describe it by how much probability each possible value deserves rather than by one best guess.
The likelihood: how well does a model explain the data?
Here is the hinge the whole field swings on. A probabilistic model, with its knobs set to some particular parameters, can be asked a simple question: how probable is the data I actually observed, under you? That number is the likelihood. Flip an unknown coin ten times, see seven heads — a model that claims the coin is 70% heads makes that exact outcome quite probable, while a model claiming 10% heads makes it almost absurd. The data does not change; what changes is how well each candidate model accounts for it.
So a natural way to learn is to hunt for the parameters that make your observed data as probable as possible. That is maximum likelihood estimation, and it is quietly the engine behind a huge swath of machine learning. The famous mean-squared error you minimize in linear regression is, up to a constant scale and offset, the negative log-likelihood under an assumption of Gaussian noise with fixed variance; the cross-entropy loss of a classifier is the negative log-likelihood of the true labels. Minimizing a loss and maximizing a likelihood are, again and again, the same act seen from two angles.
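The link between squared error and Gaussian likelihood takes only two lines to write down. This is a sketch under stated assumptions: the symbols $f_\theta$ (the model), $y_i$ (the targets), and a fixed noise variance $\sigma^2$ are introduced here for illustration and do not appear in the text above.

```latex
y_i = f_\theta(x_i) + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2)

-\log p(y_{1:n} \mid x_{1:n}, \theta)
  = \frac{1}{2\sigma^2} \sum_{i=1}^{n} \bigl( y_i - f_\theta(x_i) \bigr)^2
  + \frac{n}{2} \log\bigl( 2\pi\sigma^2 \bigr)
```

With $\sigma^2$ held fixed, the second term does not depend on $\theta$, and the first is the mean-squared error multiplied by $n / (2\sigma^2)$. Whatever $\theta$ minimizes one therefore minimizes the other.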
data:  H H T H H T H H H T   (7 heads / 10)

  model p=0.7    likelihood HIGH     <- explains the data well
  model p=0.5    likelihood medium
  model p=0.1    likelihood ~0       <- this data would be a shock

learning = turn the knob p to push likelihood as high as it goes
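To make the coin example concrete, here is a minimal sketch that scores the same ten flips under the three candidate models and then searches a grid of values for the best knob setting. The function name `likelihood` and the grid search are illustrative choices, not a prescribed method.

```python
# Ten flips from the example above: 7 heads, 3 tails.
flips = "HHTHHTHHHT"
heads = flips.count("H")
tails = flips.count("T")


def likelihood(p, heads, tails):
    """Probability of this exact sequence of flips if P(heads) = p."""
    return p ** heads * (1 - p) ** tails


# Score the observed data under each candidate model.
for p in (0.7, 0.5, 0.1):
    print(f"p={p}: likelihood={likelihood(p, heads, tails):.2e}")

# Maximum likelihood by brute force: try p = 0.00, 0.01, ..., 1.00
# and keep whichever value makes the observed data most probable.
grid = [i / 100 for i in range(101)]
p_hat = max(grid, key=lambda p: likelihood(p, heads, tails))
print("best p on the grid:", p_hat)
```

The three likelihoods come out at roughly 2.2e-3, 9.8e-4, and 7.3e-8, and the grid search lands on the observed fraction of heads, 7 out of 10. Every likelihood is small in absolute terms because any specific sequence of ten flips is unlikely; what matters is how the models compare.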
Two ways to build a model: generative vs discriminative
Once you think in distributions, a fork in the road appears. Suppose you want to tell cats from dogs given a photo. A discriminative model learns only the boundary — the conditional probability of the label given the image, P(label | image). It never bothers to understand what a cat looks like; it just carves the input space into regions. Logistic regression and most neural-net classifiers live here. They are usually the sharpest tool when all you want is the answer.
A generative model is more ambitious. It tries to learn how the data itself is produced — the joint distribution P(image, label), or equivalently a recipe for generating plausible cat-images and dog-images. Naive Bayes is the classic small example: it models how features arise within each class, then uses Bayes' theorem to turn that around into a prediction. Because it has learned to generate, it can also dream up new samples, flag an input that looks like neither class, and cope when labels are scarce.
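A toy version of naive Bayes shows the generative recipe in miniature. This is a sketch, not a production classifier: the three binary features and the six training rows are made-up data, and the helper names (`fit`, `log_joint`, `posterior`, `sample`) are hypothetical.

```python
import math
import random

# Hypothetical toy data: (pointy_ears, barks, retractable_claws) plus a label.
data = [
    ((1, 0, 1), "cat"),
    ((1, 0, 1), "cat"),
    ((0, 0, 1), "cat"),
    ((1, 1, 0), "dog"),
    ((0, 1, 0), "dog"),
    ((0, 1, 0), "dog"),
]


def fit(data, smoothing=1.0):
    """Estimate P(label) and P(feature_j = 1 | label) with add-one smoothing."""
    labels = sorted({label for _, label in data})
    n_features = len(data[0][0])
    prior, cond = {}, {}
    for label in labels:
        rows = [x for x, y in data if y == label]
        prior[label] = len(rows) / len(data)
        cond[label] = [
            (sum(x[j] for x in rows) + smoothing) / (len(rows) + 2 * smoothing)
            for j in range(n_features)
        ]
    return prior, cond


def log_joint(x, label, prior, cond):
    """log P(x, label) = log P(label) + sum over j of log P(x_j | label)."""
    total = math.log(prior[label])
    for xj, pj in zip(x, cond[label]):
        total += math.log(pj if xj == 1 else 1 - pj)
    return total


def posterior(x, prior, cond):
    """Bayes' theorem: P(label | x) = P(x, label) / sum over labels of P(x, label)."""
    joint = {label: math.exp(log_joint(x, label, prior, cond)) for label in prior}
    evidence = sum(joint.values())
    return {label: value / evidence for label, value in joint.items()}


def sample(label, cond, rng):
    """Generate a new feature vector for a class from the learned model."""
    return tuple(1 if rng.random() < pj else 0 for pj in cond[label])


prior, cond = fit(data)
print(posterior((1, 0, 1), prior, cond))          # prediction via Bayes' theorem
print(sample("dog", cond, random.Random(0)))      # a freshly generated "dog"
```

The same fitted quantities serve both jobs: `posterior` turns the joint distribution into a prediction, while `sample` runs the recipe forward to produce a new example, something a purely discriminative model has no way to do.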
Neither is simply better — this is a genuine trade-off, not a hierarchy. Modeling the full joint distribution is harder and asks more of your data, so a discriminative model often wins on raw accuracy when labels are plentiful. But the generative framing is what powers the great wave of recent systems: the generative adversarial networks and the diffusion image models that conjure photos from noise are, at heart, models of P(data). And a large language model is squarely generative — it learns the distribution of the next token and samples from it.
From likelihood to belief: the door to Bayes
Maximum likelihood has a blind spot: it listens only to the data and brings no opinion of its own. Flip a coin three times, get three heads, and pure maximum likelihood will declare the coin 100% heads — a wild overreach from a tiny sample, a cousin of the overfitting you already know to fear. Common sense says you should have walked in suspecting the coin was roughly fair, and let three flips nudge that belief only a little.
That is the move Bayesian inference makes formal. You start with a prior (your belief before seeing data), multiply it by the likelihood of the data, rescale so the result sums to one, and out comes a posterior: your updated belief. The trio of prior, likelihood, and posterior is the grammar of the next several guides. Notice that the prior plays the same role as regularization does in the models you already use: it gently pulls extreme conclusions back toward something reasonable.
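The update can be carried out by hand on a grid of candidate coin biases. This is a minimal sketch: the particular prior shape, p^4 * (1 - p)^4, is an assumption chosen here to encode "roughly fair" and is not specified in the text.

```python
# Candidate values for the coin's heads probability.
grid = [i / 100 for i in range(101)]

# Prior: an assumed belief that the coin is roughly fair, peaked at 0.5.
# The shape p^4 * (1 - p)^4 is an illustrative choice.
prior = [p ** 4 * (1 - p) ** 4 for p in grid]

# Likelihood of seeing three heads in three flips for each candidate p.
likelihood = [p ** 3 for p in grid]

# Posterior is proportional to prior times likelihood;
# dividing by the total rescales it so the weights sum to one.
unnormalized = [pr * lk for pr, lk in zip(prior, likelihood)]
total = sum(unnormalized)
posterior = [u / total for u in unnormalized]

mle = max(grid, key=lambda p: p ** 3)
posterior_mean = sum(p * w for p, w in zip(grid, posterior))
print("maximum likelihood estimate:", mle)
print("posterior mean:", round(posterior_mean, 3))
```

Maximum likelihood jumps straight to 1.0, while the posterior mean moves only a little above one half. A stronger prior (a higher exponent) would pull it closer to 0.5, and a weaker one would let the three flips count for more.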
And the payoff is exactly the honesty we started out wanting. A Bayesian model does not return one number; it returns a whole posterior distribution, so it can say not just "about 0.6" but "0.6, give or take a lot — I have only seen three flips." That width is genuine uncertainty quantification: the difference between a model that is confidently wrong and one that knows it is guessing.
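The width of the posterior can be read off directly when the prior is a Beta distribution, because the update is then just addition. This is a sketch under an assumed Beta(5, 5) prior; the counts for the larger experiment (180 heads in 300 flips) are hypothetical numbers picked for contrast.

```python
import math


def beta_summary(alpha, beta):
    """Mean and standard deviation of a Beta(alpha, beta) distribution."""
    mean = alpha / (alpha + beta)
    var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1))
    return mean, math.sqrt(var)


# Assumed prior: Beta(5, 5), a mild belief that the coin is roughly fair.
prior_a, prior_b = 5, 5

# Conjugate update: add observed heads to alpha and observed tails to beta.
few = beta_summary(prior_a + 3, prior_b + 0)        # 3 heads in 3 flips
many = beta_summary(prior_a + 180, prior_b + 120)   # 180 heads in 300 flips

print("after 3 flips:   mean=%.2f, std=%.2f" % few)
print("after 300 flips: mean=%.2f, std=%.2f" % many)
```

Both posteriors are centered near 0.6, but the one built on three flips is several times wider. The point estimate alone would hide that difference; the spread is what carries it.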
Honest limits: what probabilities do and don't buy you
It is tempting to read every probability a model emits as a calibrated truth, but be careful. A modern deep network's softmax output is a probability distribution in shape only: these models are notoriously overconfident, happily printing 0.99 on inputs unlike anything they have seen. A number near one means the model committed hard, not that it will be wrong only one time in a hundred. Probabilities are trustworthy only when a model has been checked for calibration, a topic the evaluation rung returns to.
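One common way to check calibration is to bucket predictions by stated confidence and compare each bucket's average confidence with how often it was actually right. The sketch below computes an expected calibration error this way; the function name, the bin count, and the eight predictions are all hypothetical, and real evaluations use far more data.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))

    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(ok for _, ok in members) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


# Hypothetical model outputs: top-class confidence and whether it was right (1) or not (0).
confidences = [0.99, 0.99, 0.99, 0.99, 0.95, 0.95, 0.60, 0.60]
correct = [1, 1, 0, 0, 1, 0, 1, 0]
print(round(expected_calibration_error(confidences, correct), 3))
```

A well-calibrated model scores near zero. Here the model claims about 98% confidence on predictions that are right only half the time, so the gap is large.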
There is a second, subtler trap. A probability is always conditional on the model's assumptions, and a model can only be uncertain in ways it was built to imagine. Quote it a sentence in a language it never trained on, and it will not say "I have no idea" — it will confidently produce nonsense, because the question falls outside the world its distribution covers. This is why hallucination in language models is not a bug to be patched away: a generative model samples fluent, probable-looking text whether or not that text is true, since truth was never part of what it modeled.