A Gentle Tour of Probability

Why machine learning speaks the language of chance

In the vectors guide you learned to think of data as arrows in space, crisp and exact. Real data is never that clean. Two photos of the same cat differ; the same survey asked twice gives different answers; a sensor jitters. Probability is the branch of math built precisely for this: a careful way to reason when you cannot be certain, and to say *how* uncertain you are with a number between 0 and 1.

This is why nearly all of machine learning is, under the hood, probabilistic. A classifier does not declare "this is a cat." It outputs something closer to "85% cat, 12% dog, 3% other." A spam filter weighs evidence. A language model picks the next word by sampling from a list of probabilities, one for each candidate word. The model is always making its best bet under uncertainty, and probability is the bookkeeping that keeps those bets honest.

Random variables: putting numbers on uncertainty

A [[random-variable|random variable]] is just a quantity whose value isn't fixed in advance because it comes from some uncertain process. Roll a die: the result X is a random variable that can be 1 through 6. Measure tomorrow's temperature: that's a random variable too. The word "variable" is a little misleading — it doesn't vary on its own; it takes whatever value the underlying chance hands it, and we study the *pattern* of those values.

Random variables come in two flavors. Discrete ones take separate, countable values: a die (1–6), the number of emails today, which of three classes a model picks. Continuous ones can take any value in a range: a height, a temperature, a weight inside a neural network. The distinction matters because it changes how we describe their behavior, which is the next idea: the distribution.

Here is the connection to the rest of the ladder: every model input is best thought of as a sample from some random process, and every model output is a random variable too. When you hear that a dataset is "drawn from a distribution," it means each row is one roll of a very complicated die that the real world keeps tossing.
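To make "one roll of the die" concrete, here is a minimal sketch using Python's standard `random` module. The function names `roll_die` and `read_sensor`, the seed, and the sensor's true value of 20.0 with a jitter of 0.5 are all illustrative assumptions, not anything fixed by the text above.

```python
import random

# Fixed seed so the run is repeatable (an illustrative choice).
rng = random.Random(0)

def roll_die():
    """One draw of a discrete random variable: a fair six-sided die."""
    return rng.randint(1, 6)

def read_sensor(true_value=20.0, jitter=0.5):
    """One draw of a continuous random variable: a reading with Gaussian jitter."""
    return rng.gauss(true_value, jitter)

rolls = [roll_die() for _ in range(10)]
readings = [read_sensor() for _ in range(5)]

print(rolls)     # ten integers, each between 1 and 6
print(readings)  # five floats scattered around 20.0
```

Each call hands back a different value, yet the pattern across many calls is stable. That pattern is what the next section describes.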

Distributions: the shape of what's likely

A [[probability-distribution|probability distribution]] is the full recipe for a random variable: it tells you, for every possible value, how much probability lands there. For a fair die the distribution is flat — each face gets 1/6. For a loaded die it might pile probability onto 6. The one ironclad rule: all the probabilities must add up to exactly 1, because *something* always happens.
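A discrete distribution can be sketched as a plain mapping from each value to its probability. The loaded die's numbers below are made up for illustration, and `is_valid_distribution` is a hypothetical helper name; exact fractions are used so the "adds up to exactly 1" check is not blurred by floating-point rounding.

```python
from fractions import Fraction

# Fair die: every face gets 1/6.
fair_die = {face: Fraction(1, 6) for face in range(1, 7)}

# Loaded die (illustrative numbers): half the probability piled onto 6.
loaded_die = {face: Fraction(1, 10) for face in range(1, 6)}
loaded_die[6] = Fraction(1, 2)

def is_valid_distribution(dist):
    """No negative probabilities, and the total is exactly 1."""
    return all(p >= 0 for p in dist.values()) and sum(dist.values()) == 1

print(is_valid_distribution(fair_die))    # True
print(is_valid_distribution(loaded_die))  # True
```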

For continuous variables we draw a smooth curve instead of bars, and here intuition needs one gentle correction: the height of the curve is *not* a probability. It is a probability *density*, and it can even rise above 1. The probability lives in the area under the curve between two points. So for a continuous variable the chance of any single exact value (a height of precisely 170.000... cm) is zero, because a single point has no width and therefore no area. We can only ask about ranges, like "between 165 and 175." The total area under the whole curve is, again, 1.
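The area-versus-height distinction can be checked with `statistics.NormalDist` from the Python standard library. This sketch assumes, purely for illustration, that heights follow a bell curve centered at 170 cm with a spread of 10 cm; the area between two points is computed as a difference of cumulative probabilities.

```python
from statistics import NormalDist

# Assumed for illustration: heights centered at 170 cm with a spread of 10 cm.
heights = NormalDist(mu=170, sigma=10)

# Height of the curve at 170: a density, not a probability.
density_at_170 = heights.pdf(170)

# Probability of a range = area under the curve between its endpoints.
p_165_to_175 = heights.cdf(175) - heights.cdf(165)

# A "range" of zero width has zero area.
p_exactly_170 = heights.cdf(170) - heights.cdf(170)

# A very narrow bell has a curve height above 1, which no probability could have.
tall_density = NormalDist(mu=0, sigma=0.1).pdf(0)

print(round(density_at_170, 4))  # about 0.0399
print(round(p_165_to_175, 4))    # about 0.3829
print(p_exactly_170)             # 0.0
print(round(tall_density, 2))    # about 3.99
```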

The Gaussian: nature's favorite bell

One distribution shows up so often it earns its own name: the [[normal-distribution|normal distribution]], also called the Gaussian or the bell curve. It is the symmetric hump you have seen everywhere: heights, measurement errors, exam scores. Most values cluster near a central peak, and they thin out smoothly and symmetrically on both sides. Two numbers fully describe it: where the peak sits (its mean) and how wide the bell is (its standard deviation).
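As a small sketch of "two numbers describe it", the code below builds two bells with the same peak and different widths, using `statistics.NormalDist`. The values 100, 5, and 20 and the helper name `mass_within` are illustrative assumptions. Measured in units of its own width, every Gaussian holds the same share of probability near its peak.

```python
from statistics import NormalDist

# Same peak, different widths (numbers chosen only for illustration).
narrow = NormalDist(mu=100, sigma=5)
wide = NormalDist(mu=100, sigma=20)

def mass_within(dist, k):
    """Probability of landing within k widths (standard deviations) of the peak."""
    low = dist.mean - k * dist.stdev
    high = dist.mean + k * dist.stdev
    return dist.cdf(high) - dist.cdf(low)

for k in (1, 2, 3):
    # Both columns match: roughly 0.68, 0.95, and 0.997 for k = 1, 2, 3.
    print(k, round(mass_within(narrow, k), 4), round(mass_within(wide, k), 4))
```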

Why is it everywhere? A deep result called the Central Limit Theorem says that when many small, independent random effects add up, their total tends toward a Gaussian — almost regardless of what each little effect looked like. Human height is many genes and meals summed together; measurement noise is many tiny jitters. That "sum of many small things" pattern is so common that the bell curve becomes the natural default — and it is why so many ML methods quietly *assume* their noise is Gaussian.
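A quick simulation gives a feel for this. The sketch below adds up 30 small effects that are each flat (uniform between 0 and 1), nothing like a bell, and checks whether the totals behave the way a Gaussian would. The choice of 30 effects, 20,000 trials, and the seed are arbitrary assumptions for the demonstration.

```python
import random
from statistics import fmean, pstdev

rng = random.Random(0)  # fixed seed so the run is repeatable

def total_of_small_effects(n_effects=30):
    """Sum of many small independent effects, each uniform on [0, 1]."""
    return sum(rng.uniform(0, 1) for _ in range(n_effects))

totals = [total_of_small_effects() for _ in range(20_000)]

center = fmean(totals)   # theory says 30 * 0.5 = 15
spread = pstdev(totals)  # theory says sqrt(30 / 12), about 1.58

# For a Gaussian, about 68% of values fall within one spread of the center.
within_one_spread = sum(abs(t - center) <= spread for t in totals) / len(totals)

print(round(center, 2), round(spread, 2), round(within_one_spread, 3))
```

The individual effects are flat, yet the share of totals within one spread of the center should land close to the Gaussian's 68%.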

Expectation and variance: two numbers that summarize a cloud

A whole distribution is a lot to carry around, so we boil it down to two summary numbers. [[expectation-and-variance|Expectation]] (also called the mean or expected value) is the long-run average: the value you'd settle on if you ran the random process a zillion times and averaged. It is each outcome weighted by how likely it is — for a fair die, (1+2+3+4+5+6)/6 = 3.5.
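Both readings of expectation, the probability-weighted sum and the long-run average, can be checked side by side. This is a minimal sketch; the helper name `expectation`, the seed, and the 100,000 rolls are illustrative choices.

```python
import random
from fractions import Fraction

fair_die = {face: Fraction(1, 6) for face in range(1, 7)}

def expectation(dist):
    """Each outcome weighted by how likely it is, then summed."""
    return sum(value * p for value, p in dist.items())

exact = expectation(fair_die)
print(exact)  # 7/2, i.e. 3.5

# Long-run average: roll many times and take the plain mean.
rng = random.Random(0)
rolls = [rng.randint(1, 6) for _ in range(100_000)]
empirical = sum(rolls) / len(rolls)
print(round(empirical, 2))  # close to 3.5
```

Note that 3.5 is not a value the die can ever show. The expectation is a balance point, not a prediction of any single roll.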

The second number is variance: how far outcomes typically wander from that average. Small variance means results huddle tightly near the mean (predictable); large variance means they fling themselves all over (volatile). Because variance is built from squared distances its units are awkward, so we usually take its square root — the standard deviation — which lives in the same units as the data and is easier to picture. Two distributions can share the exact same mean yet feel completely different: one a narrow spike, the other a flat sprawl.
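Here is a small sketch of two distributions that share a mean but differ in spread. The two toy distributions `spike` and `sprawl` and the helper names are made up for illustration.

```python
import math

# Two toy distributions with the same mean of 4 (values chosen for illustration).
spike = {3: 0.25, 4: 0.5, 5: 0.25}
sprawl = {0: 0.25, 4: 0.5, 8: 0.25}

def mean(dist):
    """Probability-weighted average of the outcomes."""
    return sum(value * p for value, p in dist.items())

def variance(dist):
    """Probability-weighted average of squared distance from the mean."""
    m = mean(dist)
    return sum(p * (value - m) ** 2 for value, p in dist.items())

for name, dist in (("spike", spike), ("sprawl", sprawl)):
    var = variance(dist)
    print(name, mean(dist), var, round(math.sqrt(var), 3))
# spike  4.0 0.5 0.707
# sprawl 4.0 8.0 2.828
```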

These two ideas are load-bearing in ML. Training tries to minimize a model's expected error, using the average error on the training data as a stand-in for it. The famous bias–variance tradeoff, the usual way of explaining why models underfit or overfit, is named straight after variance. And stochastic gradient descent, which you'll meet later, works because a slope estimated from a small random batch has the *right expectation* even though any single batch is noisy.
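The claim about small batches can be checked on a toy problem. This sketch assumes a one-parameter model `y = w * x` with squared error and a synthetic dataset where `y` is roughly `3 * x`. The model, the data, the batch size of 10, and the 5,000 repetitions are all assumptions made for the demonstration, not a description of how any real training loop is written.

```python
import random
from statistics import fmean

rng = random.Random(0)  # fixed seed so the run is repeatable

# Toy data, assumed for illustration: y is roughly 3 * x plus a little noise.
xs = [rng.uniform(0, 1) for _ in range(1000)]
data = [(x, 3 * x + rng.gauss(0, 0.1)) for x in xs]

def gradient(w, examples):
    """Slope of the mean squared error of the model y = w * x, with respect to w."""
    return fmean(2 * (w * x - y) * x for x, y in examples)

w = 0.0
full_gradient = gradient(w, data)  # slope computed from every example

# Slopes estimated from many small random batches of 10 examples each.
batch_gradients = [gradient(w, rng.sample(data, 10)) for _ in range(5000)]
average_of_batches = fmean(batch_gradients)

print(round(full_gradient, 3))
print(round(min(batch_gradients), 3), round(max(batch_gradients), 3))  # single batches vary a lot
print(round(average_of_batches, 3))  # close to full_gradient
```

Any one batch can be well off the mark, but the batches are right on average, which is the property stochastic gradient descent leans on.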

From distributions to learning: likelihood and updating beliefs

Now the payoff. Suppose you have data and you suspect it came from a Gaussian, but you don't know its mean and variance. [[maximum-likelihood-estimation|Maximum likelihood estimation]] flips the question around: of all the bell curves you could draw, which one makes the data you actually saw look *most probable*? You pick the distribution under which your data is least surprising. That single idea — choose the parameters that best explain what you observed — is the engine behind a huge share of model training.

# pick the parameters that make the observed data most likely
best = argmax over theta of  P(data | theta)

# Bayes' rule: turn it around to update a belief
P(theta | data)  =  P(data | theta) * P(theta) / P(data)
#   posterior  is proportional to  likelihood * prior   (dividing by P(data) rescales the total to 1)

Two ways to learn from data: maximum likelihood picks the single best explanation; Bayes' theorem updates a prior belief into a posterior.
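For a Gaussian, the maximum likelihood answer has a known closed form: the sample mean, and the standard deviation computed by dividing by n rather than n - 1 (`statistics.pstdev`). The sketch below generates synthetic data from an assumed "true" bell (mean 5, spread 2), fits it that way, and checks that a few nearby bells explain the data less well. The data, the seed, and the helper name `log_likelihood` are illustrative assumptions. The code scores each bell by the log of the likelihood, which has the same maximizer and avoids multiplying hundreds of tiny numbers.

```python
import math
import random
from statistics import NormalDist, fmean, pstdev

rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic observations from an assumed "true" bell: mean 5, spread 2.
data = [rng.gauss(5.0, 2.0) for _ in range(500)]

def log_likelihood(mu, sigma, observations):
    """Log of P(data | theta) for a Gaussian with theta = (mu, sigma)."""
    dist = NormalDist(mu, sigma)
    return sum(math.log(dist.pdf(x)) for x in observations)

# Closed-form maximum likelihood estimates for a Gaussian.
mu_hat = fmean(data)
sigma_hat = pstdev(data)  # divides by n, which is the maximum likelihood choice

best = log_likelihood(mu_hat, sigma_hat, data)

# Nearby bells: shift the peak or change the width.
rivals = [
    (mu_hat + 0.5, sigma_hat),
    (mu_hat - 0.5, sigma_hat),
    (mu_hat, sigma_hat * 1.3),
    (mu_hat, sigma_hat * 0.7),
]

print(round(mu_hat, 2), round(sigma_hat, 2))  # near 5 and 2
print(all(best > log_likelihood(mu, sigma, data) for mu, sigma in rivals))  # True
```

For most models there is no closed form, and the same maximization is done numerically instead.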

There's a second, complementary view. [[bayes-theorem|Bayes' theorem]] is a rule for *updating* a belief when new evidence arrives: start with a prior (what you believed before), weigh it by how well the new data fits, and end with a posterior (your revised belief). A spam filter does exactly this — it begins with how common spam is, then nudges that estimate as it reads the words in your email. It is the mathematics of changing your mind in just the right amount.
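The spam example can be worked through with a few lines of arithmetic. Every number here is invented for illustration: a prior of 20% spam, and made-up rates for how often the words "free" and "meeting" appear in spam versus ordinary mail. The helper name `bayes_update` is hypothetical, and chaining two updates assumes the words are independent given the class, which is the simplification a naive Bayes filter makes.

```python
def bayes_update(prior, p_evidence_if_spam, p_evidence_if_ham):
    """Posterior P(spam | evidence) from the prior P(spam) and two likelihoods."""
    numerator = p_evidence_if_spam * prior                      # likelihood * prior
    p_evidence = numerator + p_evidence_if_ham * (1 - prior)    # P(data)
    return numerator / p_evidence

prior = 0.2  # assumed: 20% of all mail is spam

# The email contains "free": assumed to appear in 60% of spam and 5% of other mail.
after_free = bayes_update(prior, 0.6, 0.05)

# It also contains "meeting": assumed to appear in 10% of spam and 40% of other mail.
after_meeting = bayes_update(after_free, 0.1, 0.4)

print(round(after_free, 3))     # 0.75
print(round(after_meeting, 3))  # 0.429
```

The first word pushes the belief up from 0.2 to 0.75, and the second pulls it back down. Each posterior becomes the prior for the next piece of evidence.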

You don't need to memorize the formulas yet — that's later rungs' work. What matters now is the worldview: data is sampled from distributions, models are guesses about those distributions, and learning means tuning the guess so the observed data becomes as unsurprising as possible. Carry that one sentence forward and the probabilistic heart of machine learning will keep making sense.