JOVANA

EM & Latent Variables

What if the most important variable in your data is the one you never get to see? This guide follows that single idea — the hidden cause — from clustering with mixture models, through the EM algorithm, to variational inference and the modern VAE.

The Variable You Never Measure

By now you have met models that learn a mapping from inputs to outputs, and the Bayesian habit of carrying a prior, likelihood, and posterior through every prediction. This guide adds one deceptively small idea that reorganizes a huge amount of machine learning: sometimes the variable that best explains your data is one you never actually recorded. Customers have a 'segment' you never asked about; an email has a 'topic' nobody labeled; a recording of speech has a sequence of phonemes hidden inside the waveform. We call such an unobserved cause a latent variable.

Latent variables are not a trick or an approximation — they are how we encode a belief about structure we cannot see directly. We assume the data was generated in two steps: first nature draws a hidden value, then it produces the observation we actually measure. Most of the machinery in this rung exists to answer one hard question that follows: if we can only ever see the second step, how do we reason about the first?

Mixture Models: A Crowd of Hidden Sources

The cleanest example is a mixture model. Picture the heights of a thousand people you never sorted by anything. The histogram has two soft bumps. A single bell curve fits it badly, but two bell curves — one centered lower, one higher — fit it beautifully. The latent variable here is simply which bump each person belongs to. When those component curves are Gaussians, this is the Gaussian mixture model, or GMM: each data point is assumed to come from one of K clusters, but the label saying which one is hidden.
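The two-step story behind a mixture can be sketched in a few lines of Python. This is a minimal illustration, not a fitted model: the function name `sample_heights` and every number in it (two equally likely groups, means of 162 and 176 cm, spreads of 6 and 7 cm) are made-up assumptions chosen only to produce a two-bump histogram.

```python
import numpy as np

def sample_heights(n, weights, means, stds, seed=0):
    """Draw n heights from a Gaussian mixture using the two-step story."""
    rng = np.random.default_rng(seed)
    # Step 1: draw the hidden label (which bump) for each person.
    z = rng.choice(len(weights), size=n, p=weights)
    # Step 2: draw the observed height from that bump's Gaussian.
    x = rng.normal(np.asarray(means)[z], np.asarray(stds)[z])
    return x, z

# Hypothetical parameters, for illustration only.
heights, labels = sample_heights(
    1000, weights=[0.5, 0.5], means=[162.0, 176.0], stds=[6.0, 7.0]
)
```

In a real dataset only `heights` would be available; `labels` is exactly the latent variable the model has to reason about.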

Here is the chicken-and-egg knot. If someone handed you the hidden labels, fitting each Gaussian would be trivial: just compute the mean and spread of the points in each group. And if someone handed you the Gaussians, guessing each point's label would be easy: assign it to whichever curve makes it most likely, taking into account how common each cluster is. But you have neither. You want to find both the cluster shapes and the soft assignments at the same time, using nothing but the unlabeled heights. Plain maximum likelihood cannot be solved in one clean step, because the likelihood of each point is a sum over the clusters it might have come from, and the logarithm of that sum does not split into separate pieces you can solve one at a time.
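Writing the likelihood out makes the obstacle visible. The notation here is an assumption introduced for illustration: $x_n$ is the $n$-th height, and $\pi_k$, $\mu_k$, $\sigma_k$ are the mixing weight, mean, and spread of cluster $k$, collected in $\theta$.

```latex
\log p(x_1, \dots, x_N \mid \theta)
  = \sum_{n=1}^{N} \log \sum_{k=1}^{K}
      \pi_k \, \mathcal{N}\!\left(x_n \mid \mu_k, \sigma_k^2\right)
```

The inner sum sits inside the logarithm, so setting derivatives to zero ties every cluster's parameters to every other's. If the labels were known, each inner sum would collapse to a single term and the problem would separate into K ordinary Gaussian fits.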

Expectation–Maximization: Guess, Refine, Repeat

The escape from the chicken-and-egg knot is to stop demanding both answers at once and instead bootstrap. This is the expectation–maximization algorithm, or EM. Start with a wild guess for the cluster shapes. Then alternate two moves. In the E-step, hold the shapes fixed and compute, for every point, the probability it came from each cluster — these soft responsibilities are your current best belief about the hidden labels. In the M-step, hold those soft labels fixed and re-estimate each cluster's center and spread, weighting every point by how much it 'belongs.'

  1. Initialize: pick rough starting values for the K cluster means, spreads, and mixing weights. Even random ones will do to get started, though the final answer can depend on them.
  2. E-step: with the current clusters fixed, compute each point's soft responsibility — its probability of belonging to each cluster.
  3. M-step: with responsibilities fixed, re-estimate each cluster's mean and spread as responsibility-weighted averages over the points, and its mixing weight as its average responsibility.
  4. Repeat E and M until the fit stops improving — the data's likelihood is guaranteed never to go down.
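The four steps above can be sketched for one-dimensional data as follows. This is a minimal illustration under stated assumptions, not a production implementation: the function names, the initialization (random data points as means, the overall spread for every cluster), the stopping tolerance, and the toy data are all choices made here for the example.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian, evaluated elementwise."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def em_gmm_1d(x, k, n_iter=100, tol=1e-8, seed=0):
    """Fit a k-component 1-D Gaussian mixture with EM."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    # 1. Initialize: random data points as means, shared spread, equal weights.
    means = rng.choice(x, size=k, replace=False)
    stds = np.full(k, x.std())
    weights = np.full(k, 1.0 / k)
    log_liks = []
    for _ in range(n_iter):
        # 2. E-step: responsibilities, shape (n_points, k).
        joint = weights * gaussian_pdf(x[:, None], means, stds)
        total = joint.sum(axis=1, keepdims=True)
        resp = joint / total
        # Log-likelihood of the parameters used in this E-step.
        log_liks.append(float(np.log(total).sum()))
        # 3. M-step: responsibility-weighted re-estimates.
        nk = resp.sum(axis=0)
        means = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk
        stds = np.sqrt(var + 1e-9)  # tiny floor guards against a zero spread
        weights = nk / len(x)
        # 4. Stop once the fit has stopped improving.
        if len(log_liks) > 1 and log_liks[-1] - log_liks[-2] < tol:
            break
    return weights, means, stds, log_liks

# Toy data: two overlapping groups with made-up parameters.
rng = np.random.default_rng(1)
heights = np.concatenate([rng.normal(162.0, 6.0, 500), rng.normal(176.0, 7.0, 500)])
weights, means, stds, log_liks = em_gmm_1d(heights, k=2)
```

The returned `log_liks` list records the log-likelihood at each round, which makes it easy to check that the sequence does not decrease.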

Each round nudges the clusters toward a configuration that explains the data better, and a lovely theorem guarantees the data's likelihood never decreases from one round to the next. But be honest about what that buys you: EM climbs to a *local* optimum, not necessarily the best one. A bad initialization can leave it stuck in a mediocre solution, which is why practitioners run it several times from different starting points and keep the best. EM is not magic that finds hidden truth; it is a disciplined, monotonic hill-climb over a landscape you could not otherwise search.
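The guarantee can be sketched in three lines. The symbols are assumptions introduced here: $\theta^{(t)}$ are the cluster parameters at round $t$, $q$ is any distribution over the hidden labels $z$, and $\mathcal{F}$ is a lower bound on the log-likelihood.

```latex
\begin{align*}
\log p(x \mid \theta)
  &= \mathcal{F}(q, \theta)
   + \mathrm{KL}\!\left(q(z) \,\|\, p(z \mid x, \theta)\right),
  \qquad
  \mathcal{F}(q, \theta)
   = \mathbb{E}_{q}\!\left[\log p(x, z \mid \theta)\right]
   - \mathbb{E}_{q}\!\left[\log q(z)\right] \\
\text{E-step:}\quad
  & q^{(t)}(z) = p(z \mid x, \theta^{(t)})
  \;\Rightarrow\;
  \mathcal{F}(q^{(t)}, \theta^{(t)}) = \log p(x \mid \theta^{(t)}) \\
\text{M-step:}\quad
  & \theta^{(t+1)} = \arg\max_{\theta} \mathcal{F}(q^{(t)}, \theta)
  \;\Rightarrow\;
  \log p(x \mid \theta^{(t+1)})
   \ge \mathcal{F}(q^{(t)}, \theta^{(t+1)})
   \ge \mathcal{F}(q^{(t)}, \theta^{(t)})
   = \log p(x \mid \theta^{(t)})
\end{align*}
```

The first inequality holds because a KL divergence is never negative, the second because the M-step maximizes the bound. Nothing in the argument says the climb ends at the highest peak, which is why the local-optimum caveat stands.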

When the Hidden Side Gets Too Big: Variational Inference

EM's E-step needs the exact posterior over the latent variable: the precise probability of each hidden value given the data. For a GMM that posterior is a tidy little table over K clusters, so the E-step is cheap. But suppose the hidden side is a high-dimensional continuous vector, with the observation produced by a tangled neural network. Now the exact posterior requires an integral over every possible hidden vector, and that integral has no clean formula and is hopeless to compute by brute force. EM's clean E-step quietly breaks.
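The contrast shows up in the denominator of Bayes' rule. The notation is an assumption for illustration: $\pi_k$, $\mu_k$, $\sigma_k$ are the mixture parameters, and $p(x \mid z)$ is whatever the neural network defines.

```latex
\begin{align*}
\text{GMM (discrete } z\text{):}\quad
  p(z = k \mid x)
  &= \frac{\pi_k \, \mathcal{N}(x \mid \mu_k, \sigma_k^2)}
          {\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x \mid \mu_j, \sigma_j^2)}
  && \text{(a sum of } K \text{ terms)} \\
\text{Continuous } z\text{:}\quad
  p(z \mid x)
  &= \frac{p(x \mid z) \, p(z)}
          {\int p(x \mid z') \, p(z') \, dz'}
  && \text{(an integral over all of latent space)}
\end{align*}
```

The numerator is easy to evaluate in both cases. It is the normalizing integral in the second line that generally has no closed form when a neural network sits inside $p(x \mid z)$.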

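Variational inference is the rescue. Instead of computing the impossible true posterior, we pick a simpler family of distributions we *can* handle (say, plain Gaussians) and search within that family for the member that hugs the true posterior as tightly as possible. We have turned inference into optimization: tune the parameters of our stand-in distribution to minimize its mismatch with the truth. The mismatch we minimize is the KL divergence, the standard measure of how far one distribution sits from another; it is not a true distance, because it is not symmetric. Since the true posterior is unknown, the KL divergence cannot be evaluated directly, so in practice we maximize an equivalent quantity, the evidence lower bound or ELBO: raising the ELBO lowers the KL divergence by exactly the same amount.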
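For two one-dimensional Gaussians the KL divergence has a closed form, which makes it easy to get a feel for the quantity. The sketch below is only an illustration: the function name `kl_gaussians` and the example numbers are assumptions, and in a real problem the second distribution would be the unknown posterior rather than a Gaussian you can write down.

```python
import math

def kl_gaussians(mean_q, std_q, mean_p, std_p):
    """KL( N(mean_q, std_q^2) || N(mean_p, std_p^2) ) for 1-D Gaussians."""
    return (
        math.log(std_p / std_q)
        + (std_q ** 2 + (mean_q - mean_p) ** 2) / (2.0 * std_p ** 2)
        - 0.5
    )

same = kl_gaussians(0.0, 1.0, 0.0, 1.0)     # identical distributions -> 0.0
forward = kl_gaussians(0.0, 1.0, 1.0, 2.0)  # about 0.443
reverse = kl_gaussians(1.0, 2.0, 0.0, 1.0)  # about 1.307
```

The two directions give different numbers for the same pair of distributions, so the order of the arguments is part of the modeling choice.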

The VAE: Latent Variables Meet Deep Learning

Put variational inference together with neural networks and you get the variational autoencoder, or VAE, one of the cleanest bridges between this rung's probabilistic ideas and the deep learning you already know. The picture is two networks facing each other. An encoder reads an image and, instead of outputting one code, outputs a *distribution* over a latent vector: a mean and a spread. A point is then sampled from that distribution, and a decoder tries to rebuild the original image from it.

x  --[encoder]-->  q(z|x) = mean, spread
                       |  sample z
                       v
z  --[decoder]-->  reconstruction of x

loss = reconstruction error  +  KL( q(z|x) || prior )

The VAE in one sketch: encode to a distribution, sample a latent z, decode it back. The KL term keeps the latent space tidy.
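The loss in the sketch can be computed in a few lines of NumPy. This is a toy illustration under loud assumptions: the encoder and decoder are single linear maps rather than deep networks, the reconstruction error is plain squared error, the prior is a standard Gaussian, and the name `vae_loss`, the weight names, and the array sizes are invented for the example. No training step is shown.

```python
import numpy as np

def vae_loss(x, enc_w_mean, enc_w_logvar, dec_w, rng):
    """Return (total loss, reconstruction term, KL term), averaged over the batch."""
    # Encoder: map each input to the mean and log-variance of q(z|x).
    mean = x @ enc_w_mean
    logvar = x @ enc_w_logvar
    # Sample z from q(z|x) as mean + spread * standard normal noise.
    eps = rng.standard_normal(mean.shape)
    z = mean + np.exp(0.5 * logvar) * eps
    # Decoder: map the sampled z back to input space.
    x_hat = z @ dec_w
    # Reconstruction error: squared error per example (an assumed choice).
    recon = np.sum((x - x_hat) ** 2, axis=1)
    # KL( q(z|x) || N(0, I) ) in closed form for a diagonal Gaussian.
    kl = 0.5 * np.sum(np.exp(logvar) + mean ** 2 - 1.0 - logvar, axis=1)
    return np.mean(recon + kl), np.mean(recon), np.mean(kl)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))                  # 8 toy inputs, 4 values each
enc_w_mean = 0.1 * rng.standard_normal((4, 2))   # hypothetical encoder weights
enc_w_logvar = 0.1 * rng.standard_normal((4, 2))
dec_w = 0.1 * rng.standard_normal((2, 4))        # hypothetical decoder weights
total, recon, kl = vae_loss(x, enc_w_mean, enc_w_logvar, dec_w, rng)
```

The KL term is zero exactly when the encoder outputs a mean of 0 and a log-variance of 0 for every input, which is the standard Gaussian prior itself.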

The training objective is the ELBO, the evidence lower bound that variational inference maximizes in place of the intractable KL divergence from the last section, now wearing work clothes. Minimizing the loss in the sketch above is the same as maximizing the ELBO. It has two terms pulling against each other. The reconstruction term rewards the decoder for rebuilding the input faithfully. The KL term pulls every encoded distribution toward a simple shared prior, usually a standard Gaussian, so the latent space does not fragment into disconnected islands. That second pressure is what makes a VAE *generative*: because the latent space is smooth and well-organized, you can throw away the encoder, draw a fresh random z from the prior, run it through the decoder, and get a brand-new sample that never existed in the training set.
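Written as a formula, the two terms look like this. The notation is an assumption for illustration: $q(z \mid x)$ is the encoder's distribution, $p(x \mid z)$ is the decoder's, and $p(z)$ is the prior.

```latex
\mathrm{ELBO}(x)
  = \underbrace{\mathbb{E}_{q(z \mid x)}\!\left[\log p(x \mid z)\right]}_{\text{reconstruction term}}
  \;-\;
    \underbrace{\mathrm{KL}\!\left(q(z \mid x) \,\|\, p(z)\right)}_{\text{pull toward the prior}}
  \;\le\; \log p(x),
\qquad
\text{loss} = -\,\mathrm{ELBO}(x)
```

A high expected log-probability of the input under the decoder corresponds to a low reconstruction error, so the first term matches the reconstruction part of the loss up to sign.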

It helps to see the whole arc as one idea wearing different costumes. The GMM is a latent-variable model whose hidden variable is a discrete cluster label. The VAE is a latent-variable model whose hidden variable is a continuous vector and whose decoder is a deep network. EM and variational inference are two ways of doing the same job, reasoning about the hidden side: one exact and small, one approximate and scalable. Same skeleton, different muscles.

What This Buys You — and What It Doesn't

The honest framing is worth keeping. A latent variable is a *modeling assumption*, not a discovered fact. When a GMM splits your customers into three clusters, it found three clusters because you asked for three and because Gaussians were a convenient story — not because nature stamped exactly three types of people. The latent structure is real only to the extent your assumptions match the world, and a different K or a different distribution family would tell a different tale. Treat the recovered clusters as a useful lens, not a law.

And mind the words. A VAE that generates faces is not 'imagining' or 'understanding' anything — it is sampling from a smooth latent space shaped to make reconstructions plausible. Its blurry, slightly-too-smooth outputs are a direct fingerprint of the approximations we made: a simple prior, a chosen family, a bound rather than the exact likelihood. The payoff of all this machinery is genuine and concrete, though. You get a way to learn from unlabeled data, to compress observations into a meaningful latent code, to generate new samples, and — crucially for this whole rung — to attach honest uncertainty to your beliefs about the hidden side, rather than pretending you know it exactly.