Graphical Models & HMMs

A picture of what depends on what

By now you can write down a joint probability distribution over a handful of variables. But imagine a world with twenty variables, each taking ten values. The full joint table would have ten to the twentieth entries — more numbers than you could ever estimate or store. The whole game of Bayesian inference grinds to a halt not because the math is wrong, but because the bookkeeping is impossible.

A probabilistic graphical model, or graphical model for short, escapes this by exploiting a fact about real systems: most variables don't directly depend on most others. Your morning alarm depends on the day of the week; it does not depend on the weather in another country. If you draw variables as dots and connect only those that directly influence each other, the resulting graph is usually sparse, and that sparsity is exactly what lets the joint distribution factor into small, manageable pieces.

Bayesian networks: arrows that mean "causes"

A Bayesian network is a graphical model whose edges are arrows — a directed graph with no cycles. Each random variable gets one arrow coming in from each of its direct causes, called its parents. The promise of the model is simple and strong: the full joint distribution is just the product of one small table per variable, where each table gives that variable's probability given only its parents.
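To see how much the factorization saves, here is a rough count of table entries. It is only a sketch under an assumed structure: the twenty ten-valued variables from the opening example are arranged in a simple chain, so every variable has at most one parent. The function names and the chain layout are illustrative assumptions, not something the model requires.

```python
# Count table entries: full joint table vs. one small table per variable.
# Assumption (illustrative only): the variables form a chain, so each one
# has at most one parent.

def full_joint_entries(num_vars, num_values):
    # One entry for every joint configuration of all variables.
    return num_values ** num_vars

def factored_entries(num_values, parent_counts):
    # A variable with k parents needs a table with num_values ** (k + 1) entries.
    return sum(num_values ** (k + 1) for k in parent_counts)

num_vars, num_values = 20, 10
chain_parents = [0] + [1] * (num_vars - 1)  # first variable has no parent

full = full_joint_entries(num_vars, num_values)
factored = factored_entries(num_values, chain_parents)

print(full)      # 100000000000000000000
print(factored)  # 1910
```

A denser graph costs more: each extra parent multiplies that variable's table by ten in this setting, which is why sparsity matters so much.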

The classic toy example: clouds make rain more likely; rain wets the grass; but the grass can also be wet because the sprinkler ran. Draw four dots and three arrows, and you have a miniature version of the reasoning engines used for tasks like medical diagnosis. Observe wet grass and you can reason backward, using Bayes' theorem, to the probability it rained. Crucially, if you also learn the sprinkler was on, the rain explanation becomes less likely than it was given wet grass alone. That pattern, where one cause being confirmed lowers belief in a competing cause, is called explaining away, and a Bayesian network captures it automatically.
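Explaining away is easy to check numerically on a network this small. The sketch below builds the four-variable, three-arrow network (cloudy to rain, rain to wet grass, sprinkler to wet grass) and answers queries by brute-force enumeration. Every probability in it is a made-up number chosen for illustration, and the helper names are hypothetical.

```python
from itertools import product

# Hypothetical probabilities, chosen only for illustration.
P_CLOUDY = 0.5
P_RAIN_GIVEN_CLOUDY = {True: 0.8, False: 0.1}
P_SPRINKLER = 0.3
P_WET_GIVEN = {  # keyed by (rain, sprinkler)
    (True, True): 0.99, (True, False): 0.9,
    (False, True): 0.9, (False, False): 0.01,
}

def bern(p, value):
    # Probability that a yes/no variable with P(yes) = p takes this value.
    return p if value else 1.0 - p

def joint(cloudy, rain, sprinkler, wet):
    # The joint is the product of one small table per variable,
    # each conditioned only on that variable's parents.
    return (bern(P_CLOUDY, cloudy)
            * bern(P_RAIN_GIVEN_CLOUDY[cloudy], rain)
            * bern(P_SPRINKLER, sprinkler)
            * bern(P_WET_GIVEN[(rain, sprinkler)], wet))

def prob_rain(**evidence):
    # P(rain | evidence), summing the joint over all consistent worlds.
    names = ("cloudy", "rain", "sprinkler", "wet")
    num = den = 0.0
    for values in product([True, False], repeat=4):
        world = dict(zip(names, values))
        if any(world[k] != v for k, v in evidence.items()):
            continue
        p = joint(**world)
        den += p
        if world["rain"]:
            num += p
    return num / den

p_prior = prob_rain()
p_wet = prob_rain(wet=True)
p_wet_sprinkler = prob_rain(wet=True, sprinkler=True)
print(round(p_prior, 3), round(p_wet, 3), round(p_wet_sprinkler, 3))
```

With these numbers, seeing wet grass raises the belief in rain well above its prior, and then learning the sprinkler ran pulls it back down. Enumeration like this only works for tiny networks; it is here to make the pattern visible, not as a practical inference method.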

Markov random fields: when there is no "first"

Arrows are perfect for cause and effect, but some relationships have no direction. Think of the pixels in an image: a pixel tends to have the same color as its neighbors, yet there is no sense in which the left pixel causes the right one. The influence is mutual. For these, we use a Markov random field — an undirected graphical model, where an edge simply says "these two variables prefer to agree (or relate) somehow," with no winner and no first mover.

Because there is no parent-child order, we cannot just multiply small conditional tables. Instead we assign each group of mutually connected variables (a clique) a positive score, higher when the configuration is "agreeable," multiply all the scores, and divide by a grand normalizing sum so the result is a valid distribution. That normalizing constant, summed over every possible joint configuration, is the price we pay: it is often astronomically expensive to compute exactly. This is why undirected models lean heavily on approximate methods such as Markov chain Monte Carlo, and especially its Gibbs sampling variant, which sidestep the constant by sampling instead of summing.
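Here is a minimal sketch of both halves of that story on a toy 2x2 grid of binary pixels. The first part computes the normalizing constant by brute force, which is possible only because there are 16 configurations. The second part is a bare-bones Gibbs sampler that never touches the constant. The grid, the coupling strength, and all names are assumptions made for illustration.

```python
import random
from itertools import product
from math import exp

# 2x2 grid of binary pixels, numbered 0 1 / 2 3; edges join neighbors.
EDGES = [(0, 1), (2, 3), (0, 2), (1, 3)]
COUPLING = 1.0  # hypothetical strength of the "prefer to agree" score

def score(config):
    # Product of edge scores: exp(COUPLING) if the two pixels agree, else 1.
    s = 1.0
    for i, j in EDGES:
        s *= exp(COUPLING) if config[i] == config[j] else 1.0
    return s

# Exact route: sum the score over every joint configuration (2 ** 4 here).
configs = list(product([0, 1], repeat=4))
Z = sum(score(c) for c in configs)

def prob(config):
    return score(config) / Z

exact_all_agree = prob((0, 0, 0, 0)) + prob((1, 1, 1, 1))

# Sampling route: resample one pixel at a time from its local conditional.
NEIGHBORS = {i: [] for i in range(4)}
for i, j in EDGES:
    NEIGHBORS[i].append(j)
    NEIGHBORS[j].append(i)

def gibbs_all_agree_fraction(num_sweeps, seed=0):
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(4)]
    hits = 0
    for _ in range(num_sweeps):
        for i in range(4):
            # Local scores for x[i] = 0 and x[i] = 1; Z cancels in the ratio.
            w = [1.0, 1.0]
            for v in (0, 1):
                for j in NEIGHBORS[i]:
                    if x[j] == v:
                        w[v] *= exp(COUPLING)
            x[i] = 1 if rng.random() < w[1] / (w[0] + w[1]) else 0
        if len(set(x)) == 1:
            hits += 1
    return hits / num_sweeps

estimate = gibbs_all_agree_fraction(20000)
print(exact_all_agree, estimate)
```

The two printed numbers should land close to each other. The point of the comparison is scale: the exact sum doubles with every added pixel, while each Gibbs update only ever looks at a pixel's immediate neighbors.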

The two families are not rivals; they are tools for different shapes of knowledge. Directed networks shine when you have a generative story — first this happens, then that. Undirected fields shine when influence is symmetric and local — pixels, social ties, words in a grid. Many real systems are best drawn with a mix of both.

The hidden Markov model: a chain you cannot see

Now specialize the idea to sequences. A hidden Markov model (HMM) is the simplest interesting graphical model for data that unfolds over time. It posits a chain of hidden states — a latent variable at each time step that you never directly observe — where each state depends only on the one just before it (that is the "Markov" assumption: the future forgets all but the present). At every step the hidden state emits a visible observation, and those observations are all you actually see.

The picture below makes the wiring concrete. Read the top row as the secret truth marching forward in time, and the bottom row as the noisy clues it drops along the way. Classic example: the hidden states are the words a person meant to say, and the observations are the garbled audio your phone recorded.

  z1  ->  z2  ->  z3  ->  z4     hidden states (latent)
  |       |       |       |
  v       v       v       v
  x1      x2      x3      x4      observations (visible)

An HMM: each hidden state depends only on the previous one and emits one observation.

An HMM is fully described by three things: how likely each starting state is, a transition table (the chance of moving from one hidden state to another), and an emission table (the chance each state produces each observation). Notice the family resemblance: a single time step of an HMM, taken on its own, is just a mixture model, with the hidden state playing the role of the cluster label. An HMM is essentially a mixture model in motion, the temporal cousin of the Gaussian mixture you met earlier, in which the chance of each component being active at one step depends on which component was active at the step before.
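Written as a formula, those three ingredients multiply together along the chain. The symbols are notation assumed here for convenience, not taken from the text: \pi for the starting probabilities, A for the transition table, B for the emission table, and T for the sequence length.

```latex
p(z_{1:T}, x_{1:T}) = \pi_{z_1} \, B_{z_1 x_1} \prod_{t=2}^{T} A_{z_{t-1} z_t} \, B_{z_t x_t}

% Collapsing the chain to T = 1 and summing out the hidden state gives a mixture:
p(x_1) = \sum_{k} \pi_k \, B_{k x_1}
```

The second line is the mixture-model connection in symbols: the starting probabilities act as mixing weights and each row of the emission table acts as one component.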

Three questions, three answers

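What makes the HMM so beloved is that the questions you'd want to ask all have fast answers, thanks to the chain structure. There are really three of them, and it is worth holding them apart in your head. The first two are answered exactly; the third is answered by an iterative procedure that climbs to a local optimum, which is not guaranteed to be the best possible fit.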

Likelihood — given a model and a sequence of observations, how probable is that sequence? Solved by the forward algorithm, which sweeps left to right summing over all hidden paths at once instead of enumerating the exponentially many of them.
Decoding — what is the single most likely sequence of hidden states behind the observations? Solved by the Viterbi algorithm, which finds the best path through the chain with simple dynamic programming.
Learning — given only observations and no labels, what transition and emission tables best explain them? Solved by Baum-Welch, which is just the EM algorithm from the previous guide applied to a chain: guess the tables, infer the likely hidden states, re-estimate the tables, repeat.
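The first two questions fit in a few lines each. The sketch below uses a hypothetical two-state weather HMM with made-up probabilities, and checks both algorithms against brute-force enumeration of every hidden path, which is feasible only because the sequence is three steps long. It multiplies raw probabilities, so a practical version for long sequences would work in log space or rescale at each step.

```python
from itertools import product

# Hypothetical two-state weather HMM; all numbers are illustrative only.
STATES = ("Rainy", "Sunny")
START = {"Rainy": 0.6, "Sunny": 0.4}
TRANS = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
         "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
EMIT = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
        "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

def forward_likelihood(obs):
    """P(obs), summing over all hidden paths one time step at a time."""
    alpha = {s: START[s] * EMIT[s][obs[0]] for s in STATES}
    for x in obs[1:]:
        alpha = {s: EMIT[s][x] * sum(alpha[r] * TRANS[r][s] for r in STATES)
                 for s in STATES}
    return sum(alpha.values())

def viterbi(obs):
    """Most likely hidden path and its joint probability with obs."""
    # best[s] = (probability, path) of the best path ending in state s
    best = {s: (START[s] * EMIT[s][obs[0]], [s]) for s in STATES}
    for x in obs[1:]:
        step = {}
        for s in STATES:
            prob, path = max(
                ((best[r][0] * TRANS[r][s], best[r][1]) for r in STATES),
                key=lambda pair: pair[0])
            step[s] = (prob * EMIT[s][x], path + [s])
        best = step
    prob, path = max(best.values(), key=lambda pair: pair[0])
    return path, prob

def joint(path, obs):
    """P(path, obs) for one fully specified hidden path."""
    p = START[path[0]] * EMIT[path[0]][obs[0]]
    for prev, cur, x in zip(path, path[1:], obs[1:]):
        p *= TRANS[prev][cur] * EMIT[cur][x]
    return p

obs = ["walk", "shop", "clean"]
likelihood = forward_likelihood(obs)
best_path, best_prob = viterbi(obs)

# Brute-force check: enumerate all K^T paths (only feasible for tiny T).
all_paths = list(product(STATES, repeat=len(obs)))
brute_likelihood = sum(joint(p, obs) for p in all_paths)
brute_best = max(all_paths, key=lambda p: joint(p, obs))

print(likelihood, brute_likelihood)
print(best_path, list(brute_best))
```

Notice that the two algorithms share one skeleton: the forward pass sums over the previous state where Viterbi takes a maximum. That swap is the whole difference between "how likely is this sequence" and "what is the best single explanation."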

This is the payoff of structure: the very same EM machinery you used to fit clusters now learns a model of time, with no labeled data at all. HMMs powered a generation of speech recognizers and remain a clean way to do part-of-speech tagging, gene finding, and other forms of sequence labeling like named-entity recognition.
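For the curious, here is a bare-bones sketch of that learning loop for a single observation sequence. The observation sequence, the starting guesses, and the function names are all invented for illustration. It skips the rescaling a practical implementation needs for long sequences, so treat it as a picture of the idea under those assumptions.

```python
def forward(obs, start, trans, emit):
    # alpha[t][i] = P(x_1..x_t, z_t = i)
    n = len(start)
    alpha = [[start[i] * emit[i][obs[0]] for i in range(n)]]
    for x in obs[1:]:
        prev = alpha[-1]
        alpha.append([emit[j][x] * sum(prev[i] * trans[i][j] for i in range(n))
                      for j in range(n)])
    return alpha

def backward(obs, trans, emit):
    # beta[t][i] = P(x_{t+1}..x_T | z_t = i)
    n = len(trans)
    beta = [[1.0] * n]
    for x in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(trans[i][j] * emit[j][x] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta

def baum_welch_step(obs, start, trans, emit):
    n, m, T = len(start), len(emit[0]), len(obs)
    alpha = forward(obs, start, trans, emit)
    beta = backward(obs, trans, emit)
    likelihood = sum(alpha[-1])
    # E-step. gamma[t][i]: posterior probability of state i at time t.
    gamma = [[alpha[t][i] * beta[t][i] / likelihood for i in range(n)]
             for t in range(T)]
    # xi_sum[i][j]: expected number of i -> j transitions.
    xi_sum = [[sum(alpha[t][i] * trans[i][j] * emit[j][obs[t + 1]] * beta[t + 1][j]
                   for t in range(T - 1)) / likelihood
               for j in range(n)] for i in range(n)]
    # M-step. Re-estimate every table from the expected counts.
    new_start = gamma[0][:]
    new_trans = [[xi_sum[i][j] / sum(xi_sum[i]) for j in range(n)]
                 for i in range(n)]
    new_emit = [[sum(gamma[t][i] for t in range(T) if obs[t] == k)
                 / sum(gamma[t][i] for t in range(T))
                 for k in range(m)] for i in range(n)]
    # The returned likelihood is that of the tables passed in, before the update.
    return new_start, new_trans, new_emit, likelihood

# Made-up observations over three symbols (0, 1, 2) and rough initial guesses.
obs = [0, 0, 1, 0, 0, 2, 2, 1, 2, 2, 0, 0, 1, 0, 2, 2, 2, 1, 0, 0]
start = [0.5, 0.5]
trans = [[0.6, 0.4], [0.3, 0.7]]
emit = [[0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]

likelihoods = []
for _ in range(20):
    start, trans, emit, likelihood = baum_welch_step(obs, start, trans, emit)
    likelihoods.append(likelihood)

print(likelihoods[0], likelihoods[-1])
```

The likelihood of the observations never goes down from one round to the next, which is the general guarantee EM provides. Where it ends up depends on the starting guess, so in practice people often run it from several random starts and keep the best result.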

Where they fit, honestly

It would be easy to think modern deep learning made all of this obsolete. It did not; it changed where the lines are drawn. The Markov assumption is genuinely limiting: an HMM forgets everything but the current state, so it struggles with long-range dependencies that a recurrent network or a transformer handles far better. For raw accuracy on big labeled corpora, neural sequence models usually win.

But graphical models still earn their place for three honest reasons. They are interpretable: the arrows and tables mean something a person can read and argue with. They are data-frugal: a well-structured Bayesian network can reason sensibly from a handful of examples where a deep network would just overfit. And they handle missing data and uncertainty natively — you can observe some variables, leave others blank, and still get a coherent answer with calibrated confidence, not a guess dressed up as certainty.

The deeper lesson outlives any one technique. The two big ideas here — factor a hard problem along a graph of dependencies, and reason about hidden causes from visible effects — reappear everywhere downstream. When exact inference gets too expensive, the next guides reach for variational inference and sampling. And the modern probabilistic toolkit, from variational autoencoders to deep latent-variable models, is in large part these same graphical-model ideas wearing neural-network clothes.