Markov Chain Monte Carlo: Sampling the Intractable

The wall plain Monte Carlo hits

The previous guide showed how Monte Carlo turns hard integrals and probabilities into something you can estimate by averaging over draws: if you can sample X1, X2, ..., Xn from a distribution, then E[f(X)] is approximately the average of f(Xi), and the law of large numbers (its plain i.i.d. form) makes that average converge. But notice the silent assumption hiding in plain sight: *you can draw the samples in the first place*. For a Normal or an exponential, fine — the inverse-transform method or a clever recipe gives you draws on demand. The trouble starts when the distribution you care about is one you can only describe, not sample from.

Bayesian inference, from the first guide of this rung, hands you exactly that kind of distribution. The posterior is P(theta given data) = P(data given theta) P(theta) / P(data), and the denominator P(data) — the evidence — is an integral over all values of theta that is usually impossible to compute in closed form. So you know the posterior's *shape* perfectly: it is proportional to likelihood times prior. You just do not know the normalizing constant that makes it integrate to 1. Writing q(theta) = P(data given theta) P(theta), you have the posterior = q(theta) / Z where Z is unknown. How do you sample from a curve when you cannot even measure how much area sits under it?

The idea: build a chain that lives where you want

Here MCMC borrows from the Markov-chains rung. Recall that a well-behaved Markov chain — irreducible and aperiodic — has a unique stationary distribution pi, the solution of pi P = pi, and the chain forgets where it started and settles into pi: in the long run, the fraction of time spent in each state matches pi exactly. MCMC turns that fact upside down. Instead of being handed a chain and asked for its stationary distribution, we are handed the distribution we want (the posterior) and asked to *design a chain whose stationary distribution is that very target*. Then we simply run the chain and read off where it goes.

Once such a chain exists, the run gives you a long trajectory theta_1, theta_2, theta_3, ..., and after an initial burn-in stretch (the early steps still tainted by the arbitrary starting point), the remaining states are effectively samples from the posterior. They are not independent — each step nudges from the last — but a Markov-chain version of the law of large numbers, the ergodic theorem, guarantees that time-averages along the trajectory still converge to the true posterior expectation E[f(theta)]. So Monte Carlo estimation works exactly as before; we have only swapped independent draws for a dependent-but-correctly-distributed walk.

Detailed balance: the engineer's shortcut

How do you design a chain to have a prescribed stationary distribution? Solving pi P = pi directly for P is hopeless — that is one equation with a whole matrix of unknowns. The trick is a stronger, much easier-to-arrange condition called detailed balance: a chain satisfies it if pi(x) P(x to y) = pi(y) P(y to x) for every pair of states. In words, at equilibrium the flow of probability from x to y exactly cancels the flow back from y to x — the chain is *reversible*. Detailed balance is sufficient (not necessary) for pi to be stationary: sum detailed balance over all x and the pi P = pi equation falls out for free.

And here is the payoff that solves our normalizing-constant problem. Detailed balance involves pi only through *ratios*, pi(y) / pi(x). Since the posterior is q(theta) / Z with the same unknown Z on top and bottom, that Z cancels: pi(y) / pi(x) = q(y) / q(x). We never need Z. The whole design can be done knowing the target only up to a constant — which, for a Bayesian posterior, is precisely all we have. Detailed balance is what lets MCMC sidestep the intractable evidence integral entirely.

Detailed balance:   pi(x) P(x->y) = pi(y) P(y->x)

Sum over x:         sum_x pi(x) P(x->y) = pi(y) sum_x P(y->x)
                                        = pi(y) * 1
  =>  (pi P)(y) = pi(y)    for every y      i.e.   pi P = pi

Key:  pi appears only as the ratio  pi(y)/pi(x)  =  q(y)/q(x)
      so the unknown normalizer Z in pi = q/Z cancels.

Detailed balance implies stationarity, and depends on the target only through ratios — so Z drops out.

Metropolis-Hastings: propose, then accept or reject

The Metropolis-Hastings algorithm is a recipe that builds a detailed-balance chain out of almost any proposal you like. You supply a proposal distribution g(y given x) that suggests a candidate next state y given the current x — for example, "jitter the current theta by a small Normal step". Then you accept or reject that candidate with a carefully chosen probability, and that acceptance step is what tilts the dynamics until detailed balance holds for your target. The genius is that the awkward proposal can be anything reasonable; the acceptance rule cleans up after it.

Start anywhere: pick an initial theta_0 in the parameter space (a rough guess is fine).
Propose a candidate y by drawing from g(y given current x) — e.g. a small Normal jump around x.
Compute the acceptance ratio a = min(1, [q(y) g(x given y)] / [q(x) g(y given x)]). Note q appears only as a ratio, so Z cancels.
Draw a uniform u in [0,1]: if u < a, move to y (accept); otherwise stay at x (reject) and record x again.
Repeat for many steps; discard the burn-in, then treat the kept states as posterior samples.

When the proposal is symmetric — g(y given x) = g(x given y), as a Normal jitter is — the g terms cancel and the ratio collapses to the original Metropolis form a = min(1, q(y) / q(x)). Read it plainly: if the candidate has higher target density, always accept; if lower, accept with probability equal to the ratio of densities. So the chain eagerly climbs toward high-probability regions but still, every so often, steps downhill — and that willingness to occasionally go downhill is exactly what stops it getting stuck on one peak and lets it map the *whole* posterior. A practical tuning point: the step size matters a lot. Steps too big are almost always rejected (the chain freezes); steps too small are always accepted but crawl (high correlation). A rule of thumb aims for an acceptance rate somewhere in the rough neighbourhood of 20-50%.

Gibbs sampling: when you can do one coordinate at a time

Many real models have several parameters — say theta = (a, b, c) — and although the joint posterior is intractable, each parameter's *conditional* distribution given all the others is something you can sample from directly (often because that conditional turns out to be a familiar conjugate form). The Gibbs sampler exploits exactly this. It sweeps through the coordinates one at a time, redrawing each from its full conditional while holding the rest fixed: draw a_new from P(a given b, c), then b_new from P(b given a_new, c), then c_new from P(c given a_new, b_new), and loop. No tuning, no rejections — every proposal is accepted, because Gibbs is secretly a special case of Metropolis-Hastings whose acceptance probability is always exactly 1.

Gibbs is wonderfully clean when the conditionals are easy, which is why it powered much early applied Bayesian work. But it has a real weakness worth naming honestly: when parameters are strongly correlated in the posterior, updating one coordinate at a time forces the chain to inch along a narrow diagonal ridge in tiny axis-aligned steps, so it mixes painfully slowly. There is also no free lunch on the universal headaches: like all MCMC, both Gibbs and Metropolis-Hastings can converge slowly, can be fooled into missing a far-off mode of a multimodal posterior, and need real diagnostics — running several chains from different starts and checking they agree — before you trust the output. Correct in the limit is not the same as correct after the run you actually did.