Multi-Step Transitions and Chapman-Kolmogorov

From one step to two: summing over the middle

In guide 1 you met the [[transition-matrix|transition matrix]] P, whose entry P(i, j) is the probability of jumping from state i to state j in a single step. That single number answers "where next?" But almost every real question about a Markov chain is about the longer run: a board game asks where your token will be in three turns, a weather model asks the chance of rain a week from now. So we want a new quantity, the two-step transition probability, written P^(2)(i, j): the probability that the chain, starting in i, is in j exactly two steps later — no matter what it did in between.

Here is the one idea that unlocks everything. To get from i to j in two steps, the chain must pass through *some* intermediate state k at the halfway point — and exactly one such k happens on any given trip. The intermediate states are mutually exclusive and exhaustive, so by the law of total probability we add up the probability of every route i -> k -> j. The probability of one such route is P(i, k) times P(k, j): the first factor is the chance of the first hop, the second is the chance of the second hop. Crucially, we may multiply them because the Markov property says that once you have landed at k, the past i is forgotten — the second hop depends on k alone.

P^(2)(i, j) = sum over all k of  P(i, k) * P(k, j)

             first hop  ---^         ^--- second hop
             i -> k                  k -> j

The two-step probability is a sum over every possible middle state k — every route i -> k -> j.

That sum is matrix multiplication in disguise

Look hard at that formula: "sum over k of P(i, k) times P(k, j)". If you have met matrix multiplication on the linear-algebra rungs, your pulse should quicken — that is *exactly* the recipe for the (i, j) entry of the product of P with itself. In other words, the matrix of two-step probabilities is simply P^2 = P times P. We did not have to invent a new operation; the algebra you already know hands it to us for free. This is the small miracle that makes Markov chains so computable.

The pattern keeps going with no new effort. To go from i to j in n steps, condition on where you are after the first n - 1 steps (or equivalently after the first step) and add up. The matrix that collects all the n-step probabilities is the [[n-step-transition-matrix|n-step transition matrix]], and it is nothing but the n-th power of P: the matrix P^(n) equals P^n. Want the chance of rain seven days out? Raise P to the seventh power and read off the row for today's weather. The geometry of a long random journey has been compressed into one tidy operation: repeated squaring of a single small matrix.

Chapman-Kolmogorov: splitting a journey anywhere

The two-step argument was a special case of a far more general truth, and it deserves its full name. The [[chapman-kolmogorov-equations|Chapman-Kolmogorov equation]] says you may break an (m + n)-step journey at *any* intermediate time you like. Pick a cut point: the chain must be in some state k right then, so P^(m+n)(i, j) = sum over k of P^(m)(i, k) times P^(n)(k, j). In matrix form this is the almost embarrassingly simple statement P^(m+n) = P^m times P^n — the law of exponents for matrices. Two steps used the cut m = n = 1; the equation lets you cut anywhere.

Why does it matter that we can cut *anywhere*? Because it is the formal engine behind every later result in this rung. When we ask whether a chain settles into a long-run equilibrium, we study what P^n does as n grows, and Chapman-Kolmogorov is what lets us relate P^(2n) to P^n. When we classify states as recurrent or transient in the next guide, the test is whether the entries P^(n)(i, i) sum to infinity — a question only the n-step probabilities can pose. The equation is not a one-off trick; it is the grammar in which the whole subject is written.

Where the chain actually is: distributions, not single points

So far we fixed a starting state i. But usually we are unsure where the chain begins — we only have an [[initial-distribution|initial distribution]], a row vector pi_0 whose entries are the probabilities of starting in each state and which must sum to 1. The natural question becomes: what is the distribution after n steps? The answer is one more line of the same algebra. Multiply the starting row vector on the right by P once for each step: the distribution after n steps is pi_n = pi_0 times P^n. Each multiplication by P advances the whole probability cloud one tick forward in time.

Write the current belief about the chain's location as a row vector pi whose entries are non-negative and sum to 1.
Advance one step by computing pi times P; the result is again a valid distribution because each row of P sums to 1, so probability is conserved.
Repeat to reach any horizon: pi times P times P, and so on — equivalently pi_0 times P^n for the n-step distribution.
To read the chance of a specific state at time n, pick out that entry of pi_n; to read i-to-j, instead read the (i, j) entry of P^n directly.

This vector view is also the cleanest way to see what is coming. For many well-behaved chains the sequence pi_0, pi_1, pi_2, ... stops moving: it settles toward a fixed row vector pi that satisfies pi P = pi, the [[limiting-distribution|limiting distribution]]. The hint already lives in our worked example — keep multiplying that weather matrix and the rain probability creeps toward a steady value no matter what the forecast was today. Guide 4 makes that convergence precise; for now, just notice that powering up P is exactly the act that reveals long-run behavior.

Honest cautions: what powering up does and does not tell you

A few traps worth naming. First, P^(n)(i, j) is the probability of being at j *at* time n, not the probability of ever visiting j by time n — those are different questions, and confusing them is a classic slip. Second, the entries of P^n are still genuine probabilities, but a single matrix entry is not a path: many distinct routes from i to j get added together inside it, and the equation deliberately throws away which route was taken. Third, do not expect P^n to converge for *every* chain. A chain that strictly alternates between two states has P^2 equal to the identity, so its powers oscillate forever and never settle — a phenomenon called periodicity, the subject of the next guide.

One last connection that pays off later. The reason a chain can forget its starting point and converge is that its states can talk to one another — if from any state you can eventually reach any other, the chain mixes its probability around freely. That property, packaged as [[communicating-classes|communicating classes]] and irreducibility, is what guarantees a single limiting distribution rather than several disconnected ones. Chapman-Kolmogorov supplies the n-step reachability that this classification is built on: state j is reachable from i precisely when P^(n)(i, j) is positive for some n. Multi-step transitions are thus not just a computational convenience — they are the very language in which the structure of a chain is described.