A process that travels light
You arrive at this rung already fluent in a stochastic process — a whole family of random variables X_0, X_1, X_2, ... indexed by time — and you have met the random walk, where each step is independent of every step before it. A Markov chain sits in between two extremes. At one extreme, full independence: the next value ignores everything, including where you are now. At the other extreme, full memory: the next value can depend on the entire recorded history X_0, X_1, ..., X_n. The Markov chain keeps exactly one shred of memory — the current position — and throws away the rest. It travels light.
Concretely, picture a frog on five lily pads. At each tick of the clock it hops to a neighbouring pad according to some fixed odds. To predict where the frog goes next, does it matter how it arrived at its current pad — whether it came from the left or has been circling for an hour? Under the Markov assumption, no. Knowing the frog is on pad 3 right now tells you everything the past could have told you about its next hop. The history is not deleted from existence; it is simply *irrelevant* once the present is known.
The set of places the process can be — the five pads, or the integers, or a list of weather conditions — is its [[state-space|state space]]. In this rung we work with a discrete state space (a finite or countable list of states) and discrete time (clock ticks n = 0, 1, 2, ...). That combination is exactly the [[discrete-time-markov-chain|discrete-time Markov chain]], and it is the workhorse the whole rung is built around.
The Markov property, said precisely
The intuition "the future depends only on the present" deserves an exact statement, because casual phrasings hide a trap. The [[markov-property|Markov property]] says: the conditional distribution of the next state, given the *entire* past, equals the conditional distribution given only the *current* state. In symbols, for any states i_0, ..., i_(n-1), i, j, P(X_(n+1) = j given X_n = i, X_(n-1) = i_(n-1), ..., X_0 = i_0) = P(X_(n+1) = j given X_n = i). Every term to the left of X_n = i is dropped — not because it has no value, but because conditioning on the present already absorbs it.
There is a second, easy-to-overlook assumption baked into the chains of this rung: the transition odds do not change over time. P(X_(n+1) = j given X_n = i) is the same number whether n is 3 or 3000. A chain with this property is called time-homogeneous, and it lets us name that number once and for all: p_ij, the probability of going from state i to state j in one step. Homogeneity is a genuine assumption with real exceptions — a frog that tires as the day wears on would violate it — so it is worth stating out loud rather than smuggling in.
Packing the rules into a matrix
Here is the move that makes Markov chains so pleasant to compute with. If the state space has N states, collect every one-step probability p_ij into an N-by-N grid. Row i lists where you might go starting from state i; the entry in row i, column j is p_ij. This grid is the [[transition-matrix|transition matrix]], usually written P. The entire dynamic law of the chain — every hop the frog could ever make — now lives in one tidy object.
States: Sun Cloud Rain Row i must sum to 1
Sun Cloud Rain
Sun [ 0.7 0.2 0.1 ] <- from a Sunny day
Cloud[ 0.3 0.4 0.3 ] <- from a Cloudy day
Rain [ 0.2 0.5 0.3 ] <- from a Rainy day
row sums: 1.0 1.0 1.0 (every row is a distribution)Two facts about P are not optional, they are forced by the axioms of probability. First, every entry is between 0 and 1, because each p_ij is a probability. Second — and this is the rule people forget — each row sums to exactly 1. Row i is the full list of where you can be after one step starting from i, and the chain must land *somewhere*, so those probabilities are exhaustive and add to one. A square matrix of non-negative entries whose rows each sum to 1 is called a stochastic matrix; transition matrices and stochastic matrices are the same thing. (Columns generally do not sum to 1, and expecting them to is a classic beginner slip.)
From present to future: a vector meets the matrix
The matrix on its own tells you how one state leads to the next. To actually predict the future you also need a starting point — but in probability the start is rarely a single certain state. It is a distribution: the [[initial-distribution|initial distribution]], a row vector mu_0 = (mu_0(1), mu_0(2), ..., mu_0(N)) giving the probability the chain begins in each state. If you are sure it starts in state 2, then mu_0 = (0, 1, 0, ...); if it starts uniformly at random, every entry is 1/N.
Now the magic. The distribution of the state one step later is obtained by the matrix product mu_1 = mu_0 P (row vector times matrix). Why does this work? It is nothing but the law of total probability in disguise: the chance of being in state j tomorrow is the sum, over every possible state i today, of (chance you are in i today) times (chance i hops to j). That sum, over i, of mu_0(i) p_ij is precisely the j-th entry of the product mu_0 P. Prediction has become matrix multiplication.
- Write today's belief as a row vector mu (its entries are probabilities, so they sum to 1).
- Multiply on the right by the transition matrix: mu_next = mu P. The result is again a valid distribution (its entries still sum to 1).
- Repeat to march forward in time: two steps is mu P P = mu P^2, and n steps is mu P^n.
- Sanity-check at every stage that the vector's entries are non-negative and sum to 1; if they do not, a row of P probably did not sum to 1.
Worked tiny example with the weather matrix above. Suppose today is certainly Sunny, so mu_0 = (1, 0, 0). Then mu_1 = mu_0 P just reads off the first row: (0.7, 0.2, 0.1) — a 70% chance of Sun tomorrow. For the day after, mu_2 = mu_1 P. Its Sun-entry is 0.7(0.7) + 0.2(0.3) + 0.1(0.2) = 0.49 + 0.06 + 0.02 = 0.57. Notice the certainty has already begun to blur: from a guaranteed Sunny start, two days out the chance of Sun has relaxed from 1 to 0.57, drifting toward a long-run balance you will study in guide 4.
What the chain captures — and what to watch for
Why is this one assumption worth a whole rung? Because an astonishing range of real systems are honestly memoryless once the state is chosen well: a board-game token, a shuffled deck approached one riffle at a time, the link a web surfer clicks (the idea behind PageRank), the queue length at a service counter, a gene drifting between alleles, the sequence of letters a hidden Markov model tries to decode. In each case the present state is a *sufficient summary* of the past for the purpose of predicting the future. Finding such a state is the modelling art; once found, the matrix machinery does the rest.
Three threads to carry into the rest of the rung. The matrix P holds the one-step law; raising it to a power, P^n, will give the n-step law and lead straight into the Chapman-Kolmogorov equations of guide 2. The structure of P — which states can reach which — will let us classify states as recurrent or transient in guide 3. And the long-run behaviour, where mu P^n settles toward a fixed stationary distribution regardless of where it started (you already saw the blurring begin with the weather), is the destination of guide 4. Everything downstream is a question about one stochastic matrix.