Markov Decision Processes

From a loop to a problem statement

In the previous guide you met the agent-environment loop: observe, act, get a consequence, observe again, forever. That loop is vivid, but it is not yet something you can compute over. It tells you the *rhythm* of reinforcement learning without saying what, precisely, the agent is choosing between or what 'good' even means. The Markov decision process (MDP) is the bridge from that intuitive loop to a formal problem statement — the thing every RL algorithm in this rung actually operates on.

Think of an MDP as the rulebook of a board game written in mathematics. Chess has a board, legal moves, a rule for what the board looks like after each move, and a way to win. An MDP captures exactly that structure for *any* decision problem under uncertainty. Once a robot vacuum, a chess engine, or a portfolio manager is written as an MDP, they all become the same kind of object — and the same algorithms can chew on them. That generality is the whole reason the framing matters.

The five ingredients

An MDP pins down five things. States are the situations the agent can be in — a chessboard arrangement, a robot's position and sensor readings, the squares on a Snakes-and-Ladders board. Actions are the moves available in each state — turn left, play knight-to-f3, buy or sell. Transitions are the rule for which state comes next after an action, and here is the first subtlety: that rule can be *random*. Roll a die and you might land anywhere among six squares. The transition isn't 'state B follows' but 'state B with probability one-sixth.'

The fourth ingredient is the [[reward|reward]]: a single number the environment hands back after each move, saying how good that move was *right then*. Plus one for cleaning a fresh patch of floor, minus one for bumping a wall. The reward is not the same as winning — it is the local, immediate score, and the agent's job is to figure out how those little numbers add up over a long game. The fifth ingredient, the [[discount-factor|discount factor]], governs exactly that adding-up, and it deserves its own section.

The Markov property: why only 'now' matters

The 'Markov' in Markov decision process names one specific, load-bearing assumption: the next state depends only on the *current* state and action — not on the entire history of how you arrived there. In chess this is literally true. The current arrangement of pieces tells you everything you need to choose your move; it makes no difference in what order the pieces got there. The present is a complete summary of the past.

This assumption is what makes the math tractable. Because the agent only ever has to reason about 'where am I now,' not 'everything that has ever happened,' the problem stays a manageable size and the value-and-Bellman machinery you will meet in the next guides clicks into place. Drop the Markov property and the elegant recursion at the heart of RL falls apart.

Return: the long game the agent is actually playing

A reward is a single step's score. But the agent doesn't care about one step — it cares about the whole trajectory. The total it is trying to maximize is the [[return-rl|return]]: the sum of all the rewards it collects from now until the end. This is why a good agent will happily take a *negative* reward now — sacrificing a piece, spending money, taking the long way around — if it leads to far more reward later. Optimizing the return, not the immediate reward, is what separates planning from greed.

But there is a problem with 'sum of all rewards.' In a task that can run forever, that sum can blow up to infinity, and infinity is useless for comparing strategies — every strategy that never dies scores the same. The discount factor, usually written gamma (γ), fixes this. A reward one step away counts in full; a reward two steps away is multiplied by γ; three steps away, by γ times γ; and so on, shrinking with every step of delay. Since γ sits between 0 and 1, even an endless stream of rewards adds up to a finite, comparable number.

return = r0 + γ·r1 + γ²·r2 + γ³·r3 + ...

 γ = 0.9 , reward +100 ten steps away
 -> 100 · 0.9¹⁰  ≈  35  (impatient)

 γ = 0.99, same +100 ten steps away
 -> 100 · 0.99¹⁰ ≈  90  (far-sighted)

The discounted return. The same prize and delay, two very different appetites for waiting — set by one dial, γ.

So γ is more than a numerical safety valve — it sets the agent's *personality*. Near 0, the agent is shortsighted, grabbing whatever pays off immediately. Near 1 (say 0.99), it is patient, willing to forgo a small reward now for a bigger one many steps later. And it is a knob *you* choose, not something the world tells you: set it too low and the agent can't 'see' a distant goal well enough to ever learn to reach it; set it too high and the far-off, uncertain sums make learning noisier and slower to settle.

Putting it together, and what the framing buys you

Let's assemble a tiny, complete MDP — Snakes and Ladders — to see the five pieces snap together. It is about as clean as MDPs get, because there are no real decisions to make; it isolates the structure itself.

State: the square your token sits on. Where you've been doesn't matter — only where you are. (That is the Markov property, made concrete.)
Action: roll the die. (There's only one possible action — which is why no skill is involved.)
Transition: the die lands 1–6 (each with probability one-sixth), you advance, then any snake slides you down or ladder lifts you up. Random, but fully specified.
Reward: +1 the moment you reach the final square, 0 everywhere else.
Discount: γ close to 1, since reaching the end a few turns sooner is only mildly better than reaching it later.

With the problem written this way, a precise question becomes askable: starting from this square, what is the best return I can expect? That number — attached to every state — is the value function, the subject of the next guide, and the recursive relationship that links each state's value to its neighbors' is the Bellman equation. None of that machinery is available until the problem is an MDP. The framing isn't bureaucracy; it is what makes 'learn by trial and error' computable at all.