Algorithms 1957

Dynamic Programming

Richard Bellman

Break a giant decision into one repeated question — what is the best I can do from here?

Choose your version

In depth · the introduction

A tangled plan with millions of possible move-sequences can collapse into one simple question, asked over and over: “What's the best I can do from here?”

The big idea

Many problems are really chains of decisions — each choice earns or costs something and changes your situation, and you want the best total outcome over the whole chain. Checking every possible sequence is hopeless; the count explodes. Bellman's insight was to stop thinking about whole sequences and start thinking about states.

Give every situation a single number — its value, the best total reward you can still get from there. Then the best first move is just the one whose immediate payoff, plus the value of wherever it lands you, is largest. That self-referential rule is the Bellman equation, and solving it once gives you the best decision in every situation at the same time.

How it came about

Bellman worked at the RAND Corporation in the early 1950s, deep in Cold War operations research — how to schedule, stock, allocate and control systems over time. He noticed that a whole class of these “multi-stage” problems shared one structure, and that the structure could be exploited rather than fought.

He even chose the name carefully. By his own telling, the Secretary of Defense who funded RAND loathed the word “research,” so Bellman picked “dynamic programming” — businesslike, hard to object to — to keep his mathematics safe. The harmless-sounding label hid one of the century's most powerful ideas.

Why it mattered

Dynamic programming turned problems that would take longer than the age of the universe to brute-force into ones a computer finishes in a blink — by never solving the same sub-problem twice. It became a standard tool of economics, engineering, logistics and biology, and, decades later, the mathematical heart of how machines learn to play, plan and drive.

A way to picture it

Imagine a map where every town carries a signpost: “shortest time from here to the destination.” You never have to imagine whole routes — at each town you just take the road whose own travel time, plus the number on the next town's signpost, is smallest. Dynamic programming is the method that fills in all those signposts, working backward from the destination. The Bellman equation is exactly the rule each signpost obeys.

Where it sits

Bellman's recursion is a cousin of Dijkstra's shortest-path algorithm (also in this Library, 1959) and a discrete twin of the Hamilton–Jacobi equations of physics; in the Soviet Union, Pontryagin reached a parallel result for continuous control at almost the same moment. Its biggest modern descendant is reinforcement learning — the deep systems that mastered Go and Atari (Silver 2016; Mnih 2013) are, underneath, Bellman's equation solved by trial, error and a neural network.

The original document

Original source text

Richard Bellman · Dynamic Programming · Princeton University Press, 1957 (developed from his address to the American Mathematical Society, Laramie, 2 September 1954)

The setting: multi-stage decision processes

Bellman's subject is a system that moves through a sequence of stages. At each stage it sits in some state; a decision is taken; a reward (or cost) is earned and the system passes to a new state. The aim is to choose the whole sequence of decisions — a policy — so as to maximise the total return. Enumerating every possible sequence is combinatorially hopeless, so a different handle is needed.

The Principle of Optimality

The handle is a single observation about the structure of any optimal plan, which Bellman states as a principle:

An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

The functional equation

The principle turns the search into a recursion on the value function — the best total return obtainable from a given state. The value of a state equals the best, over all available decisions, of the immediate reward plus the (discounted) value of the state the decision leads to. This self-referential relation is the Bellman equation; solving it once yields the optimal decision in every state at the same time.

The curse of dimensionality

Bellman named the method's own limitation. Because the number of states grows exponentially with the number of state variables, tabulating and sweeping the value function becomes intractable for high-dimensional or continuous problems — the obstacle he christened the “curse of dimensionality.”

Richard Bellman · The RAND Corporation, Santa Monica