Dynamic Programming
Break a giant decision into one repeated question — what is the best I can do from here?
A tangled plan with millions of possible move-sequences can collapse into one simple question, asked over and over: “What's the best I can do from here?”
The big idea
Many problems are really chains of decisions — each choice earns or costs something and changes your situation, and you want the best total outcome over the whole chain. Checking every possible sequence is hopeless; the count explodes. Bellman's insight was to stop thinking about whole sequences and start thinking about states.
Give every situation a single number — its value, the best total reward you can still get from there. Then the best first move is just the one whose immediate payoff, plus the value of wherever it lands you, is largest. That self-referential rule is the Bellman equation, and solving it once gives you the best decision in every situation at the same time.
How it came about
Bellman worked at the RAND Corporation in the early 1950s, deep in Cold War operations research — how to schedule, stock, allocate and control systems over time. He noticed that a whole class of these “multi-stage” problems shared one structure, and that the structure could be exploited rather than fought.
He even chose the name carefully. By his own telling, the Secretary of Defense who funded RAND loathed the word “research,” so Bellman picked “dynamic programming” — businesslike, hard to object to — to keep his mathematics safe. The harmless-sounding label hid one of the century's most powerful ideas.
Why it mattered
Dynamic programming turned problems that would take longer than the age of the universe to brute-force into ones a computer finishes in a blink — by never solving the same sub-problem twice. It became a standard tool of economics, engineering, logistics and biology, and, decades later, the mathematical heart of how machines learn to play, plan and drive.
A way to picture it
Imagine a map where every town carries a signpost: “shortest time from here to the destination.” You never have to imagine whole routes — at each town you just take the road whose own travel time, plus the number on the next town's signpost, is smallest. Dynamic programming is the method that fills in all those signposts, working backward from the destination. The Bellman equation is exactly the rule each signpost obeys.
Where it sits
Bellman's recursion is a cousin of Dijkstra's shortest-path algorithm (also in this Library, 1959) and a discrete twin of the Hamilton–Jacobi equations of physics; in the Soviet Union, Pontryagin reached a parallel result for continuous control at almost the same moment. Its biggest modern descendant is reinforcement learning — the deep systems that mastered Go and Atari (Silver 2016; Mnih 2013) are, underneath, Bellman's equation solved by trial, error and a neural network.
An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.