算法 1957

动态规划

理查德·贝尔曼

把一个庞大的决策，拆成一个反复追问的问题——从此处出发，我能做到的最好是什么？

Choose your version

In depth · the introduction

一个有着千百万种走法组合的纠缠计划，可以坍缩成一个反复追问的简单问题：「从此处出发，我能做到的最好是什么？」

核心想法

许多问题，其实是一连串的决策——每个选择都赚得或付出一些什么、并改变你的处境，而你想要的，是整条链子上累计的最好结果。把每一种可能的序列都检查一遍是没指望的；其数目会爆炸。贝尔曼的洞见，是别再去想整条序列，而开始去想「状态」。

给每一种处境一个数——它的价值，也就是你从那里出发仍能取得的最大总回报。那么最佳的第一步，无非就是那个「即时收益，加上它把你送到之处的价值」最大的一步。这条自我指涉的规则，就是贝尔曼方程；把它解上一次，你就同时得到了在每一种处境下的最佳决策。

它是如何诞生的

1950 年代初，贝尔曼在兰德公司工作，正置身冷战运筹学的核心——如何在时间中调度、补货、分配并控制各类系统。他注意到，这一大类「多阶段」问题共享着同一种结构，而这种结构是可以被利用、而非被对抗的。

他甚至给它取名时都格外用心。据他本人讲，资助兰德的国防部长厌恶「research（研究）」这个词，于是贝尔曼挑了「dynamic programming（动态规划）」——一听就很务实、叫人难以挑刺——好让他的数学得以保全。这个听上去人畜无害的标签，藏着二十世纪最强有力的想法之一。

它为何重要

动态规划把那些「暴力穷举要花掉比宇宙年龄还长时间」的问题，变成了计算机一眨眼就能算完的问题——靠的是从不把同一个子问题解第二遍。它成了经济学、工程、物流与生物学的标准工具；而在几十年后，又成了机器如何学会博弈、规划与驾驶的数学核心。

一个可以想象的画面

想象一张地图，每座城镇都立着一块路牌：「从这里到终点的最短时间」。你根本不必去想象整条路线——在每座城镇，你只要走那条「自身耗时，加上下一座城镇路牌上的数字」最小的路。动态规划，就是从终点倒着把所有这些路牌一一填好的方法。而贝尔曼方程，正是每块路牌所遵循的那条规则。

它的位置

贝尔曼的递归，是 Dijkstra 最短路径算法（亦在本馆，1959）的表亲，也是物理学中哈密顿–雅可比方程的离散孪生；几乎在同一时刻，苏联的庞特里亚金为连续控制得出了一个平行的结果。它最大的现代后裔，是强化学习——那些攻克了围棋与 Atari 的深度系统（Silver 2016；Mnih 2013），骨子里，就是用试错与一个神经网络去求解的贝尔曼方程。

The original document

Original source text

Richard Bellman · Dynamic Programming · Princeton University Press, 1957 (developed from his address to the American Mathematical Society, Laramie, 2 September 1954)

The setting: multi-stage decision processes

Bellman's subject is a system that moves through a sequence of stages. At each stage it sits in some state; a decision is taken; a reward (or cost) is earned and the system passes to a new state. The aim is to choose the whole sequence of decisions — a policy — so as to maximise the total return. Enumerating every possible sequence is combinatorially hopeless, so a different handle is needed.

The Principle of Optimality

The handle is a single observation about the structure of any optimal plan, which Bellman states as a principle:

An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

The functional equation

The principle turns the search into a recursion on the value function — the best total return obtainable from a given state. The value of a state equals the best, over all available decisions, of the immediate reward plus the (discounted) value of the state the decision leads to. This self-referential relation is the Bellman equation; solving it once yields the optimal decision in every state at the same time.

The curse of dimensionality

Bellman named the method's own limitation. Because the number of states grows exponentially with the number of state variables, tabulating and sweeping the value function becomes intractable for high-dimensional or continuous problems — the obstacle he christened the “curse of dimensionality.”

Richard Bellman · The RAND Corporation, Santa Monica