演算法 1957

動態規劃

理查德·貝爾曼

把一個龐大的決策，拆成一個反覆追問的問題——從此處出發，我能做到的最好是什麼？

Choose your version

In depth · the introduction

一個有著千百萬種走法組合的糾纏計劃，可以坍縮成一個反覆追問的簡單問題：「從此處出發，我能做到的最好是什麼？」

核心想法

許多問題，其實是一連串的決策——每個選擇都賺得或付出一些什麼、並改變你的處境，而你想要的，是整條鏈子上累計的最好結果。把每一種可能的序列都檢查一遍是沒指望的；其數目會爆炸。貝爾曼的洞見，是別再去想整條序列，而開始去想「狀態」。

給每一種處境一個數——它的價值，也就是你從那裡出發仍能取得的最大總回報。那麼最佳的第一步，無非就是那個「即時收益，加上它把你送到之處的價值」最大的一步。這條自我指涉的規則，就是貝爾曼方程；把它解上一次，你就同時得到了在每一種處境下的最佳決策。

它是如何誕生的

1950 年代初，貝爾曼在蘭德公司工作，正置身冷戰運籌學的核心——如何在時間中調度、補貨、分配並控制各類系統。他注意到，這一大類「多階段」問題共享著同一種結構，而這種結構是可以被利用、而非被對抗的。

他甚至給它取名時都格外用心。據他本人講，資助蘭德的國防部長厭惡「research（研究）」這個詞，於是貝爾曼挑了「dynamic programming（動態規劃）」——一聽就很務實、叫人難以挑刺——好讓他的數學得以保全。這個聽上去人畜無害的標籤，藏著二十世紀最強有力的想法之一。

它為何重要

動態規劃把那些「暴力窮舉要花掉比宇宙年齡還長時間」的問題，變成了計算機一眨眼就能算完的問題——靠的是從不把同一個子問題解第二遍。它成了經濟學、工程、物流與生物學的標準工具；而在幾十年後，又成了機器如何學會博弈、規劃與駕駛的數學核心。

一個可以想像的畫面

想像一張地圖，每座城鎮都立著一塊路牌：「從這裡到終點的最短時間」。你根本不必去想像整條路線——在每座城鎮，你只要走那條「自身耗時，加上下一座城鎮路牌上的數字」最小的路。動態規劃，就是從終點倒著把所有這些路牌一一填好的方法。而貝爾曼方程，正是每塊路牌所遵循的那條規則。

它的位置

貝爾曼的遞迴，是 Dijkstra 最短路徑演算法（亦在本館，1959）的表親，也是物理學中哈密頓–雅可比方程的離散孿生；幾乎在同一時刻，蘇聯的龐特里亞金為連續控制得出了一個平行的結果。它最大的現代後裔，是強化學習——那些攻克了圍棋與 Atari 的深度系統（Silver 2016；Mnih 2013），骨子裡，就是用試錯與一個神經網絡去求解的貝爾曼方程。

The original document

Original source text

Richard Bellman · Dynamic Programming · Princeton University Press, 1957 (developed from his address to the American Mathematical Society, Laramie, 2 September 1954)

The setting: multi-stage decision processes

Bellman's subject is a system that moves through a sequence of stages. At each stage it sits in some state; a decision is taken; a reward (or cost) is earned and the system passes to a new state. The aim is to choose the whole sequence of decisions — a policy — so as to maximise the total return. Enumerating every possible sequence is combinatorially hopeless, so a different handle is needed.

The Principle of Optimality

The handle is a single observation about the structure of any optimal plan, which Bellman states as a principle:

An optimal policy has the property that whatever the initial state and initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.

The functional equation

The principle turns the search into a recursion on the value function — the best total return obtainable from a given state. The value of a state equals the best, over all available decisions, of the immediate reward plus the (discounted) value of the state the decision leads to. This self-referential relation is the Bellman equation; solving it once yields the optimal decision in every state at the same time.

The curse of dimensionality

Bellman named the method's own limitation. Because the number of states grows exponentially with the number of state variables, tabulating and sweeping the value function becomes intractable for high-dimensional or continuous problems — the obstacle he christened the “curse of dimensionality.”

Richard Bellman · The RAND Corporation, Santa Monica