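Artificial Intelligence 2016

Mastering the Game of Go with Deep Neural Networks and Tree Search

David Silver et al. (Google DeepMind)

First teach a machine the "intuition" of Go, then let it search. AlphaGo beat a champion.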

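A machine first learned the "feel" of Go from strong human players, then taught itself to read ahead. Then, in the game long considered the hardest for computers to crack, it beat a champion.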

Breaking the idea down

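Go is played by placing black and white stones on a 19×19 board. It looks simple, yet the number of possible games is staggering, larger than the number of atoms in the observable universe, so a computer cannot win by trying every move. For decades this kept Go far out of machines' reach, even after computers had long since mastered chess.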
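To get a feel for the scale, here is a rough back-of-the-envelope check. It is only a sketch under stated assumptions: it counts an upper bound on board configurations (each point empty, black, or white), not the number of games, and the atom count is the commonly quoted order-of-magnitude estimate rather than a figure from the paper.

```python
# Rough size comparison under stated assumptions (not figures from the paper).
BOARD_POINTS = 19 * 19  # 361 intersections on a full-size board

# Upper bound on board configurations: each point is empty, black, or white.
# Many of these are illegal positions, and the number of possible games
# (sequences of moves) is far larger still.
config_upper_bound = 3 ** BOARD_POINTS

# Commonly quoted order-of-magnitude estimate (an assumption here).
ATOMS_ESTIMATE = 10 ** 80

digits = len(str(config_upper_bound))
print(f"3^361 has {digits} decimal digits")
print("exceeds the atom estimate:", config_upper_bound > ATOMS_ESTIMATE)
```

Even this loose bound on single positions dwarfs the atom estimate, which is why enumerating whole games is out of the question.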

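AlphaGo's answer is to give the machine two kinds of judgment, the same two a strong player has. One is the policy network: it looks at the board and proposes a handful of moves worth considering. This is intuition about where to play. The other is the value network: it looks at the board and estimates who is ahead. This is judgment about how good a position is. With both, the program no longer has to examine everything; it can search the way a person does, only far faster.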
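The division of labor can be sketched in a few lines. This is a deliberately simplified, one-move-deep illustration, not AlphaGo's actual procedure: the raw scores and win estimates are made-up numbers standing in for network outputs, the function names are hypothetical, and the hard top-k cutoff is an assumption for clarity (the real search uses the policy output as a soft prior inside a tree search).

```python
import math

def softmax(scores):
    """Turn raw move scores into probabilities that sum to 1."""
    peak = max(scores.values())
    exps = {move: math.exp(s - peak) for move, s in scores.items()}
    total = sum(exps.values())
    return {move: e / total for move, e in exps.items()}

def top_candidates(policy_scores, k=3):
    """'Where to play': keep only the k moves the policy rates highest."""
    probs = softmax(policy_scores)
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k]

def pick_move(policy_scores, value_after_move, k=3):
    """'How good is it': among the candidates, take the best-valued result."""
    candidates = top_candidates(policy_scores, k)
    best_move, _ = max(candidates, key=lambda kv: value_after_move[kv[0]])
    return best_move

# Hypothetical stand-ins for network outputs on one position.
policy_scores = {"D4": 2.0, "Q16": 1.5, "K10": 0.5, "A1": -3.0, "T19": -3.0}
# Hypothetical estimated win probability after each move.
value_after_move = {"D4": 0.52, "Q16": 0.58, "K10": 0.49, "A1": 0.20, "T19": 0.10}

print(top_candidates(policy_scores))
print(pick_move(policy_scores, value_after_move))
```

The policy step cuts five options down to three, and the value step ranks those three without playing any of them out to the end of the game.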

Where it came from

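The work came out of Google DeepMind in London, led by David Silver and Aja Huang under Demis Hassabis, and was published in Nature in 2016 with twenty authors. AlphaGo made its debut by winning 99.8% of its games against other Go programs. Then, in October 2015, the team invited Fan Hui, the reigning European champion and a professional player, to play five formal games behind closed doors. AlphaGo won all five. It was the first time a computer had beaten a professional at full-board Go without a handicap, something experts had said was still a decade away.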

Why it matters

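Go had long stood as a symbol of the gap between human intuition and machine calculation. Beating a professional showed that one recipe, learning intuition with deep networks and then refining it with careful search, could close that gap on a problem no one had cracked by brute force. The recipe soon spread well beyond board games, and because the win arrived years earlier than the field expected, it forced a recalibration of how fast such learning systems can improve.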

A way to picture it

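Imagine you are weighing a move with a wise coach at your shoulder. Out of hundreds of legal moves, the coach quietly points to three or four worth thinking through. That is the policy network, narrowing your options. Then, instead of playing each move out to the end of the game, the coach glances at the resulting board and says "this one favors you, that one does not." That is the value network, judging a position at a glance. You spend your limited thinking time only where it counts. The widget below lets you turn up that thinking time and watch the search settle on the best move.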

Where it sits

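AlphaGo builds on two threads already represented in this gallery. Its networks are deep convolutional networks, the kind that AlexNet (2012) pushed into the mainstream; its search idea descends from Monte Carlo tree search, developed in the 2000s. A year after this paper, DeepMind had the program drop human game records and learn from scratch through self-play alone (AlphaGo Zero), then extended the method to chess and shogi (AlphaZero). The pattern, first learn a good guess and then sharpen it with search, now also appears in protein folding, and even in the way today's reasoning models work through a problem step by step.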

The original document

Original source text

D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, D. Hassabis · Nature 529, 484–489 (28 January 2016)

Abstract

The game of Go has long been viewed as the most challenging of classic games for artificial intelligence owing to its enormous search space and the difficulty of evaluating board positions and moves.

Here we introduce a new approach to computer Go that uses 'value networks' to evaluate board positions and 'policy networks' to select moves.

The abstract goes on to explain that these deep neural networks are trained by a novel combination of supervised learning from human expert games and reinforcement learning from games of self-play, and that the networks alone, without any lookahead search, already play at the level of state-of-the-art Monte Carlo tree search programs.

The search algorithm

We also introduce a new search algorithm that combines Monte Carlo simulation with value and policy networks.

The paper details how each simulation descends the tree by selecting the move that maximizes the action value plus an exploration bonus weighted by the policy-network prior, evaluates the resulting leaf with the value network and a fast rollout, and backs the result up to update the visit counts and action values.
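The select, evaluate, and back-up loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it searches only one level below the root, the priors and leaf evaluations are made-up constants standing in for network and rollout outputs, the names (`Edge`, `select_move`, `run_root_simulations`) are hypothetical, and `c_puct` and `mix` are illustrative values. The `max(1, ...)` guard is an addition for this sketch so that the very first choice follows the prior.

```python
import math
from dataclasses import dataclass

@dataclass
class Edge:
    prior: float              # policy-network prior for this move
    visits: int = 0           # how many simulations passed through
    total_value: float = 0.0  # sum of backed-up leaf values

    @property
    def q(self):
        """Mean action value; 0 for an unvisited move."""
        return self.total_value / self.visits if self.visits else 0.0

def select_move(edges, c_puct=1.0):
    """Pick the move maximizing action value plus a prior-weighted bonus."""
    total_visits = sum(e.visits for e in edges.values())
    def score(edge):
        bonus = c_puct * edge.prior * math.sqrt(max(1, total_visits)) / (1 + edge.visits)
        return edge.q + bonus
    return max(edges, key=lambda move: score(edges[move]))

def evaluate_leaf(value_estimate, rollout_outcome, mix=0.5):
    """Blend the value-network estimate with a fast rollout result."""
    return (1 - mix) * value_estimate + mix * rollout_outcome

def backup(path, leaf_value):
    """Update visit counts and value sums along the traversed edges.
    A full two-player tree would also track whose perspective the value is from."""
    for edge in path:
        edge.visits += 1
        edge.total_value += leaf_value

def run_root_simulations(priors, leaf_values, n_sim=200, c_puct=1.0):
    edges = {move: Edge(prior=p) for move, p in priors.items()}
    for _ in range(n_sim):
        move = select_move(edges, c_puct)
        leaf = leaf_values[move]
        value = evaluate_leaf(leaf["value"], leaf["rollout"])
        backup([edges[move]], value)
    return edges

# Hypothetical inputs: the prior favors A, but B evaluates better.
priors = {"A": 0.6, "B": 0.3, "C": 0.1}
leaf_values = {
    "A": {"value": 0.2, "rollout": 0.0},
    "B": {"value": 0.7, "rollout": 1.0},
    "C": {"value": -0.4, "rollout": -1.0},
}

edges = run_root_simulations(priors, leaf_values)
best = max(edges, key=lambda move: edges[move].visits)
print({move: e.visits for move, e in edges.items()}, "->", best)
```

In this toy run the prior sends the first simulations to A, but as evaluations accumulate the visits shift to B. Choosing the most-visited move at the end is a common convention for this family of searches and is used here as an assumption.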

[ … ]

Results

Using this search algorithm, our program AlphaGo achieved a 99.8% winning rate against other Go programs, and defeated the human European Go champion by 5 games to 0.

This is the first time that a computer program has defeated a human professional player in the full-sized game of Go, a feat previously thought to be at least a decade away.

The full paper, with its network architectures, the Elo rating tournament tables, the analysis of the distributed search across CPUs and GPUs, and the record of the match against Fan Hui, runs the length of a Nature article and is available in full at the source below.

Google DeepMind · London · 2016