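Artificial Intelligence · 2016

Mastering the Game of Go with Deep Neural Networks and Tree Search

David Silver et al. (Google DeepMind)

First teach the machine an "intuition" for Go, then let it search. AlphaGo beat the champion.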

In depth · the introduction

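A machine first picked up a feel for Go from strong human players, then taught itself to read ahead. Then, in the game long considered the hardest for computers to crack, it beat a champion.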

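Breaking the idea down

Go is played by placing black and white stones on a 19×19 board. It looks simple, yet the number of possible games is staggering, larger than the number of atoms in the observable universe, so a computer cannot win by trying every move. For decades this kept Go far out of reach of machines, even after computers had mastered chess.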
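
To put a rough number on that comparison: a common back-of-the-envelope estimate counts move sequences as b^d, where b is the typical number of legal moves per turn and d is the typical game length. The sketch below assumes b ≈ 250 and d ≈ 150 for Go, and about 10^80 atoms in the observable universe. All three figures are rough, commonly quoted estimates brought in for illustration, not values stated on this page.

```python
import math

# Assumed rough figures (not taken from this page):
BRANCHING_FACTOR = 250      # typical number of legal moves per turn in Go
GAME_LENGTH = 150           # typical number of moves in a game
LOG10_ATOMS = 80            # observable universe: roughly 10**80 atoms


def log10_move_sequences(branching: int = BRANCHING_FACTOR,
                         depth: int = GAME_LENGTH) -> float:
    """Return log10(branching ** depth), the order of magnitude of b^d."""
    return depth * math.log10(branching)


def orders_of_magnitude_beyond_atoms() -> float:
    """How many powers of ten b^d exceeds the assumed atom count by."""
    return log10_move_sequences() - LOG10_ATOMS


if __name__ == "__main__":
    print(f"b^d is about 10^{log10_move_sequences():.0f}")
    print(f"that is about 10^{orders_of_magnitude_beyond_atoms():.0f} "
          f"times the assumed number of atoms")
```

Under these assumptions the count of move sequences is a number with 360 digits, which is why exhaustive enumeration is not an option.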

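AlphaGo's answer is to give the machine two kinds of judgment, the same two a strong player has. One is the policy network: it looks at the board and proposes a handful of moves worth considering, an intuition for where to play. The other is the value network: it looks at the board and estimates who is ahead, a judgment of how good the position is. With these two senses the program does not have to look at everything; it can search the way a person does, only much faster.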

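Where it came from

The work came out of Google DeepMind in London, led by David Silver and Aja Huang under Demis Hassabis, and was published in Nature in 2016 with twenty authors. AlphaGo made its debut by winning 99.8% of its games against other Go programs. Then, in October 2015, the team invited Fan Hui, the reigning European champion and a professional player, to play five formal games behind closed doors. AlphaGo won all five. It was the first time a computer had beaten a professional in full-board Go without a handicap, something experts had said was still a decade away.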

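Why it matters

Go had long stood as the symbol of the gap between human intuition and machine calculation. Beating a professional showed that one recipe, intuition learned by deep networks and then refined by careful search, could close that gap on a problem no one had cracked by brute force. The recipe soon spread well beyond board games, and the win came years earlier than the field expected, prompting a recalibration of how quickly such learning systems can improve.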

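A way to picture it

Imagine you are weighing a move with a wise coach at your shoulder. Out of hundreds of legal moves, the coach quietly points to three or four worth thinking about: that is the policy network, narrowing your options. Then, rather than playing each move all the way to the end of the game, the coach glances at the resulting board and says "this one favours you, that one does not": that is the value network, judging a position at a glance. So you spend your limited thinking time only where it matters. The widget below lets you turn up that "thinking time" and watch the search settle on the best move.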

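Where it fits

AlphaGo builds on two threads already represented in this collection. Its networks are deep convolutional networks, the kind that AlexNet (2012) pushed into the mainstream; its search idea descends from Monte Carlo tree search, developed in the 2000s. The year after this paper, DeepMind had the program drop human game records and learn from scratch by self-play alone (AlphaGo Zero), then extended the method to chess and shogi (AlphaZero). The pattern of first learning a good guess and then sharpening it with search now also shows up in protein folding, and even in the way today's reasoning models work through a problem step by step.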

The original document

Original source text

D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, D. Hassabis · Nature 529, 484–489 (28 January 2016)

Abstract

The game of Go has long been viewed as the most challenging of classic games for artificial intelligence owing to its enormous search space and the difficulty of evaluating board positions and moves.

Here we introduce a new approach to computer Go that uses 'value networks' to evaluate board positions and 'policy networks' to select moves.

The abstract goes on to explain that these deep neural networks are trained by a novel combination of supervised learning from human expert games and reinforcement learning from games of self-play, and that the networks alone — without any lookahead search — already play at the level of state-of-the-art Monte Carlo tree search programs.

The search algorithm

We also introduce a new search algorithm that combines Monte Carlo simulation with value and policy networks.

The paper details how each simulation descends the tree by selecting the move that maximizes the action value plus an exploration bonus weighted by the policy-network prior, evaluates the resulting leaf with the value network and a fast rollout, and backs the result up to update the visit counts and action values.
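
The three steps described above (select, evaluate, back up) can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it searches a single position one move deep, the names `Edge`, `select_move`, `evaluate_leaf`, `backup`, and `run_simulations` are hypothetical, the priors and leaf values are made-up toy numbers standing in for the networks and the rollout, and the exploration constant and the equal value/rollout mix are assumed settings. A full two-player tree would also need to track whose turn it is when backing values up.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Edge:
    """Statistics stored for one move out of a position."""
    prior: float               # P(s, a): policy-network prior for the move
    visits: int = 0            # N(s, a): how often simulations took the move
    total_value: float = 0.0   # W(s, a): sum of backed-up leaf evaluations

    @property
    def q(self) -> float:
        """Q(s, a): mean backed-up value, 0 for an unvisited move."""
        return self.total_value / self.visits if self.visits else 0.0


def select_move(edges: Dict[str, Edge], c_puct: float = 1.0) -> str:
    """Pick the move maximizing Q plus a prior-weighted exploration bonus."""
    parent_visits = sum(e.visits for e in edges.values())

    def score(move: str) -> float:
        e = edges[move]
        bonus = c_puct * e.prior * math.sqrt(parent_visits) / (1 + e.visits)
        return e.q + bonus

    return max(edges, key=score)


def evaluate_leaf(value_estimate: float, rollout_outcome: float,
                  mix: float = 0.5) -> float:
    """Blend the value-network estimate with a fast-rollout outcome."""
    return (1 - mix) * value_estimate + mix * rollout_outcome


def backup(path: List[Edge], leaf_value: float) -> None:
    """Update visit counts and value sums along the traversed path."""
    for edge in path:
        edge.visits += 1
        edge.total_value += leaf_value


# Toy stand-ins (assumed numbers) for the networks and the rollout.
TOY_PRIOR = {"a": 0.6, "b": 0.3, "c": 0.1}       # policy network output
TOY_VALUE = {"a": -0.4, "b": 0.0, "c": -0.8}     # value network output
TOY_ROLLOUT = {"a": -1.0, "b": 1.0, "c": 1.0}    # rollout result (-1 or +1)


def run_simulations(n: int = 200) -> Dict[str, Edge]:
    """Run n one-ply simulations from a single toy position."""
    edges = {move: Edge(prior=p) for move, p in TOY_PRIOR.items()}
    for _ in range(n):
        move = select_move(edges)
        leaf = evaluate_leaf(TOY_VALUE[move], TOY_ROLLOUT[move])
        backup([edges[move]], leaf)
    return edges


def most_visited(edges: Dict[str, Edge]) -> str:
    """Final choice in this sketch: the move with the highest visit count."""
    return max(edges, key=lambda m: edges[m].visits)
```

Calling `most_visited(run_simulations())` on this toy position returns "b". The prior favours "a", but the backed-up evaluations pull the visit counts toward the move that evaluates best, which is the sense in which the search refines the network's first guess.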

[ … ]

Results

Using this search algorithm, our program AlphaGo achieved a 99.8% winning rate against other Go programs, and defeated the human European Go champion by 5 games to 0.

This is the first time that a computer program has defeated a human professional player in the full-sized game of Go, a feat previously thought to be at least a decade away.

The full paper, with its network architectures, the Elo-rating tournament tables, the analysis of the distributed search across CPUs and GPUs, and the record of the match against Fan Hui, runs the length of a Nature article and is available in full at the source below.

Google DeepMind · London · 2016