Artificial Intelligence 2016

Mastering the Game of Go with Deep Neural Networks and Tree Search

David Silver et al. (Google DeepMind)

Teach a machine the intuition of Go — then let it search. AlphaGo beats a champion.

Choose your version

In depth · the introduction

A machine learned the feel of Go from human masters, then learned to look ahead on its own — and beat a champion at the game thought hardest for computers to crack.

The idea, unpacked

Go is a game of placing black and white stones on a 19×19 grid. It looks simple, but the number of possible games is astronomical — more than the atoms in the observable universe — so a computer cannot win by simply trying every move. For decades that left Go far beyond machines, even after computers had mastered chess.

AlphaGo's answer was to give the machine two kinds of judgement, the way a strong human player has them. One part, the policy network, looks at a board and suggests the few moves worth considering — that is intuition about where to play. The other part, the value network, looks at a board and guesses who is winning — that is judgement about how good a position is. With those two senses, the program no longer has to examine everything; it can search like a person, only much faster.

Where it came from

The work came out of Google DeepMind in London, led by David Silver and Aja Huang under Demis Hassabis, and was published in the journal Nature in 2016 with twenty authors. They first showed AlphaGo by letting it crush every other Go program. Then, in October 2015, they brought in Fan Hui — the reigning European champion and a professional player — for five formal games behind closed doors. AlphaGo won all five. It was the first time a computer had beaten a professional at full-board Go without any handicap, something experts had said was still a decade away.

Why it mattered

Go had been a symbol of the gap between human intuition and machine calculation. Beating a professional showed that a single recipe — learn the intuition with deep networks, then refine it with deliberate search — could close that gap on a problem no one had cracked by brute force. The same recipe soon reached well beyond games, and the win arrived years ahead of what the field had expected, recalibrating how fast such learning systems might progress.

A way to picture it

Imagine planning a chess-like move with a wise coach at your shoulder. Out of hundreds of legal moves, the coach quietly points at the three or four worth thinking about — that is the policy network, narrowing your options. Then, instead of playing each one all the way to the end of the game, the coach glances at the resulting board and says "this looks good for you, this looks bad" — that is the value network, judging a position at a glance. You spend your limited thinking time only where it counts. The widget below lets you turn up that thinking time and watch the search settle on the best move.

Where it sits

AlphaGo built on two streams already in the Library. Its networks are deep convolutional networks of the kind that AlexNet (2012) launched into the mainstream; the search idea descends from Monte Carlo tree search developed in the 2000s. The year after this paper, DeepMind let the program throw away the human games and learn entirely from self-play (AlphaGo Zero), and then generalized the method to chess and shogi (AlphaZero). The pattern — learn good guesses, then search to sharpen them — now turns up in protein folding and even in how today's reasoning models think through a problem step by step.

The original document

Original source text

D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, D. Hassabis · Nature 529, 484–489 (28 January 2016)

Abstract

The game of Go has long been viewed as the most challenging of classic games for artificial intelligence owing to its enormous search space and the difficulty of evaluating board positions and moves.

Here we introduce a new approach to computer Go that uses 'value networks' to evaluate board positions and 'policy networks' to select moves.

The abstract goes on to explain that these deep neural networks are trained by a novel combination of supervised learning from human expert games and reinforcement learning from games of self-play, and that the networks alone — without any lookahead search — already play at the level of state-of-the-art Monte Carlo tree search programs.

The search algorithm

We also introduce a new search algorithm that combines Monte Carlo simulation with value and policy networks.

The paper details how each simulation descends the tree by selecting the move that maximizes the action value plus an exploration bonus weighted by the policy-network prior, evaluates the resulting leaf with the value network and a fast rollout, and backs the result up to update the visit counts and action values.

[ … ]

Results

Using this search algorithm, our program AlphaGo achieved a 99.8% winning rate against other Go programs, and defeated the human European Go champion by 5 games to 0.

This is the first time that a computer program has defeated a human professional player in the full-sized game of Go, a feat previously thought to be at least a decade away.

The full paper, with its network architectures, the elo-rating tournament tables, the analysis of the distributed search across CPUs and GPUs, and the record of the match against Fan Hui, runs the length of a Nature article and is available in full at the source below.

Google DeepMind · London · 2016