Artificial Intelligence 2013

Playing Atari with Deep Reinforcement Learning

Volodymyr Mnih et al. (DeepMind)

From raw pixels and a score, one network learns to play — and the DQN era begins.

In depth · the introduction

Hand a computer nothing but the moving pixels of an Atari screen and the score — no rules, no strategy — and it teaches itself to play. That experiment lit the fuse on modern AI agents.

The idea, unpacked

Most AI of the time learned from labelled examples: here is a photo, here is the word 'cat.' But nobody can label the right move in a video game frame by frame. This work used a different kind of learning, called reinforcement learning: the program tries things, sees its score go up or down, and gradually figures out which actions tend to pay off later.

The leap was to feed it the raw screen — just pixels — and let a neural network discover, on its own, both what it was looking at and what to do about it. One and the same program, with nothing changed between games, learned to play seven different Atari titles. It was christened DQN, for Deep Q-Network.

Where it came from

In December 2013, a small team at a London company called DeepMind, soon to be acquired by Google, posted a nine-page paper to arXiv and presented it at the NIPS Deep Learning Workshop. The pieces it used were not new: the trial-and-error rule called Q-learning dated to 1992, the trick of replaying past experiences to 1993. What was new was getting them to work together on raw vision, where earlier attempts to pair Q-learning with neural networks had a habit of becoming unstable or diverging outright.

Their fix was disarmingly simple. Learning from each moment as it happens goes badly, because one frame looks almost exactly like the next and the lessons all blur together. So they stored the agent's experiences in a memory and replayed them in shuffled, random batches. That one change, plus a few practical tweaks, kept the learning steady. In February 2015, a little over a year later, the same group put an extended version of the agent on the cover of Nature.

Why it mattered

Before this, an AI that could see and act usually needed a human to hand-design which features in the image mattered. DQN showed you could skip that step: give the machine the raw picture and a reward, and it would work out the rest. That made the recipe general. The same broad idea, learning to act from a reward signal rather than from labelled answers, runs from the game-playing AlphaGo through the robots that learn to grasp objects, all the way to the training step that turns a raw language model into a helpful chatbot.

An everyday picture

Think of learning a new video game with the sound off and no instructions. At first you mash buttons at random. But you notice that certain moves are followed, a few seconds later, by the score ticking up, so you start doing more of those. Bit by bit you trace the points backward to the actions that earned them. That backward flow of credit, from reward to the moves that led to it, is exactly what a reinforcement-learning agent computes.
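To make the backward flow of credit concrete, here is a minimal sketch on a hypothetical five-cell corridor, where only the step into the last cell pays a point. It uses a plain lookup table rather than the paper's neural network, and every name and constant in it (`GAMMA`, `step`, `sweep`, `train`) is an illustrative assumption, not something taken from the paper.

```python
GAMMA = 0.9        # discount factor (illustrative value)
ALPHA = 1.0        # learning rate; 1.0 is safe because this toy world is deterministic
N_STATES = 5       # corridor cells 0..4; cell 4 is the goal
LEFT, RIGHT = 0, 1


def step(state, action):
    """Move one cell; stepping into the last cell pays 1 and ends the episode."""
    nxt = max(0, state - 1) if action == LEFT else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def sweep(q):
    """Apply one Q-learning update to every (state, action) pair."""
    for s in range(N_STATES - 1):  # the goal cell has nothing left to learn
        for a in (LEFT, RIGHT):
            nxt, reward, done = step(s, a)
            # Target: the reward now, plus the discounted best value afterwards.
            target = reward if done else reward + GAMMA * max(q[nxt])
            q[s][a] += ALPHA * (target - q[s][a])


def train(n_sweeps):
    """Start from an all-zero table and run the given number of sweeps."""
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(n_sweeps):
        sweep(q)
    return q


if __name__ == "__main__":
    for n in range(1, 5):
        # Value of moving RIGHT from cells 0..3 after n sweeps.
        print(n, [round(row[RIGHT], 3) for row in train(n)[:-1]])
```

With this sweep order, the value of moving right reaches one cell further from the goal on each pass, shrinking by the discount factor at every step. That is the points being traced backward to the moves that earned them.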

Where it sits

This is the founding paper of deep reinforcement learning, and it sits at a crossroads in the Library. It marries two earlier currents — neural networks that learn to see (the line that runs through AlexNet) and reinforcement learning's century-old question of how a creature learns from reward. The agents that followed, including AlphaGo's mastery of the board game Go, all trace back to this experiment with a joystick and a screen.

The original document

Original source text

V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, M. Riedmiller · DeepMind Technologies · NIPS Deep Learning Workshop 2013 (arXiv:1312.5602, 19 Dec 2013)

Abstract

We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards.

The abstract then states that the same architecture and learning algorithm — with no per-game adjustment — is applied to seven Atari 2600 games from the Arcade Learning Environment.

We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.

Introduction — the obstacles

The introduction sets out why reinforcement learning had not yet conquered raw vision: rewards are sparse, noisy and delayed; consecutive observations are highly correlated rather than independent; and the data distribution itself shifts as the agent's behaviour changes.

Furthermore, in RL the data distribution changes as the algorithm learns new behaviours, which can be problematic for deep learning methods that assume a fixed underlying distribution.

Background — the Bellman equation

The optimal action-value function obeys the Bellman equation. The intuition: if the optimal value of the next state were known for every action available there, the best strategy would be to pick the action that maximises the expected sum of the immediate reward and the discounted value of what follows.

The optimal action-value function obeys an important identity known as the Bellman equation.
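Written out in the paper's notation, the identity and the training objective built on it look roughly as follows. Here s and a are the current state and action, r the immediate reward, γ the discount factor, s′ and a′ the next state and action, and θ the network weights. The target y is shown in its single-sample form, which is a simplification of the expectation used in the paper.

```latex
\[
Q^*(s,a) \;=\; \mathbb{E}_{s'}\!\left[\, r + \gamma \max_{a'} Q^*(s',a') \;\middle|\; s,a \,\right]
\]

\[
y_i \;=\; r + \gamma \max_{a'} Q(s',a';\theta_{i-1}),
\qquad
L_i(\theta_i) \;=\; \mathbb{E}\!\left[\big(y_i - Q(s,a;\theta_i)\big)^2\right]
\]
```

The first line is the Bellman equation itself. The second turns it into a regression problem: the network's estimate Q(s, a; θ_i) is pulled towards a target computed with the previous iteration's weights θ_{i−1}.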

Deep Reinforcement Learning — experience replay

The central trick is experience replay: each transition is stored in a replay memory, and learning updates are drawn at random from that pool, breaking the correlations between consecutive samples and smoothing the training distribution.

By using experience replay the behavior distribution is averaged over many of its previous states, smoothing out learning and avoiding oscillations or divergence in the parameters.
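A replay memory can be sketched in a few lines. This is a minimal illustration, assuming uniform random sampling from a fixed-size buffer that discards its oldest entries; the class and method names are hypothetical. The paper reports a memory holding the one million most recent frames, and the tiny capacity in the usage lines is only for demonstration.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity store of transitions, sampled uniformly at random."""

    def __init__(self, capacity):
        # A bounded deque silently drops the oldest transition when full.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        """Record one step of experience."""
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        """Draw a minibatch without replacement, ignoring time order."""
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


if __name__ == "__main__":
    memory = ReplayMemory(capacity=1000)
    for t in range(50):
        memory.push(state=t, action=0, reward=0.0, next_state=t + 1, done=False)
    minibatch = memory.sample(32)  # 32 is the minibatch size the paper reports
    print(len(minibatch))
```

Because each minibatch mixes transitions from many different moments, neighbouring frames no longer dominate any single update, which is the decorrelation the passage above describes.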

Preprocessing and architecture

The first hidden layer convolves 16 8 × 8 filters with stride 4 with the input image and applies a rectifier nonlinearity. The second hidden layer convolves 32 4 × 4 filters with stride 2, again followed by a rectifier nonlinearity. The final hidden layer is fully-connected and consists of 256 rectifier units.
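The layer sizes can be checked with a little arithmetic. The sketch below assumes an 84 × 84 input with four stacked frames and unpadded ("valid") convolutions; the padding choice is an assumption, since the excerpt does not state it. The helper names are hypothetical.

```python
def conv_output_size(size, kernel, stride):
    """Side length after an unpadded convolution along one dimension."""
    return (size - kernel) // stride + 1


# (number of filters, kernel size, stride) for the two convolutional layers
CONV_LAYERS = [(16, 8, 4), (32, 4, 2)]


def feature_shapes(side=84, channels=4):
    """Return (height, width, channels) for the input and after each conv layer."""
    shapes = [(side, side, channels)]
    for filters, kernel, stride in CONV_LAYERS:
        side = conv_output_size(side, kernel, stride)
        shapes.append((side, side, filters))
    return shapes


def flattened_size():
    """Number of values fed into the 256-unit fully-connected layer."""
    side, _, filters = feature_shapes()[-1]
    return side * side * filters


if __name__ == "__main__":
    print(feature_shapes())   # input, after conv 1, after conv 2
    print(flattened_size())   # inputs to the fully-connected layer
```

Under those assumptions the feature maps shrink from 84 × 84 to 20 × 20 and then 9 × 9, so the fully-connected layer sees 9 × 9 × 32 = 2592 values.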

Raw 210 × 160 frames are converted to grey-scale, down-sampled to 110 × 84 and cropped to 84 × 84, and the last four frames are stacked into the network's input. The output layer has a separate unit for each action, each giving that action's predicted Q-value, so one forward pass scores every action at once.
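The preprocessing pipeline might look something like the sketch below, written in plain Python so it runs without extra libraries. Several details are assumptions rather than facts from the paper: the luminance weights, the nearest-neighbour down-sampling, the crop offset `crop_top`, and the choice to fill the stack with copies of the first frame. The 110 × 84 intermediate size is the one the paper gives.

```python
from collections import deque


def to_grey(frame):
    """Convert rows of (r, g, b) pixels to luminance (weights are an assumption)."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row] for row in frame]


def resize_nearest(img, out_h, out_w):
    """Down-sample by picking the nearest source pixel (method is an assumption)."""
    in_h, in_w = len(img), len(img[0])
    return [[img[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
            for y in range(out_h)]


def preprocess(frame, crop_top=18):
    """210x160 RGB frame -> 84x84 grey image; crop_top is a hypothetical offset."""
    small = resize_nearest(to_grey(frame), 110, 84)
    return [row[:84] for row in small[crop_top:crop_top + 84]]


class FrameStack:
    """Keeps the most recent k preprocessed frames as the network input."""

    def __init__(self, k=4):
        self.frames = deque(maxlen=k)

    def push(self, frame):
        if not self.frames:
            # Assumption: pad with copies so the very first input is k deep.
            self.frames.extend([frame] * self.frames.maxlen)
        else:
            self.frames.append(frame)
        return list(self.frames)


if __name__ == "__main__":
    raw = [[(x % 256, y % 256, 0) for x in range(160)] for y in range(210)]
    stacked = FrameStack().push(preprocess(raw))
    print(len(stacked), len(stacked[0]), len(stacked[0][0]))
```

Stacking four frames is what lets the network see motion: a single still image cannot show which way the ball is travelling.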

[ … ]

Experiments — results

Across all seven games (Beam Rider, Breakout, Enduro, Pong, Q*bert, Seaquest and Space Invaders), the same network learns purely from pixels and the score. It is trained with RMSProp on minibatches of 32 over 10 million frames, following an ε-greedy policy whose ε is annealed linearly from 1.0 to 0.1 over the first million frames and held at 0.1 afterwards.
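The exploration schedule is simple enough to sketch directly. The paper anneals ε linearly from 1.0 to 0.1 over the first million frames and holds it there; the function names and the `rng` parameter below are hypothetical, added so the random choice can be controlled in tests.

```python
import random

EPS_START = 1.0            # act fully at random to begin with
EPS_END = 0.1              # keep 10% random actions after annealing
ANNEAL_FRAMES = 1_000_000  # length of the linear ramp, in frames


def epsilon_at(frame):
    """Exploration rate after the given number of frames."""
    fraction = min(frame / ANNEAL_FRAMES, 1.0)
    return EPS_START + fraction * (EPS_END - EPS_START)


def select_action(q_values, frame, rng=random):
    """Epsilon-greedy: a random action with probability epsilon, else the best one."""
    if rng.random() < epsilon_at(frame):
        return rng.randrange(len(q_values))
    # One forward pass yields a Q-value per action, so greedy choice is an argmax.
    return max(range(len(q_values)), key=q_values.__getitem__)


if __name__ == "__main__":
    for frame in (0, 500_000, 1_000_000, 10_000_000):
        print(frame, round(epsilon_at(frame), 3))
    print(select_action([0.1, 0.7, 0.3], frame=5_000_000))
```

Early on the agent explores almost entirely at random; as its value estimates improve, it leans on them more, while the residual 10% of random actions keeps it from settling too early on one habit.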

So far the network has outperformed all previous RL algorithms on six of the seven games we have attempted and surpassed an expert human player on three of them.

DeepMind · December 2013