Playing Atari with Deep Reinforcement Learning
From raw pixels and a score, one network learns to play — and the DQN era begins.
Hand a computer nothing but the moving pixels of an Atari screen and the score — no rules, no strategy — and it teaches itself to play. That experiment lit the fuse on modern AI agents.
The idea, unpacked
Most AI of the time learned from labelled examples: here is a photo, here is the word 'cat.' But nobody can label the right move in a video game frame by frame. This work used a different kind of learning, called reinforcement learning: the program tries things, sees its score go up or down, and gradually figures out which actions tend to pay off later.
The leap was to feed it the raw screen, just pixels, and let a neural network discover, on its own, both what it was looking at and what to do about it. The same design, with no settings changed between games, was trained from scratch on each of seven different Atari titles and learned to play them all. It was christened DQN, for Deep Q-Network.
Where it came from
In December 2013, a small team at a London company called DeepMind, soon to be acquired by Google, presented a nine-page paper at a NIPS workshop. The pieces it used were not new: the trial-and-error rule called Q-learning went back to 1989 and was formalised in a 1992 paper, and the trick of replaying past experiences dated to the early 1990s. What was new was getting them to work together on raw vision, where earlier attempts to pair Q-learning with neural networks tended to become unstable or diverge.
Their fix was disarmingly simple. Instead of learning from each moment as it happened, where one frame looks almost exactly like the next and the lessons all blur together, they stored the agent's experiences in a memory and replayed them in shuffled, random batches. That one change, plus a few practical tweaks such as clipping rewards and repeating each chosen action for a few frames, kept the learning steady. In early 2015, a little over a year later, the same group put an expanded version of their agent on the cover of Nature.
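The replay idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the class name `ReplayMemory`, its method names, and the capacity and batch size used in the demo are all hypothetical choices made for the example.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size store of past transitions, sampled uniformly at random.

    Hypothetical helper for illustration only.
    """

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest transition once it is full.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        # Store one step of experience as a tuple.
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Draw a batch without replacement, so neighbouring frames
        # are unlikely to end up side by side in the same update.
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


# Demo: integers stand in for screen frames.
memory = ReplayMemory(capacity=1000)
for t in range(1500):
    memory.push(state=t, action=0, reward=0.0, next_state=t + 1, done=False)

batch = memory.sample(batch_size=32, rng=random.Random(0))
print(len(memory), len(batch))
```

Because each batch is drawn from across the whole memory, consecutive and nearly identical frames rarely appear in the same update, which is the decorrelation the paragraph above describes.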
Why it mattered
Before this, an AI that could see and act usually needed a human to hand-design what features in the image mattered. DQN showed you could skip that step: give the machine the raw picture and a reward, and it would work out the rest. That made the recipe general. The very same idea — learn to act from raw perception and a score — runs from the game-playing AlphaGo through the robots that learn to grasp objects, all the way to the training step that turns a raw language model into a helpful chatbot.
An everyday picture
Think of learning a new video game with the sound off and no instructions. At first you mash buttons at random. But you notice that certain moves are followed, a few seconds later, by the score ticking up, so you start doing more of those. Bit by bit you trace the points backward to the actions that earned them. That backward flow of credit, from reward to the moves that led to it, is exactly what a Q-learning agent does, and it is easiest to watch on a tiny grid.
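The backward flow of credit can be sketched with plain tabular Q-learning on a one-row grid. Note that this is the simple table-based method, not the deep network from the paper, and every name and number here (`N_STATES`, `GAMMA`, `ALPHA`, the corridor layout, the purely random exploration) is an assumption chosen for illustration.

```python
import random

N_STATES = 6        # cells 0..5 in a row; cell 5 is the goal (assumed layout)
ACTIONS = (-1, +1)  # index 0 = step left, index 1 = step right
GAMMA = 0.9         # discount factor (assumed)
ALPHA = 0.5         # learning rate (assumed)


def step(state, action):
    """Move one cell, staying inside the row; reward 1 on reaching the goal."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    done = nxt == N_STATES - 1
    reward = 1.0 if done else 0.0
    return nxt, reward, done


def train(episodes=500, seed=0):
    rng = random.Random(seed)
    # q[state][action_index] starts at zero: the agent knows nothing.
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            a = rng.randrange(2)  # button-mashing: act at random
            nxt, reward, done = step(state, ACTIONS[a])
            # Q-learning target: reward now plus discounted best value later.
            target = reward if done else reward + GAMMA * max(q[nxt])
            q[state][a] += ALPHA * (target - q[state][a])
            state = nxt
    return q


q = train()
for s in range(N_STATES - 1):
    print(s, [round(v, 3) for v in q[s]])
```

In this sketch the value of stepping right should come out highest next to the goal and shrink by a factor of `GAMMA` for each extra cell of distance, which is the credit flowing backward from the reward.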
Where it sits
This is the founding paper of deep reinforcement learning, and it sits at a crossroads in the Library. It marries two earlier currents — neural networks that learn to see (the line that runs through AlexNet) and reinforcement learning's century-old question of how a creature learns from reward. The agents that followed, including AlphaGo's mastery of the board game Go, all trace back to this experiment with a joystick and a screen.
We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards.
We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.
Furthermore, in RL the data distribution changes as the algorithm learns new behaviours, which can be problematic for deep learning methods that assume a fixed underlying distribution.
The optimal action-value function obeys an important identity known as the Bellman equation.
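For reference, the identity can be written out as below, in notation close to the paper's. Here Q* is the optimal action-value function, s and a are the current state and action, r is the reward received, s' is the next state, and γ is the discount factor. The second and third lines give the regression target and squared-error loss that the network with parameters θ_i is trained on at iteration i; the symbol ρ for the distribution of visited state-action pairs follows the paper, and the whole block is meant as a reading aid rather than a full derivation.

```latex
Q^*(s, a) = \mathbb{E}_{s'}\!\left[\, r + \gamma \max_{a'} Q^*(s', a') \;\middle|\; s, a \right]

y_i = \mathbb{E}_{s'}\!\left[\, r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \;\middle|\; s, a \right]

L_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot)}\!\left[ \left( y_i - Q(s, a; \theta_i) \right)^2 \right]
```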
By using experience replay the behavior distribution is averaged over many of its previous states, smoothing out learning and avoiding oscillations or divergence in the parameters.
The first hidden layer convolves 16 8 × 8 filters with stride 4 with the input image and applies a rectifier nonlinearity. The second hidden layer convolves 32 4 × 4 filters with stride 2, again followed by a rectifier nonlinearity. The final hidden layer is fully-connected and consists of 256 rectifier units.
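The layer sizes in that description can be checked with a little arithmetic. The sketch below assumes details that are not in the excerpt itself: an 84 × 84 input with 4 stacked frames (as the paper describes its preprocessing), convolutions with no padding, a linear output layer with one unit per action, and a hypothetical game with 4 actions. The function names are made up for the example.

```python
def conv_out(size, kernel, stride):
    """Output width of an unpadded convolution."""
    return (size - kernel) // stride + 1


def dqn_layer_summary(in_size=84, in_channels=4, n_actions=4):
    """Return (name, output shape, parameter count) for each layer.

    Assumes square inputs, no padding, and one bias per filter or unit.
    """
    layers = []

    # First hidden layer: 16 filters of 8 x 8, stride 4.
    s1 = conv_out(in_size, 8, 4)
    layers.append(("conv1", (16, s1, s1), 8 * 8 * in_channels * 16 + 16))

    # Second hidden layer: 32 filters of 4 x 4, stride 2.
    s2 = conv_out(s1, 4, 2)
    layers.append(("conv2", (32, s2, s2), 4 * 4 * 16 * 32 + 32))

    # Final hidden layer: 256 fully connected rectifier units.
    flat = 32 * s2 * s2
    layers.append(("fc", (256,), flat * 256 + 256))

    # Output layer (assumed): one linear unit per action.
    layers.append(("out", (n_actions,), 256 * n_actions + n_actions))
    return layers


summary = dqn_layer_summary()
total_params = sum(params for _, _, params in summary)
for name, shape, params in summary:
    print(f"{name:6s} {str(shape):14s} {params:>8d}")
print("total", total_params)
```

Under these assumptions nearly all of the weights sit in the fully connected layer, and the whole network has well under a million parameters, small by later standards.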
So far the network has outperformed all previous RL algorithms on six of the seven games we have attempted and surpassed an expert human player on three of them.