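Artificial Intelligence 2013

Playing Atari with Deep Reinforcement Learning

Volodymyr Mnih et al. (DeepMind)

From nothing but raw pixels and a score, a single network learned to play games; the DQN era began here.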

In depth · the introduction

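Give a computer nothing but the flickering pixels of an Atari screen and a score, with no rules and no strategy guide, and it taught itself to play. This experiment lit the fuse for modern AI agents.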

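Unpacking the idea

Most AI at the time learned from labelled examples: here is a photo, here is the word "cat". But nobody can label the "correct move" for every frame of a game. This work uses a different kind of learning, called reinforcement learning: the program tries things, watches its score rise or fall, and gradually works out which actions lead to reward later on.

The leap was to feed it the raw screen, nothing but pixels, and let a neural network discover for itself what it was looking at and what to do about it. The same program, with nothing changed between games, learned to play seven different Atari games. It was named DQN, short for "deep Q-network".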

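Where it came from

In December 2013, a small team at a London company called DeepMind, which Google acquired soon afterwards, submitted a nine-page paper to a workshop. Its parts were not new: the trial-and-error rule called Q-learning dates back to 1992, and the trick of replaying past experience to 1993. What was new was getting them to work together on raw vision, where earlier attempts had tended to spiral out of control.

Their fix was surprisingly plain. Rather than learning from each moment as it happened, where one frame is almost identical to the next and the lessons blur together, they stored the agent's experiences in a memory and replayed them in shuffled, random minibatches. That one change, plus a few pragmatic tweaks, stabilised learning. Two years later, the same team put their agent on the cover of Nature.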

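Why it matters

Before this, an AI that could see and act usually needed a human to hand-design which features of the image mattered. DQN showed that this step could be skipped: hand the machine the raw frames and a reward, and it works out the rest. That made the recipe general. The same basic idea, learning to act from raw perception and a score, runs from the Go-playing AlphaGo, through robots that learn to grasp objects, to the training stage that turns a raw language model into a helpful chatbot.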

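An everyday picture

Imagine learning a new game with the sound off and no manual. At first you just mash buttons. But you notice that a few seconds after certain actions the score jumps, so you start doing those actions more often. Bit by bit, you trace the score back to the actions that earned it. This flow of "credit" from a reward back to the actions that led to it is exactly what a reinforcement-learning agent does, even on a small grid of squares.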
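The backward flow of credit can be sketched with tabular Q-learning on a tiny corridor. This is an illustrative toy, not the paper's setup: the corridor, the function name `train_corridor`, and all parameter values are assumptions chosen for the example. The only reward is 1 for stepping onto the rightmost square, yet after training the "move right" value of every earlier square is positive and shrinks by the discount factor with each step of distance from the goal.

```python
import random

def train_corridor(n_states=6, episodes=200, alpha=0.5, gamma=0.9,
                   epsilon=0.2, seed=0):
    """Tabular Q-learning on a 1-D corridor (hypothetical toy environment).

    The agent starts at square 0; reaching the last square gives reward 1
    and ends the episode. Every other step gives reward 0.
    """
    rng = random.Random(seed)
    # q[s][a]: action 0 moves left, action 1 moves right
    q = [[0.0, 0.0] for _ in range(n_states)]
    goal = n_states - 1
    for _ in range(episodes):
        s = 0
        while s != goal:
            # Explore with probability epsilon; also pick at random on ties.
            if rng.random() < epsilon or q[s][0] == q[s][1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[s][0] > q[s][1] else 1
            s_next = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s_next == goal else 0.0
            # The terminal square has no future value.
            future = 0.0 if s_next == goal else max(q[s_next])
            # Q-learning update: move the estimate towards r + gamma * future.
            q[s][a] += alpha * (r + gamma * future - q[s][a])
            s = s_next
    return q

q = train_corridor()
for s, (left, right) in enumerate(q[:-1]):
    print(f"square {s}: left={left:.3f}  right={right:.3f}")
```

Reading the printed table from the last row upwards shows the credit fading with distance: the square next to the goal values "right" at about 1, the one before it at about 0.9, and so on.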

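Where it sits

This is the founding work of deep reinforcement learning, and in this museum it stands at a crossroads. It joined two older currents: neural networks that learned to see (the line that runs through AlexNet), and reinforcement learning's century-old question of how a creature learns from reward. Many later agents, including AlphaGo's mastery of Go, trace back to this experiment with a joystick and a screen.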

The original document

Original source text

V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, M. Riedmiller · DeepMind Technologies · NIPS Deep Learning Workshop 2013 (arXiv:1312.5602, 19 Dec 2013)

Abstract

We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards.

The abstract then states that the same architecture and learning algorithm — with no per-game adjustment — is applied to seven Atari 2600 games from the Arcade Learning Environment.

We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.

Introduction — the obstacles

The introduction sets out why reinforcement learning had not yet conquered raw vision: rewards are sparse, noisy and delayed; consecutive observations are highly correlated rather than independent; and the data distribution itself shifts as the agent's behaviour changes.

Furthermore, in RL the data distribution changes as the algorithm learns new behaviours, which can be problematic for deep learning methods that assume a fixed underlying distribution.

Background — the Bellman equation

The optimal action-value function obeys the Bellman equation: if the optimal value of the next state were known for every possible action, the optimal strategy would be to select the action that maximises the expected value of the immediate reward plus the discounted value of what follows.

The optimal action-value function obeys an important identity known as the Bellman equation.
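Written out, the identity takes roughly the following form. The notation is a reconstruction that follows the paper's conventions as best recalled: $Q^*$ is the optimal action-value function, $s$ and $a$ the current state and action, $r$ the immediate reward, $\gamma$ the discount factor, and $s'$, $a'$ the next state and action. The second line is the squared-error loss that a network $Q(s, a; \theta_i)$ is trained to minimise at iteration $i$, with the target built from the previous parameters $\theta_{i-1}$.

```latex
\[
Q^*(s, a) = \mathbb{E}_{s'}\!\left[\, r + \gamma \max_{a'} Q^*(s', a') \;\middle|\; s, a \,\right]
\]
\[
L_i(\theta_i) = \mathbb{E}_{s, a}\!\left[ \bigl( y_i - Q(s, a; \theta_i) \bigr)^2 \right],
\qquad
y_i = \mathbb{E}_{s'}\!\left[\, r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \;\middle|\; s, a \,\right]
\]
```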

Deep Reinforcement Learning — experience replay

The central trick is experience replay: each transition is stored in a replay memory, and learning updates are drawn at random from that pool, breaking the correlations between consecutive samples and smoothing the training distribution.

By using experience replay the behavior distribution is averaged over many of its previous states, smoothing out learning and avoiding oscillations or divergence in the parameters.
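A replay memory of this kind can be sketched in a few lines. This is a minimal illustration, not the authors' code: the class name `ReplayMemory`, its method names, and the tuple layout are assumptions made for the example. A bounded queue drops the oldest transition once it is full, and sampling uniformly without replacement is what breaks up runs of consecutive frames.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size store of transitions with uniform random sampling.

    Hypothetical helper for illustration; names are not from the paper.
    """

    def __init__(self, capacity, seed=None):
        # deque with maxlen silently discards the oldest item when full
        self.buffer = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sample without replacement across the whole memory,
        # so a minibatch mixes transitions from many different moments.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)

# Usage: integers stand in for frames so the eviction is easy to see.
memory = ReplayMemory(capacity=50, seed=0)
for t in range(100):
    memory.store(state=t, action=0, reward=0.0, next_state=t + 1, done=False)
batch = memory.sample(32)
print(len(memory), sorted(tr[0] for tr in batch)[:5])
```

Copying the queue into a list on every call keeps the sketch short; a memory holding a million frames would more likely use a preallocated array and sample indices instead.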

Preprocessing and architecture

The first hidden layer convolves 16 8 × 8 filters with stride 4 with the input image and applies a rectifier nonlinearity. The second hidden layer convolves 32 4 × 4 filters with stride 2, again followed by a rectifier nonlinearity. The final hidden layer is fully-connected and consists of 256 rectifier units.

Raw 210 × 160 frames are converted to grey-scale, down-sampled to 110 × 84 and cropped to 84 × 84, and the last four frames are stacked into the network's input. The output layer has one unit per valid action, each giving that action's predicted Q-value, so a single forward pass scores every action at once.
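The layer sizes quoted above imply specific feature-map shapes, which the arithmetic below works through. It assumes convolutions without padding, which the excerpt does not state, and the function names are invented for the example. Under that assumption an 84 × 84 input shrinks to 20 × 20 after the first layer and 9 × 9 after the second, so the fully-connected layer sees 2592 inputs.

```python
def conv_output_size(size, kernel, stride):
    """Output width of a square convolution, assuming no padding."""
    return (size - kernel) // stride + 1

def dqn_layer_shapes(input_size=84, stacked_frames=4):
    """Shapes as (channels, height, width) for the layers described above."""
    h1 = conv_output_size(input_size, kernel=8, stride=4)  # 16 filters, 8x8, stride 4
    h2 = conv_output_size(h1, kernel=4, stride=2)          # 32 filters, 4x4, stride 2
    return {
        "input": (stacked_frames, input_size, input_size),
        "conv1": (16, h1, h1),
        "conv2": (32, h2, h2),
        "flattened": 32 * h2 * h2,
        "hidden": 256,
    }

shapes = dqn_layer_shapes()
for name, shape in shapes.items():
    print(name, shape)
```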

[ … ]

Experiments — results

Across all seven games — Beam Rider, Breakout, Enduro, Pong, Q*bert, Seaquest and Space Invaders — the same network, trained with RMSProp on minibatches of 32 over 10 million frames with an ε-greedy policy annealed from 1.0 to 0.1, learns purely from pixels and the score.
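The ε-greedy policy mentioned above can be sketched as follows. The excerpt gives only the endpoints 1.0 and 0.1; the linear shape of the schedule and the one-million-frame anneal length used as the default here are recalled from the paper and should be treated as assumptions, as should the function names. With probability ε the agent picks a random action, otherwise the action with the highest predicted Q-value.

```python
import random

def epsilon_at(frame, start=1.0, end=0.1, anneal_frames=1_000_000):
    """Linearly anneal epsilon from start to end, then hold it at end.

    anneal_frames is an assumed default, not taken from the excerpt.
    """
    fraction = min(frame / anneal_frames, 1.0)
    return start + fraction * (end - start)

def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

# Usage: epsilon early, midway, and after annealing has finished.
for frame in (0, 500_000, 2_000_000):
    print(frame, round(epsilon_at(frame), 3))
print(epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0))
```

Starting fully random lets the replay memory fill with varied experience before the network's own estimates begin to steer behaviour.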

So far the network has outperformed all previous RL algorithms on six of the seven games we have attempted and surpassed an expert human player on three of them.

DeepMind · December 2013