Artificial Intelligence 2013

Playing Atari with Deep Reinforcement Learning

Volodymyr Mnih et al. (DeepMind)

From nothing but raw pixels and a score, a single network learned to play the games. The DQN era starts here.

In depth · the introduction

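A computer was given only the flickering pixels of an Atari screen and a score, with no rules and no walkthrough, and it taught itself to play. This was the experiment that lit the fuse for modern AI agents.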

Unpacking the idea

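Most AI at the time learned from labelled examples: here is a photo, here is the word "cat". But nobody can label the "correct move" for every frame of a game. This work used a different kind of learning, called reinforcement learning: the program tries things, watches its score rise or fall, and gradually works out which actions lead to reward later on.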

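The leap was to feed it the raw screen, nothing but pixels, and let a neural network discover for itself what it was looking at and what to do about it. The same program, with nothing changed between games, learned to play seven different Atari games. It was named DQN, short for "Deep Q-Network".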

Where it came from

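In December 2013, a small team at a London company called DeepMind, which Google acquired shortly afterwards, submitted a nine-page paper to a workshop. The parts it used were not new: the trial-and-error rule called Q-learning dates back to 1992, and the trick of "replaying past experience" to 1993. What was new was getting them to work together on raw vision, where earlier attempts had tended to spin out of control.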

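Their fix was surprisingly plain. Instead of learning from each moment as it happened, where one frame is almost identical to the next and the lessons blur together, they stored the agent's experiences in a memory and replayed them in small, shuffled, random batches. That one change, plus a few pragmatic tweaks, was enough to stabilise learning. Two years later, the same team put their agent on the cover of Nature.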

Why it matters

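Before this, an AI that could see and act usually needed a person to hand-design which features of the image mattered. DQN showed that this step could be skipped: give the machine the raw frames and a reward, and it will work out the rest. That made the recipe general. The very same idea, learning to act from raw perception and a score, runs from the Go-playing AlphaGo, through robots that learn to grasp objects, all the way to the training stage that turns a raw language model into a helpful chatbot.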

An everyday picture

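Imagine learning a new game with the sound off and no manual. At first you just mash buttons. But you notice that a few seconds after certain moves the score jumps, so you start making those moves more often. Bit by bit, you trace the score back to the moves that earned it. This flow of "credit" from a reward back to the actions that led to it is exactly what a reinforcement-learning agent does, even on a small grid of squares.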
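The backward flow of credit can be sketched with tabular Q-learning on a tiny corridor. This is an illustrative toy, not the paper's setup: the corridor, the function name `train_corridor`, and every parameter value here are assumptions chosen for readability.

```python
import random


def train_corridor(n_states=6, episodes=500, alpha=0.5, gamma=0.9,
                   epsilon=0.2, seed=0):
    """Tabular Q-learning on a 1-D corridor; the only reward is at the right end."""
    rng = random.Random(seed)
    # q[s][a]: estimated future reward for action a (0 = left, 1 = right) in state s
    q = [[0.0, 0.0] for _ in range(n_states)]
    goal = n_states - 1
    for _ in range(episodes):
        s = 0
        while s != goal:
            # Explore with probability epsilon, otherwise take the best-known action
            # (ties go to "right").
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                a = 0 if q[s][0] > q[s][1] else 1
            s_next = max(0, s - 1) if a == 0 else s + 1
            reward = 1.0 if s_next == goal else 0.0
            # Credit flows backwards: the target borrows the value of the next state.
            future = 0.0 if s_next == goal else max(q[s_next])
            q[s][a] += alpha * (reward + gamma * future - q[s][a])
            s = s_next
    return q


q = train_corridor()
for s, (left, right) in enumerate(q[:-1]):
    print(f"state {s}: left={left:.3f} right={right:.3f}")
```

After training, the value of moving right should be higher than the value of moving left in every state, and it should grow the closer the state is to the reward. That gradient is the credit that has flowed back from the goal.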

Where it stands

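This is the founding work of deep reinforcement learning, and in this museum it stands at a crossroads. It joins two earlier currents: neural networks that learned to "see" (the line running through AlexNet), and reinforcement learning's century-old question of how a creature learns from reward. Many of the agents that followed, including AlphaGo's mastery of Go, trace back to this experiment with a joystick and a screen.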

The original document

Original source text

V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, M. Riedmiller · DeepMind Technologies · NIPS Deep Learning Workshop 2013 (arXiv:1312.5602, 19 Dec 2013)

Abstract

We present the first deep learning model to successfully learn control policies directly from high-dimensional sensory input using reinforcement learning. The model is a convolutional neural network, trained with a variant of Q-learning, whose input is raw pixels and whose output is a value function estimating future rewards.

The abstract then states that the same architecture and learning algorithm — with no per-game adjustment — is applied to seven Atari 2600 games from the Arcade Learning Environment.

We find that it outperforms all previous approaches on six of the games and surpasses a human expert on three of them.

Introduction — the obstacles

The introduction sets out why reinforcement learning had not yet conquered raw vision: rewards are sparse, noisy and delayed; consecutive observations are highly correlated rather than independent; and the data distribution itself shifts as the agent's behaviour changes.

Furthermore, in RL the data distribution changes as the algorithm learns new behaviours, which can be problematic for deep learning methods that assume a fixed underlying distribution.

Background — the Bellman equation

The optimal action-value function obeys the Bellman equation: if the optimal value of the next state were known for every possible next action, the optimal strategy would be to select the action that maximises the expected value of the immediate reward plus the discounted value of what follows.

The optimal action-value function obeys an important identity known as the Bellman equation.
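Written out in the usual notation, the identity described above takes the following form. Here $Q^*$ is the optimal action-value function, $r$ the immediate reward, $\gamma$ the discount factor, $s'$ and $a'$ the next state and next action, and the expectation runs over the environment's transitions. The symbol names follow common convention and are not taken from the excerpt above.

```latex
Q^*(s, a) = \mathbb{E}_{s'}\left[\, r + \gamma \max_{a'} Q^*(s', a') \;\middle|\; s, a \,\right]
```

Q-learning turns this identity into an update rule by treating the right-hand side, evaluated with the current estimate of $Q$, as a target for the left-hand side.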

Deep Reinforcement Learning — experience replay

The central trick is experience replay: each transition is stored in a replay memory, and learning updates are drawn at random from that pool, breaking the correlations between consecutive samples and smoothing the training distribution.

By using experience replay the behavior distribution is averaged over many of its previous states, smoothing out learning and avoiding oscillations or divergence in the parameters.
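A minimal sketch of a replay memory, assuming a fixed-capacity buffer that overwrites its oldest entry once full. The class name `ReplayMemory`, the tuple layout of a transition and the tiny capacity are illustrative choices, not details from the text above.

```python
import random


class ReplayMemory:
    """Fixed-size store of transitions, sampled uniformly at random."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []
        self.position = 0  # index of the next slot to write

    def push(self, transition):
        # Append until full, then overwrite the oldest transition.
        if len(self.buffer) < self.capacity:
            self.buffer.append(transition)
        else:
            self.buffer[self.position] = transition
        self.position = (self.position + 1) % self.capacity

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement breaks up consecutive frames.
        return rng.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)


memory = ReplayMemory(capacity=5)
for t in range(10):
    # (state, action, reward, next_state) with integers standing in for frames
    memory.push((t, 0, 0.0, t + 1))
batch = memory.sample(3, rng=random.Random(0))
print(len(memory), batch)
```

Because each minibatch mixes transitions from different moments of play, neighbouring samples in an update are no longer near-duplicates of one another.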

Preprocessing and architecture

The first hidden layer convolves 16 8 × 8 filters with stride 4 with the input image and applies a rectifier nonlinearity. The second hidden layer convolves 32 4 × 4 filters with stride 2, again followed by a rectifier nonlinearity. The final hidden layer is fully-connected and consists of 256 rectifier units.

Raw 210 × 160 frames are converted to grey-scale, down-sampled and cropped to 84 × 84, and the last four frames are stacked into the network's input. The output layer has a separate unit for each action, each giving that action's predicted Q-value, so one forward pass scores every action at once.
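The layer sizes above can be checked with a little arithmetic. This sketch assumes the convolutions use no padding, which the text does not state. The helper `conv_output_size` is a hypothetical name for the standard output-size formula.

```python
def conv_output_size(size, kernel, stride):
    """Side length after an unpadded convolution (assumption: no padding)."""
    return (size - kernel) // stride + 1


input_side = 84                                  # 84 x 84 input, 4 stacked frames
side1 = conv_output_size(input_side, kernel=8, stride=4)   # first hidden layer
side2 = conv_output_size(side1, kernel=4, stride=2)        # second hidden layer
flattened = side2 * side2 * 32                   # 32 feature maps fed to the 256 units

print(f"after conv 1: {side1} x {side1} x 16")
print(f"after conv 2: {side2} x {side2} x 32")
print(f"inputs to the fully-connected layer: {flattened}")
```

Under that assumption the two convolutions shrink the 84 × 84 input to 20 × 20 and then 9 × 9, so the 256-unit layer sees 2,592 values per input.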

[ … ]

Experiments — results

Across all seven games — Beam Rider, Breakout, Enduro, Pong, Q*bert, Seaquest and Space Invaders — the same network, trained with RMSProp on minibatches of 32 over 10 million frames with an ε-greedy policy annealed from 1.0 to 0.1, learns purely from pixels and the score.
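The annealed ε-greedy policy can be sketched as follows. The text gives only the start and end values of ε. The linear shape of the schedule and the one-million-frame annealing length used as a default here are assumptions, as are the function names.

```python
import random


def epsilon_at(frame, start=1.0, end=0.1, anneal_frames=1_000_000):
    """Linearly anneal epsilon from start to end, then hold it at end."""
    fraction = min(frame / anneal_frames, 1.0)
    return end + (1.0 - fraction) * (start - end)


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the highest-valued one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


for frame in (0, 500_000, 1_000_000, 10_000_000):
    print(frame, round(epsilon_at(frame), 3))

# With epsilon = 0 the choice is purely greedy.
print(epsilon_greedy([0.1, 0.7, 0.2], epsilon=0.0))
```

Early in training almost every action is random, which fills the replay memory with varied experience. Later the agent mostly follows its own value estimates while still exploring one step in ten.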

So far the network has outperformed all previous RL algorithms on six of the seven games we have attempted and surpassed an expert human player on three of them.

DeepMind · December 2013