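Artificial Intelligence 2017

Attention Is All You Need

Ashish Vaswani et al. (Google)

Drop the recurrence and let every word directly "see" every other word. With that, the Transformer was born.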

In depth · the introduction

This is the design that lets an AI weigh every word against every other word in a single pass, and it became the foundation of today's chatbots.

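Breaking the idea down

Earlier language AIs read a sentence in order, one word at a time, trying to remember what they had already read. That made them slow to train and "forgetful" on long passages: by the end of a paragraph, the beginning was already half forgotten. The Transformer threw out this step-by-step way of reading altogether.

Its core idea is called "attention". Instead of reading in order, the model looks at all the words together and, for each word, judges how much every other word matters to its meaning. All of these comparisons happen at the same time rather than one after another, which is exactly why the model trains so quickly and is so easy to scale up to enormous sizes.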
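The comparison step described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the paper's implementation: the function names and the toy vectors are invented here, and the learned projection matrices that a real Transformer applies to each word are left out.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Simplified self-attention: each vector is its own query, key and value."""
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Compare this word with every word (itself included) using a
        # scaled dot product.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # Blend all word vectors according to those weights.
        mixed = [
            sum(w * vec[i] for w, vec in zip(weights, vectors))
            for i in range(dim)
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical 2-dimensional "word vectors", invented for illustration only.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(toy_vectors)
```

The loop over `query` is written out for readability. No iteration depends on another, which is why practical implementations typically compute all of them at once as a single matrix multiplication.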

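Where it came from

In 2017, eight researchers at Google published an eight-page paper at the NeurIPS conference under a playful title: "Attention Is All You Need". It targeted machine translation, and on that narrow task alone it already beat the best systems of the day. Its real weight, though, lay in the architecture it introduced, the Transformer, which spread across the field at remarkable speed and within a year or two had become the default skeleton of language AI.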

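Why it matters

With the slow, serial step gone, the Transformer could be trained at a scale nothing before it could reach. That unlocked a simple but powerful loop: make the model bigger, feed it more text, and it gets smarter. Make it bigger again, feed it more again, over and over. That loop is the entire era of large language models, and this paper drew its blueprint. The chatbots, writing tools, translators and assistants that followed are all built on it.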

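A small example

Take the sentence "The trophy doesn't fit in the suitcase because it is too big." What does "it" refer to: the trophy or the suitcase? You know at once that it is the trophy, but a computer has to work that out for itself. Attention is how the model does it: it lets the word "it" look around the whole sentence and place the most weight on "trophy".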
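To make "placing the most weight" concrete, here is a minimal sketch of how a list of relevance scores becomes attention weights. The scores are hand-picked assumptions for illustration; a trained model would compute them from learned vectors, and the word list is a simplified version of the sentence.

```python
import math

# Words of the example sentence that "it" could attend to (simplified list).
words = ["trophy", "fit", "suitcase", "because", "big"]

# Hypothetical relevance scores for the word "it". These are invented
# for this example and do not come from any real model.
scores = [3.0, 0.5, 2.0, 0.1, 1.0]

def softmax(values):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

weights = softmax(scores)

# The word that receives the largest share of attention.
focus = words[weights.index(max(weights))]

for word, weight in zip(words, weights):
    print(f"{word:>8}: {weight:.2f}")
```

With these assumed scores, `focus` is `"trophy"`. The softmax step keeps every word in play with a small weight rather than picking a single winner outright.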

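What happened next

Two big families grew directly out of this paper, and both became household names. BERT uses the Transformer to read and understand text; the GPT series uses it to generate text, and grew into the assistants people now talk to every day. Almost every language AI you have used can trace its design back to these eight pages.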

The original document

Original source text

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin · NeurIPS 30 (2017)

Abstract

We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.

The remainder of the abstract reports that the model is both more parallelizable and faster to train, and that it sets a new state of the art on two WMT 2014 machine-translation tasks.

Attention

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors.

The paper then defines Scaled Dot-Product Attention and Multi-Head Attention, and explains how the same mechanism is reused in the encoder, the decoder, and the cross-attention between them.
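For reference, the two definitions are commonly written as below. Here Q, K and V are matrices holding the queries, keys and values, d_k is the dimension of the keys, h is the number of heads, and the W matrices are learned projections; the notation follows the usual presentation of the paper and is reproduced as a reading aid rather than as a quotation.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V

\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O},
\qquad
\mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V})
```

Dividing by the square root of d_k keeps the dot products from growing with the vector size, which would otherwise push the softmax into regions where its gradients are very small.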

Self-attention

Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.

A table compares self-attention to recurrent and convolutional layers on computational complexity, the amount of parallel work, and the maximum path length between any two positions.
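As a rough illustration of that comparison, the sketch below plugs numbers into the per-layer big-O terms generally cited from the paper's table: n² · d for self-attention, n · d² for a recurrent layer, and k · n · d² for a convolutional one. The function names, the default kernel width and the example sizes are assumptions made here, and constant factors are ignored.

```python
def per_layer_cost(n, d, k=3):
    """Rough per-layer operation counts (big-O terms only, constants ignored).

    n: sequence length, d: representation dimension, k: convolution kernel width.
    """
    return {
        "self_attention": n * n * d,     # O(n^2 * d)
        "recurrent": n * d * d,          # O(n * d^2)
        "convolutional": k * n * d * d,  # O(k * n * d^2)
    }

def sequential_steps(n):
    """Steps that must run one after another and cannot be parallelised."""
    return {"self_attention": 1, "recurrent": n, "convolutional": 1}

# Example sizes chosen for illustration: a 50-word sentence, 512-dim vectors.
cost = per_layer_cost(n=50, d=512)
steps = sequential_steps(n=50)
```

When the sequence length n is smaller than the dimension d, as in this example, the self-attention term is the smallest of the three, and it needs only a constant number of sequential steps where the recurrent layer needs n.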

[ … ]

Results

Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results … by over 2 BLEU.

The full paper, with its architecture diagram, the attention-visualisation figures, and the complete tables of BLEU scores and training cost, runs to eight pages and is available in full at the source below.

Google · June 2017