Artificial Intelligence 2017

Attention Is All You Need

Ashish Vaswani et al. (Google)

Drop the recurrent structure and let every word directly "see" every other word. That is how the Transformer was born.

In depth · the introduction

This is the design that lets an AI weigh every word against every other word in a single pass, and it became the foundation of today's chatbots.

Breaking the idea down

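Earlier language AI read a sentence in order, one word at a time, trying to remember what it had already read. That made these models slow to train and prone to "forgetting" on long passages: by the end of a paragraph, the beginning was already half lost. The Transformer throws out this step-by-step way of reading entirely.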

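Its core idea is called "attention." Instead of reading in order, the model looks at all the words together and, for each word, judges how much every other word matters to its meaning. All of these comparisons happen at the same time rather than one after another, which is why the model trains so quickly and can be built so large so readily.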
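To make "all at once" concrete, here is a minimal sketch in Python with NumPy. It is an illustration under stated assumptions, not the paper's code: the word vectors are random placeholders, and a plain dot product stands in for "how much one word matters to another." The point it shows is that a nested loop over word pairs and a single matrix product give the same scores, and the matrix product is the form that parallel hardware handles well.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

# Hypothetical toy input: 5 words, each represented by a 4-dimensional vector.
rng = np.random.default_rng(0)
words = rng.normal(size=(5, 4))

# One-at-a-time version: score each pair of words in a nested loop.
scores_loop = np.zeros((5, 5))
for i in range(5):
    for j in range(5):
        scores_loop[i, j] = words[i] @ words[j]

# All-at-once version: one matrix product yields every pairwise score.
scores_parallel = words @ words.T

# Turn each row into weights that sum to 1:
# row i says how much word i attends to each word j.
weights = softmax(scores_parallel, axis=-1)
```

Both versions produce the same 5 x 5 grid of scores; only the second expresses the work as one operation with no step depending on the previous one.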

Where it came from

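In 2017, eight researchers at Google presented an eight-page paper at the NeurIPS conference under a playful title: "Attention Is All You Need." It targeted machine translation, and on that narrow task alone it already beat the best systems of the day. Its real weight, though, lay in the architecture it introduced, the Transformer, which spread through the field at remarkable speed and within a year or two had become the default backbone of language AI.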

Why it matters

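With the slow, sequential step removed, the Transformer could be trained at a scale nothing before it could reach. That unlocked a simple but powerful loop: make the model bigger, feed it more text, and it gets smarter. Make it bigger again, feed it more, over and over. That loop is the whole era of large language models, and this paper drew its blueprint. The chatbots, writing tools, translators, and assistants that followed are all built on it.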

A small example

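Take the sentence "The trophy doesn't fit in the suitcase because it is too big." What does "it" refer to, the trophy or the suitcase? You know at once that it is the trophy, but a computer has to work that out for itself. Attention is how the model does so: it lets the word "it" look across the whole sentence and place the most weight on "trophy."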
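The same example can be sketched in a few lines of Python. The relevance scores below are invented by hand purely for illustration; they do not come from a trained model. The sketch only shows the mechanics: raw scores for the word "it" are turned into weights that sum to 1, and the largest weight lands on "trophy."

```python
import numpy as np

tokens = ["The", "trophy", "doesn't", "fit", "in", "the",
          "suitcase", "because", "it", "is", "too", "big"]

# Made-up relevance scores for the word "it" against every token.
# A real model would compute these from learned vectors.
scores = np.array([0.1, 3.0, 0.2, 0.8, 0.1, 0.1,
                   1.5, 0.3, 0.5, 0.2, 0.6, 1.2])

# Softmax: turn raw scores into positive weights that sum to 1.
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# The token that receives the most attention from "it".
top = tokens[int(np.argmax(weights))]

# Show the three most heavily weighted tokens.
for token, w in sorted(zip(tokens, weights), key=lambda p: -p[1])[:3]:
    print(f"{token:>8s}  {w:.2f}")
```

With these hand-picked scores, `top` is "trophy", with "suitcase" second. In a trained Transformer the scores are not written by anyone; they fall out of the learned parameters.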

What happened next

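Two major families that grew directly out of this paper both became household names. BERT uses the Transformer to read and understand text; the GPT series uses it to generate text, and grew into the assistants people now talk to every day. Nearly every language AI you have used has a design that traces back to these eight pages.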

The original document

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin · NeurIPS 30 (2017)

Abstract

We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.

The remainder of the abstract reports that the model is both more parallelizable and faster to train, and that it sets a new state of the art on two WMT 2014 machine-translation tasks.

Attention

An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors.
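As a rough illustration of that definition, the sketch below maps one query and a small set of key-value pairs to one output vector. The function name `attend` and all the numbers are assumptions made for this example. The scoring rule used (a dot product divided by the square root of the key dimension, followed by a softmax) is the one the paper adopts, but this is a simplified single-query sketch rather than the paper's implementation.

```python
import numpy as np

def attend(query, keys, values):
    """Map one query and a set of key-value pairs to one output vector.

    query:  shape (d_k,)
    keys:   shape (n, d_k), one key per pair
    values: shape (n, d_v), one value per pair
    """
    d_k = keys.shape[-1]
    # Compatibility of the query with each key (scaled dot product).
    scores = keys @ query / np.sqrt(d_k)
    # Softmax: positive weights that sum to 1.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # The output is the weighted sum of the values.
    return weights @ values, weights

# Hypothetical toy data: three key-value pairs.
keys = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
values = np.array([[10.0, 0.0, 0.0],
                   [0.0, 10.0, 0.0],
                   [0.0, 0.0, 10.0]])
query = np.array([4.0, 0.0])

output, weights = attend(query, keys, values)
```

Here the query lines up equally well with the first and third keys and not at all with the second, so the output is mostly a blend of the first and third values.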

The paper then defines Scaled Dot-Product Attention and Multi-Head Attention, and explains how the same mechanism is reused in the encoder, the decoder, and the cross-attention between them.
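For readers who want the notation, the two definitions are written in the paper as follows. Here Q, K, and V are matrices of queries, keys, and values, d_k is the key dimension, h is the number of heads, and the W matrices are learned projections.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V

\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O},
\qquad
\mathrm{head}_i = \mathrm{Attention}\!\left(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V}\right)
```

Dividing by the square root of d_k keeps the dot products from growing too large before the softmax; running several heads lets the model attend to different kinds of relationships in parallel.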

Self-attention

Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.
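A minimal sketch of self-attention, assuming random placeholder weights in place of learned ones: the queries, keys, and values are all computed from the same sequence, so every position is related to every other position of that one sequence. The function and variable names are chosen for this example and are not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over one sequence x of shape (n, d_model)."""
    # Queries, keys, and values all come from the same sequence.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Score every position against every other position, then scale.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)
    # Each output row is a weighted mix of all value rows.
    return weights @ v, weights

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
n, d_model, d_k = 6, 8, 4
x = rng.normal(size=(n, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
```

The result `out` is a new representation of the sequence, one row per position, and `weights` is an n x n grid showing how strongly each position drew on each other position.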

A table compares self-attention to recurrent and convolutional layers on computational complexity, the amount of parallel work, and the maximum path length between any two positions.
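For reference, the main rows of that comparison, as reported in the paper, can be summarised as below. Here n is the sequence length, d is the representation dimension, and k is the convolution kernel size.

```latex
\begin{array}{lccc}
\text{Layer type} & \text{Complexity per layer} & \text{Sequential operations} & \text{Maximum path length} \\
\hline
\text{Self-attention} & O(n^2 \cdot d) & O(1) & O(1) \\
\text{Recurrent} & O(n \cdot d^2) & O(n) & O(n) \\
\text{Convolutional} & O(k \cdot n \cdot d^2) & O(1) & O(\log_k n)
\end{array}
```

The middle column is the one that explains the training speed: a recurrent layer needs a number of sequential steps that grows with the sentence length, while self-attention needs a constant number.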

[ … ]

Results

Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results … by over 2 BLEU.

The full paper, with its architecture diagram, the attention-visualisation figures, and the complete tables of BLEU scores and training cost, runs to eight pages and is available in full at the source below.

Google · June 2017