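Artificial Intelligence 2012

ImageNet Classification with Deep Convolutional Neural Networks

Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton

A deep network, trained on GPUs with a mountain of images, classifies pictures more accurately than any earlier program.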

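A deep neural network, fed more than a million photos and trained on gaming graphics cards, learned to recognize objects far better than any earlier program, and in doing so set off the AI boom we are living through now.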

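Breaking the idea down

For decades, getting a computer to tell a dog from a cat in a photo was a stubbornly hard problem. Programmers hand-wrote rules about edges and shapes, with mediocre results. This paper took a different route: don't write the rules; let the machine learn them from examples, a very large number of examples.

The team trained a “deep” neural network, a stack of simple layers loosely inspired by the brain's neurons, on more than a million labeled photos. The first few layers learn to pick out small features such as an edge or a patch of color; deeper layers combine these into shapes, then parts, and finally whole objects. The clever engineering lay in using graphics cards (chips built for video games) to carry out an enormous number of multiply-and-add operations very quickly, plus a few tricks, such as non-saturating neurons and dropout, that keep such a large network learning well. The result was not a narrow lead: it left everything before it far behind.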

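Where it came from

In 2012, two graduate students at the University of Toronto, Alex Krizhevsky and Ilya Sutskever, working with Geoffrey Hinton, entered the ImageNet competition, an annual contest to sort photos into a thousand categories. Their network, soon nicknamed AlexNet, did not merely win; it won by such a wide margin that within a year or two most of the computer vision field had dropped the older methods in favor of deep learning. It is often called the “Big Bang” of the modern AI era.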

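Why it matters

This was a public, large-scale demonstration that learning from data beats hand-written rules, and that the approach keeps improving as you add more data and more compute. Almost every AI system you use today, from photo tagging and voice assistants to translation and the chatbots that came later, can trace its lineage back to the moment this network won.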

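How the network “sees”

A convolutional network looks at an image through a small sliding window. In each window runs a small filter, a grid of numbers, which lights up when it meets a particular pattern, such as a vertical edge or a patch of a single color. Slide that filter across the whole image and you get a “feature map” showing where the pattern appears. Stack these layer upon layer and the network builds up from edges to shapes to objects.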
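To make the sliding filter concrete, here is a minimal sketch in plain Python. It is an illustration only, not the paper's GPU implementation: the function name `slide_filter`, the toy image, and the hand-picked `vertical_edge` filter are all assumptions made for this example. A real network learns its filter values from data rather than having them written by hand.

```python
def slide_filter(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and
    return the feature map as a list of rows."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for top in range(out_h):
        row = []
        for left in range(out_w):
            # Multiply the window by the filter element-wise, then sum.
            response = sum(
                image[top + i][left + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            row.append(response)
        feature_map.append(row)
    return feature_map


# Toy 4x6 grayscale image: dark (0) on the left, bright (1) on the right.
image = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]

# Hypothetical hand-picked filter that responds to dark-to-bright vertical edges.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = slide_filter(image, vertical_edge)
for row in feature_map:
    print(row)
# [0, 3, 3, 0]
# [0, 3, 3, 0]
```

The feature map is zero over the flat regions and peaks where the window straddles the dark-to-bright boundary, which is the sense in which the filter “lights up” on its pattern.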

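Where you have met it

AlexNet's descendants are everywhere you ask a computer to look at an image: face unlock on your phone, the search that finds “photos of dogs” in your photo library, software that flags tumors in medical scans, and the cameras on self-driving cars. The same broad recipe (a deep network, lots of data, lots of compute) also underpins the AI that understands speech and reads language.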

The original document

Original source text

A. Krizhevsky, I. Sutskever, G. E. Hinton · NeurIPS 25 (2012)

Abstract

We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes.

The network and its result

On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.
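The top-1 and top-5 error rates quoted above can be read as follows: a prediction counts as a top-k hit if the true label is among the k classes the network scores highest. The sketch below is an illustration and not code from the paper; the function name `top_k_error` and the toy scores (six classes instead of 1000) are assumptions made for the example.

```python
def top_k_error(scores, labels, k):
    """Fraction of examples whose true label is NOT among the
    k highest-scoring classes."""
    misses = 0
    for example_scores, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(
            range(len(example_scores)),
            key=lambda c: example_scores[c],
            reverse=True,
        )
        if label not in ranked[:k]:
            misses += 1
    return misses / len(labels)


# Toy scores for 4 images over 6 classes (made-up numbers).
scores = [
    [0.10, 0.50, 0.20, 0.05, 0.10, 0.05],  # true class 1: ranked first
    [0.30, 0.10, 0.40, 0.10, 0.05, 0.05],  # true class 0: ranked second
    [0.02, 0.30, 0.20, 0.25, 0.13, 0.10],  # true class 0: ranked last
    [0.10, 0.10, 0.10, 0.10, 0.50, 0.10],  # true class 4: ranked first
]
labels = [1, 0, 0, 4]

print(top_k_error(scores, labels, 1))  # 0.5
print(top_k_error(scores, labels, 5))  # 0.25
```

Top-5 error is always at most top-1 error, since any top-1 hit is also a top-5 hit; that is why the paper's 17.0% figure is lower than its 37.5% figure.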

Making it train

To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called “dropout” that proved to be very effective.
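The two ideas named above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the paper's code: the function names are hypothetical, ReLU stands in as the non-saturating neuron, and the dropout variant shown zeroes units during training and scales by the keep probability at test time (many libraries instead rescale during training, often called inverted dropout).

```python
import math
import random


def relu(x):
    """Non-saturating activation: passes positives through, zeroes negatives."""
    return max(0.0, x)


def relu_grad(x):
    """Slope of ReLU: stays at 1 however large x gets."""
    return 1.0 if x > 0 else 0.0


def tanh_grad(x):
    """Slope of tanh: shrinks toward 0 as |x| grows (saturation)."""
    return 1.0 - math.tanh(x) ** 2


def dropout_train(activations, drop_prob, rng):
    """Training time: zero each unit independently with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else a for a in activations]


def dropout_test(activations, drop_prob):
    """Test time: keep every unit but scale by the keep probability."""
    return [a * (1.0 - drop_prob) for a in activations]


# Saturation: the tanh slope collapses for large inputs, the ReLU slope does not.
for x in (0.5, 2.0, 5.0):
    print(x, tanh_grad(x), relu_grad(x))

rng = random.Random(0)  # fixed seed so the run is repeatable
hidden = [relu(v) for v in (-1.0, 0.5, 2.0, 3.0)]
print(dropout_train(hidden, 0.5, rng))
print(dropout_test(hidden, 0.5))  # [0.0, 0.25, 1.0, 1.5]
```

A slope that stays at 1 keeps gradient signals from fading in deep stacks, which is one reason non-saturating units train faster. Dropping random units discourages any unit from relying on specific others, which is the intuition behind dropout's regularizing effect.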

Conclusion

Our results show that a large, deep convolutional neural network is capable of achieving record-breaking results on a highly challenging dataset using purely supervised learning.

All of our experiments suggest that our results can be improved simply by waiting for faster GPUs and bigger datasets to become available.

The full paper details the architecture layer by layer, the data augmentation and dropout regularization, the split across two GPUs, and the competition results; it runs to nine pages.

University of Toronto · 2012