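Artificial Intelligence 2012

ImageNet Classification with Deep Convolutional Neural Networks

Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton

A deep network, trained on GPUs with a mountain of images, saw more accurately than any program before it.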

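A deep neural network, fed more than a million photos and trained on gaming graphics cards, learned to recognize objects far better than any earlier program, and in doing so set off the AI boom we are living through now.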

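Unpacking the idea

For decades, getting a computer to tell a dog from a cat in a photo was frustratingly hard. Programmers hand-wrote rules about edges and shapes, with mediocre results. This paper took a different route: do not write the rules. Let the machine learn them from examples, a very large number of examples.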

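The team trained a “deep” neural network, a stack of simple layers loosely inspired by the brain's neurons, on more than a million labeled photos. The first few layers learn to notice small features such as an edge or a patch of color; deeper layers combine these into shapes, then parts, and finally whole objects. The clever engineering lay in using graphics cards (chips built for video games) to carry out an enormous number of sums very quickly, along with a few tricks that keep so large a network learning well. The result was not a narrow lead: it left everything before it far behind.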

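Where it came from

In 2012 two graduate students, Alex Krizhevsky and Ilya Sutskever, working with Geoffrey Hinton at the University of Toronto, entered the ImageNet competition, an annual contest in sorting photos into one thousand categories. Their network, soon nicknamed AlexNet, did not merely win. Its margin was so large that, within months, much of the field began dropping the older methods in favor of deep learning. It is often called the “Big Bang” of the modern AI era.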

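Why it matters

This was a public, large-scale demonstration that learning from data beats hand-written rules, and that the approach keeps improving as you add more data and more compute. Nearly every AI you use today, from photo tagging and voice assistants to translation and the chatbots that came later, can trace its lineage back to the moment this network won.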

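How the network “sees”

A convolutional network looks at an image through a small sliding window. Inside each window runs a small filter, a grid of numbers, that lights up when it meets a particular pattern, such as a vertical edge or a patch of a single color. Slide the filter across the whole image and you get a “feature map” showing where that pattern appears. Stack these layer upon layer, and the network builds up from edges to shapes to objects.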
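
The sliding-filter idea can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's GPU implementation: the function name `slide_filter`, the toy image, and the hand-made `vertical_edge` filter are all assumptions made up for this example, and the filter is applied with no padding and a stride of 1. (Strictly speaking this computes a cross-correlation, which is what deep-learning libraries usually mean by “convolution”.)

```python
def slide_filter(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for top in range(out_h):
        row = []
        for left in range(out_w):
            # Multiply the window by the filter element by element, then sum.
            total = sum(
                image[top + i][left + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            row.append(total)
        feature_map.append(row)
    return feature_map


# Toy 4x6 image (hypothetical): dark (0) on the left, bright (1) on the right.
image = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]

# Hand-made vertical-edge filter: responds where the right side of the
# window is brighter than the left side.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = slide_filter(image, vertical_edge)
for row in feature_map:
    print(row)  # prints [0, 3, 3, 0] for each of the two rows
```

The feature map is large only where the window straddles the dark-to-bright boundary and zero over the flat regions, which is the “lighting up” described above. In a trained network the filter values are learned from data rather than written by hand.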

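Where you have met it

AlexNet's descendants are everywhere you ask a computer to look at an image: face unlock on your phone, the search that finds “photos of dogs” in your album, medical scanners that flag tumors, and the cameras on self-driving cars. The same broad recipe (a deep network, lots of data, lots of compute) also underpins the AI that understands speech and reads language.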

The original document

Original source text

A. Krizhevsky, I. Sutskever, G. E. Hinton · NeurIPS 25 (2012)

Abstract

We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes.

The network and its result

On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0% which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.
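
For readers unfamiliar with the two metrics, here is a small sketch of how top-1 and top-k error can be computed. A prediction counts as correct under top-k if the true label is among the k classes the network scores highest. The function name `top_k_error` and the scores below are made up for illustration (4 images, 4 classes, k = 2); the paper reports k = 1 and k = 5 over 1000 classes.

```python
def top_k_error(scores, labels, k):
    """Fraction of examples whose true label is NOT among the k highest-scoring classes."""
    misses = 0
    for class_scores, label in zip(scores, labels):
        # Class indices ordered from highest score to lowest.
        ranked = sorted(
            range(len(class_scores)),
            key=lambda c: class_scores[c],
            reverse=True,
        )
        if label not in ranked[:k]:
            misses += 1
    return misses / len(labels)


# Hypothetical per-class scores for four images (not data from the paper).
scores = [
    [0.1, 0.6, 0.2, 0.1],  # true class 1: best guess is right
    [0.5, 0.3, 0.1, 0.1],  # true class 1: second-best guess is right
    [0.4, 0.3, 0.2, 0.1],  # true class 3: missed even within the top 2
    [0.1, 0.2, 0.3, 0.4],  # true class 3: best guess is right
]
labels = [1, 1, 3, 3]

top1 = top_k_error(scores, labels, k=1)
top2 = top_k_error(scores, labels, k=2)
print(top1)  # 0.5
print(top2)  # 0.25
```

Top-k error can never exceed top-1 error on the same predictions, which is why the paper's top-5 figure is the lower of the two.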

Making it train

To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called “dropout” that proved to be very effective.
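
The two ideas in this passage can be sketched as follows. The excerpt above does not spell out the details, so treat these as assumptions: the non-saturating neuron is taken to be the rectified linear unit, f(x) = max(0, x), and dropout is taken to zero each hidden unit with probability 0.5 during training and to scale outputs by 0.5 at test time, as the full paper describes. The function names are invented for this example.

```python
import random


def relu(x):
    """Non-saturating activation: passes positive inputs through, zeroes the rest."""
    return max(0.0, x)


def dropout_train(activations, drop_prob, rng):
    """During training, zero each activation independently with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else a for a in activations]


def dropout_test(activations, drop_prob):
    """At test time, keep every unit but scale it by the keep probability."""
    keep_prob = 1.0 - drop_prob
    return [a * keep_prob for a in activations]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
pre_activations = [-2.0, 0.5, 3.0, -0.1, 1.5, 4.0]  # hypothetical layer inputs

hidden = [relu(x) for x in pre_activations]
print(hidden)                            # [0.0, 0.5, 3.0, 0.0, 1.5, 4.0]
print(dropout_train(hidden, 0.5, rng))   # a random subset of units is zeroed
print(dropout_test(hidden, 0.5))         # [0.0, 0.25, 1.5, 0.0, 0.75, 2.0]
```

Because a different random subset of units is silenced on every training step, no unit can rely on a specific partner being present, which is the intuition behind dropout's regularizing effect.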

Conclusion

Our results show that a large, deep convolutional neural network is capable of achieving record-breaking results on a highly challenging dataset using purely supervised learning.

All of our experiments suggest that our results can be improved simply by waiting for faster GPUs and bigger datasets to become available.

The full paper details the architecture layer by layer, the data augmentation and dropout regularization, the split across two GPUs, and the competition results; it runs to nine pages and is available in full at the source below.

University of Toronto · 2012