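Artificial Intelligence · 1989

Backpropagation Applied to Handwritten Zip Code Recognition

Yann LeCun et al. (AT&T Bell Laboratories)

A network learns to read digits directly from pixels, and the trick is weight sharing.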

In depth · the introduction

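The U.S. Postal Service had a problem: machines could not read the zip codes people scrawled on envelopes. This paper taught a machine to read them by letting it learn directly from the pixels.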

Breaking the idea down

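Earlier handwriting recognition worked in two steps: first, human experts hand-designed the features to look for (a corner here, an edge there), and only then did a program classify the result. This paper threw out that first, hand-designed step. It fed the raw 16×16 image of a digit straight into a single neural network and let the network discover for itself what to look for.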

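The clever part is how the network "sees." Instead of giving every pixel its own set of connections, it uses one small detector, a 5×5 patch of weights, and slides that same detector across the whole image. A vertical stroke looks the same whether it sits on the left or the right, so why learn it twice? Reusing one set of weights at every position is called "weight sharing," and it is the trick that made the whole thing work.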
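
The sliding detector can be illustrated with a minimal sketch in plain Python. The 16×16 image and 5×5 weights come from the text; the stride of 1, the absence of padding, the hand-made stroke kernel, and all function names are assumptions made for illustration, not the paper's actual layer.

```python
# Sketch of weight sharing: one 5x5 detector reused at every image position.
# Assumptions (not from the paper): stride 1, no padding, hand-made kernel.

IMAGE_SIZE = 16
KERNEL_SIZE = 5


def slide_detector(image, kernel):
    """Apply the same kernel at every position where it fits inside the image."""
    out_size = len(image) - len(kernel) + 1
    response = []
    for row in range(out_size):
        response_row = []
        for col in range(out_size):
            total = 0.0
            for i in range(len(kernel)):
                for j in range(len(kernel)):
                    total += kernel[i][j] * image[row + i][col + j]
            response_row.append(total)
        response.append(response_row)
    return response


def image_with_vertical_stroke(column):
    """Blank image (-1) with a one-pixel-wide vertical stroke (+1) in one column."""
    return [[1.0 if c == column else -1.0 for c in range(IMAGE_SIZE)]
            for _ in range(IMAGE_SIZE)]


# Hypothetical vertical-stroke detector: positive centre column, negative elsewhere.
stroke_kernel = [[1.0 if j == 2 else -0.25 for j in range(KERNEL_SIZE)]
                 for _ in range(KERNEL_SIZE)]

left = slide_detector(image_with_vertical_stroke(4), stroke_kernel)
right = slide_detector(image_with_vertical_stroke(11), stroke_kernel)

# The same 25 weights respond equally strongly to the stroke on either side;
# only the position of the peak in the response map changes.
peak_left = max(max(row) for row in left)
peak_right = max(max(row) for row in right)
col_left = left[0].index(peak_left)
col_right = right[0].index(peak_right)
print(peak_left, col_left, peak_right, col_right)
```

In this toy setup the peak response is identical for both images and simply moves with the stroke, which is the point of sharing the weights: the detector is learned once and applies everywhere.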

Where it came from

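In the late 1980s a young French researcher named Yann LeCun joined the legendary AT&T Bell Laboratories. There, a team of seven took on a solid, real-world task brought to them by the U.S. Postal Service: reading the handwritten zip codes on actual mail. The team had nearly ten thousand digit images, digitized from envelopes that passed through the Buffalo, New York post office. They were dirty, slanted, and smudged, and they came from thousands of different hands.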

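Backpropagation, the method that lets a network learn from its own mistakes, had been published only three years earlier by Rumelhart, Hinton, and Williams (also in this collection). LeCun's team showed that it could be aimed at a real industrial problem rather than a toy, and that it could train a network that looks directly at pixels.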

Why it matters

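By building knowledge about images into the architecture (local detectors, reused everywhere), the network has far fewer numbers to learn than a naive design would: 9,760 independent parameters. The fewer numbers there are to learn, the less data is needed and the better the guesses on examples never seen before. The final system reached a 5% error rate on genuinely hard handwriting and ran on inexpensive hardware, reading more than ten digits per second from real mail. It proved that a network can learn to "see."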

A rubber stamp that learns

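Imagine you need to find a particular shape, say a short diagonal line, anywhere on a page. You could memorize every place it might appear, or you could carve a small rubber stamp shaped like that line, press it everywhere, and mark each spot where it fits. A convolutional network takes the second route: it carves a small handful of stamps (detectors for edges, corners, and strokes) and presses each one across the entire image. Unlike real stamps, these are learned: the network refines their shapes bit by bit until they pick out exactly the features that tell a 2 from a 7.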

Where it sits

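This is the first link in a chain. Backpropagation (Rumelhart, Hinton, and Williams, 1986, in this collection) gave it a way to learn; this 1989 paper taught that learning to see images; LeNet-5 (1998) refined it for reading bank checks; and AlexNet (2012, also in this collection) scaled the recipe up on powerful graphics chips and a million photos from the web, setting off the current wave of AI. Whenever your phone sorts photos by face or your car recognizes a stop sign, you are looking at a great-grandchild of this network.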

The original document

Original source text

Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel · Neural Computation 1(4):541–551 (1989) · communicated by Dana Ballard

Abstract

The ability of learning networks to generalize can be greatly enhanced by providing constraints from the task domain. This paper demonstrates how such constraints can be integrated into a backpropagation network through the architecture of the network.

The abstract goes on: the approach "has been successfully applied to the recognition of handwritten zip code digits provided by the U.S. Postal Service. A single network learns the entire recognition operation, going from the normalized image of the character to the final classification."

§2 · The data base

The data base "consists of 9298 segmented numerals digitized from handwritten zip codes that appeared on U.S. mail passing through the Buffalo, NY post office." 7291 examples are used for training and 2007 for testing; each digit is normalized to a 16×16 grayscale image with gray levels scaled to fall within the range −1 to 1.
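
The gray-level scaling described above might look like the following sketch. The excerpt only says the levels are scaled to fall within −1 to 1; the 0 to 255 input range, the linear mapping, and the function name `scale_gray_levels` are assumptions for illustration.

```python
# Sketch: linearly map raw gray levels into the range [-1, 1].
# Assumption (not from the excerpt): raw pixels are integers in 0..255.


def scale_gray_levels(pixels, low=0, high=255):
    """Map each gray level from [low, high] to [-1, 1] with a linear rescale."""
    span = high - low
    return [[2.0 * (p - low) / span - 1.0 for p in row] for row in pixels]


# Tiny stand-in for a 16x16 digit image, just to show the mapping.
raw = [[0, 255], [51, 204]]
scaled = scale_gray_levels(raw)
print(scaled)
```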

§3 · Network design — feature maps and weight sharing

In our case, the first hidden layer is composed of several planes that we call feature maps. All units in a plane share the same set of weights, thereby detecting the same feature at different locations.

Three hidden layers (H1, H2, H3) feed a 10-unit output with place coding. H1 holds 12 feature maps of 8×8 units, each unit reading a 5×5 neighbourhood through a shared kernel — "a nonlinear subsampled convolution with a 5 by 5 kernel." In total the network has "1256 units, 64,660 connections, and 9760 independent parameters."
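
To see how much weight sharing saves in the first hidden layer, the figures quoted above (12 feature maps, 8×8 units each, 5×5 kernels) can be plugged into a short count. This is a sketch under one assumption the excerpt does not state: each unit carries its own unshared bias.

```python
# Counting layer H1 from the quoted figures.
FEATURE_MAPS = 12
MAP_SIDE = 8
KERNEL_SIDE = 5

units = FEATURE_MAPS * MAP_SIDE * MAP_SIDE        # units in H1
weights_per_kernel = KERNEL_SIDE * KERNEL_SIDE    # weights in one shared kernel

# Assumption: one bias per unit, not shared across the feature map.
connections = units * (weights_per_kernel + 1)

# With sharing: one kernel per feature map, plus the per-unit biases.
shared_parameters = FEATURE_MAPS * weights_per_kernel + units

# Without sharing, every connection would be its own free parameter.
unshared_parameters = connections
reduction = unshared_parameters / shared_parameters

print(units, connections, shared_parameters, round(reduction, 1))
```

Under that assumption, the 768 units of H1 have 19,968 connections but only 1,068 free parameters, a reduction of roughly 19 to 1 in this one layer.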

§4–5 · Training and results

Nodes use a scaled hyperbolic tangent; the cost is mean squared error; weights are updated by stochastic ("on-line") gradient with a diagonal-Hessian variant of Newton's method. "The network was trained for 23 passes through the training set (167,693 pattern presentations)."
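
The ingredients named above can be sketched for a single unit. Several things here are assumptions rather than details from the excerpt: the constants 1.7159 and 2/3 (values LeCun recommends in later writing), the learning rate, the toy pattern, and the use of a plain stochastic gradient step in place of the diagonal-Hessian variant.

```python
import math

# Assumed constants for the scaled hyperbolic tangent; not given in the excerpt.
A = 1.7159
S = 2.0 / 3.0


def scaled_tanh(a):
    """Scaled hyperbolic tangent activation."""
    return A * math.tanh(S * a)


def scaled_tanh_derivative(a):
    """Derivative of scaled_tanh with respect to its input."""
    t = math.tanh(S * a)
    return A * S * (1.0 - t * t)


def online_step(weights, bias, inputs, target, learning_rate=0.05):
    """One plain stochastic gradient step on half squared error for one unit."""
    a = sum(w * x for w, x in zip(weights, inputs)) + bias
    error = scaled_tanh(a) - target
    delta = error * scaled_tanh_derivative(a)  # gradient of 0.5 * error**2 w.r.t. a
    new_weights = [w - learning_rate * delta * x for w, x in zip(weights, inputs)]
    new_bias = bias - learning_rate * delta
    return new_weights, new_bias, 0.5 * error * error


# Hypothetical single training pattern, presented repeatedly.
weights, bias = [0.1, -0.2, 0.05], 0.0
pattern, target = [1.0, -1.0, 0.5], 1.0
losses = []
for _ in range(20):
    weights, bias, loss = online_step(weights, bias, pattern, target)
    losses.append(loss)
print(losses[0], losses[-1])
```

The weights change after every pattern rather than after a full pass, which is what "on-line" means here. The full network applies the same idea through all its layers via backpropagation.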

The percentage of misclassified patterns was 0.14% on the training set (10 mistakes) and 5.0% on the test set (102 mistakes).
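
The reported counts can be checked with a few lines of arithmetic, using only numbers quoted in this excerpt (7291 training and 2007 test examples, 23 passes, 10 and 102 mistakes). The training figure rounds to the quoted 0.14%; the test figure works out to about 5.08%, which the source reports as 5.0%.

```python
# Arithmetic check on the quoted training and test figures.
TRAIN_COUNT = 7291
TEST_COUNT = 2007
PASSES = 23

presentations = PASSES * TRAIN_COUNT          # pattern presentations in training
train_error_pct = 100.0 * 10 / TRAIN_COUNT    # 10 mistakes on the training set
test_error_pct = 100.0 * 102 / TEST_COUNT     # 102 mistakes on the test set

print(presentations, round(train_error_pct, 2), round(test_error_pct, 2))
```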

[ … ]

§5.1 · Comparison with other work

This "constrained backpropagation" is the key to success of the present system: it not only builds in shift-invariance, but vastly reduces the entropy, the Vapnik-Chervonenkis dimensionality, and the number of free parameters.

§6 · Conclusion

We have successfully applied backpropagation learning to a large, real-world task. Our results appear to be at the state of the art in digit recognition.

The final network ran on a commercial AT&T DSP-32C signal processor at more than 10 classifications per second, camera to label. The full paper — with its architecture diagram, the synthesized kernels, and the error-versus-passes curves — runs to eleven pages and is available in full at the source below.

AT&T Bell Laboratories, Holmdel · 1989