Artificial Intelligence 1989

Backpropagation Applied to Handwritten Zip Code Recognition

Yann LeCun et al. (AT&T Bell Laboratories)

A network learns to read digits directly from pixels, and the trick is weight sharing.

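The U.S. Postal Service had a problem: machines could not read the zip codes people scrawled on envelopes. This paper taught a machine to read them by letting it learn directly from the pixels.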

Unpacking the idea

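Earlier handwriting recognition worked in two steps: human experts first hand-designed the features to look for (a corner here, an edge there), and only then did a program classify the result. This paper throws out the hand-design step. It feeds the raw 16×16 image of a digit directly into a single neural network and lets the network discover for itself what to look for.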

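The clever part is how the network "looks". Instead of giving every pixel its own set of connections, it uses one small detector, a 5×5 patch of weights, and slides that same detector across the whole image. A vertical stroke looks the same on the left as on the right, so why learn it twice? Reusing one set of weights at every position is called "weight sharing", and it is the trick that made the whole thing work.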
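As a rough illustration of weight sharing, the sketch below slides one 5×5 detector over two 16×16 images that differ only in where a vertical stroke sits. The kernel values, the helper names, and the choice of -1 for background and +1 for ink are assumptions made for this example. In the paper the kernels are learned, not written by hand.

```python
def slide_detector(image, kernel):
    """Apply one shared kernel at every valid position of a 2-D image."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    responses = []
    for top in range(out_h):
        row = []
        for left in range(out_w):
            # The same kernel weights are reused at every (top, left).
            total = 0.0
            for i in range(kh):
                for j in range(kw):
                    total += kernel[i][j] * image[top + i][left + j]
            row.append(total)
        responses.append(row)
    return responses


def blank_image(size=16, background=-1.0):
    """Hypothetical helper: a square image filled with the background level."""
    return [[background] * size for _ in range(size)]


def draw_vertical_stroke(image, column, top=3, bottom=13, ink=1.0):
    """Hypothetical helper: paint rows top..bottom-1 of one column with ink."""
    for r in range(top, bottom):
        image[r][column] = ink
    return image


def strongest_response(response_map):
    """Return (value, row, col) of the largest response in the map."""
    return max(
        (value, r, c)
        for r, row in enumerate(response_map)
        for c, value in enumerate(row)
    )


# Hand-made detector (an assumption): strong centre column, weak sides.
VERTICAL_KERNEL = [[-1.0, -1.0, 4.0, -1.0, -1.0] for _ in range(5)]

left = slide_detector(draw_vertical_stroke(blank_image(), 4), VERTICAL_KERNEL)
right = slide_detector(draw_vertical_stroke(blank_image(), 11), VERTICAL_KERNEL)

print(strongest_response(left))
print(strongest_response(right))
```

The two strongest responses have the same value and differ only in their column, which is the point of the design: 25 weights cover the stroke wherever it appears.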

Where it came from

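In the late 1980s a young French researcher named Yann LeCun joined the legendary AT&T Bell Laboratories. He and six colleagues took on a concrete, real-world task handed to them by the U.S. Postal Service: reading the handwritten zip codes on real mail. The team had nearly ten thousand digit images, digitized from envelopes that passed through the post office in Buffalo, New York. They were smudged, slanted, blurry, and written by thousands of different hands.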

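Backpropagation, the method that lets a network learn from its own mistakes, had been published by Rumelhart, Hinton, and Williams only three years earlier (also in this collection). LeCun's team showed that it could be pointed at a real industrial problem, not just toy examples, and that it could train a network that looks directly at pixels.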

Why it matters

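By building knowledge about images (local detectors, reused everywhere) into the architecture, the network ends up with far fewer numbers to learn than a naive design would need: 9,760 independent parameters. Fewer numbers to learn means less data is required and better guesses on examples the network has never seen. The final result was a 5% error rate on genuinely hard handwriting, from a system that ran on inexpensive hardware, read more than ten digits per second, and handled real mail. It showed that a network can learn to "see".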
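To give a feel for the saving, the arithmetic below compares the first hidden layer as the paper describes it (12 feature maps of 8×8 units, each with a shared 5×5 kernel) against a hypothetical fully connected layer with the same number of units. The fully connected baseline is an invention for comparison only, and the one-bias-per-unit convention follows the full paper rather than the text here.

```python
# Sizes reported in the paper: 16x16 input, 12 feature maps of 8x8 units,
# each unit reading a 5x5 patch through its map's shared kernel.
INPUT_PIXELS = 16 * 16
H1_MAPS = 12
H1_UNITS_PER_MAP = 8 * 8
KERNEL_WEIGHTS = 5 * 5

h1_units = H1_MAPS * H1_UNITS_PER_MAP

# Hypothetical naive design: every unit has its own weight to every pixel,
# plus one bias.
naive_params = h1_units * (INPUT_PIXELS + 1)

# Shared design: one kernel per map, plus one bias per unit (assumption
# taken from the full paper, where biases are not shared).
shared_params = H1_MAPS * KERNEL_WEIGHTS + h1_units

print(naive_params)   # 197376
print(shared_params)  # 1068
```

Under these assumptions the shared layer needs roughly 185 times fewer learned numbers than the naive one.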

A rubber stamp that learns

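Imagine you need to find a particular shape, say a short diagonal line, anywhere on a page. You could memorize every place it might appear. Or you could carve a small rubber stamp in the shape of that line, press it everywhere, and mark each spot where it fits. A convolutional network takes the second route: it carves a handful of small stamps (detectors for edges, corners, and strokes) and presses each one across the entire image. Unlike real stamps, these are learned: the network refines their shapes bit by bit until they pick out exactly the features that tell a 2 from a 7.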

Where it sits

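This paper is an early link in a chain. Backpropagation (Rumelhart, Hinton, and Williams, 1986, in this collection) gave it a way to learn. This 1989 paper taught that learning to look at images. LeNet-5 (1998) refined the approach for reading bank checks. AlexNet (2012, also in this collection) scaled the recipe up on powerful graphics chips and a million web photos, setting off the modern wave of AI. Every time your phone sorts photos by face or your car recognizes a stop sign, you are looking at a great-grandchild of this network.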

The original document

Original source text

Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel · Neural Computation 1(4):541–551 (1989) · communicated by Dana Ballard

Abstract

The ability of learning networks to generalize can be greatly enhanced by providing constraints from the task domain. This paper demonstrates how such constraints can be integrated into a backpropagation network through the architecture of the network.

The abstract goes on: the approach "has been successfully applied to the recognition of handwritten zip code digits provided by the U.S. Postal Service. A single network learns the entire recognition operation, going from the normalized image of the character to the final classification."

§2 · The data base

The data base "consists of 9298 segmented numerals digitized from handwritten zip codes that appeared on U.S. mail passing through the Buffalo, NY post office." 7291 examples are used for training and 2007 for testing; each digit is normalized to a 16×16 grayscale image with gray levels scaled to fall within the range −1 to 1.
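A minimal sketch of the gray-level scaling described above, assuming raw pixels arrive as 8-bit values from 0 to 255. The excerpt does not state the raw range or which end of the scale is ink, so `lo`, `hi`, and the function name are illustrative. The constants at the end restate the split reported in the paper.

```python
def normalize_gray(pixels, lo=0, hi=255):
    """Linearly map raw gray levels in [lo, hi] onto [-1.0, 1.0]."""
    span = hi - lo
    return [[2.0 * (p - lo) / span - 1.0 for p in row] for row in pixels]


# Split sizes quoted from the paper.
TOTAL_EXAMPLES = 9298
TRAIN_EXAMPLES = 7291
TEST_EXAMPLES = 2007

print(normalize_gray([[0, 255]]))  # [[-1.0, 1.0]]
print(TRAIN_EXAMPLES + TEST_EXAMPLES == TOTAL_EXAMPLES)  # True
```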

§3 · Network design — feature maps and weight sharing

In our case, the first hidden layer is composed of several planes that we call feature maps. All units in a plane share the same set of weights, thereby detecting the same feature at different locations.

Three hidden layers (H1, H2, H3) feed a 10-unit output with place coding. H1 holds 12 feature maps of 8×8 units, each unit reading a 5×5 neighbourhood through a shared kernel — "a nonlinear subsampled convolution with a 5 by 5 kernel." In total the network has "1256 units, 64,660 connections, and 9760 independent parameters."
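The three totals quoted above can be reproduced with a short tally. The H1 figures come from the excerpt. The sizes of H2 (12 maps of 4×4 units, each unit reading 5×5 patches from 8 of the 12 H1 maps), H3 (30 fully connected units), and the one-unshared-bias-per-unit convention are taken from the full paper rather than from this excerpt, so treat them as assumptions to check against the source. The `Layer` class is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    name: str
    units: int
    fan_in: int        # weights feeding one unit, bias not counted
    free_weights: int  # independent weights left after sharing

    @property
    def connections(self) -> int:
        # Every unit has fan_in weighted inputs plus one bias connection.
        return self.units * (self.fan_in + 1)

    @property
    def parameters(self) -> int:
        # Shared kernels count once; each unit keeps its own bias.
        return self.free_weights + self.units


INPUT_UNITS = 16 * 16
KERNEL = 5 * 5

LAYERS = [
    Layer("H1", units=12 * 8 * 8, fan_in=KERNEL, free_weights=12 * KERNEL),
    Layer("H2", units=12 * 4 * 4, fan_in=8 * KERNEL,
          free_weights=12 * 8 * KERNEL),
    Layer("H3", units=30, fan_in=12 * 4 * 4, free_weights=30 * 12 * 4 * 4),
    Layer("output", units=10, fan_in=30, free_weights=10 * 30),
]

total_units = INPUT_UNITS + sum(layer.units for layer in LAYERS)
total_connections = sum(layer.connections for layer in LAYERS)
total_parameters = sum(layer.parameters for layer in LAYERS)

print(total_units, total_connections, total_parameters)  # 1256 64660 9760
```

Most of the connections sit in H1 and H2, where sharing applies, while most of the free parameters sit in the small fully connected H3.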

§4–5 · Training and results

Nodes use a scaled hyperbolic tangent; the cost is mean squared error; weights are updated by stochastic ("on-line") gradient with a diagonal-Hessian variant of Newton's method. "The network was trained for 23 passes through the training set (167,693 pattern presentations)."
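The training ingredients named above can be sketched for a single unit. The constants `A` and `S` are not given in the excerpt. The values below are the ones LeCun recommends in later writing and are an assumption here. The update is a plain on-line gradient step, swapped in for the diagonal-Hessian variant of Newton's method that the paper uses, and all function names are illustrative.

```python
import math

# Assumed constants, chosen so that scaled_tanh(1) is close to 1.
A = 1.7159
S = 2.0 / 3.0


def scaled_tanh(a):
    """Scaled hyperbolic tangent activation."""
    return A * math.tanh(S * a)


def scaled_tanh_grad(a):
    """Derivative of scaled_tanh with respect to its input."""
    t = math.tanh(S * a)
    return A * S * (1.0 - t * t)


def online_step(weights, bias, inputs, target, lr=0.01):
    """One stochastic-gradient update for one unit on one pattern.

    Returns the new weights, the new bias, and the squared-error loss
    measured before the update.
    """
    a = sum(w * x for w, x in zip(weights, inputs)) + bias
    error = scaled_tanh(a) - target
    delta = error * scaled_tanh_grad(a)  # d(0.5 * error**2) / da
    new_weights = [w - lr * delta * x for w, x in zip(weights, inputs)]
    new_bias = bias - lr * delta
    return new_weights, new_bias, 0.5 * error * error


# Usage: present the same toy pattern repeatedly and watch the loss fall.
weights, bias = [0.0, 0.0], 0.0
losses = []
for _ in range(200):
    weights, bias, loss = online_step(weights, bias, [1.0, -1.0], 1.0)
    losses.append(loss)

print(losses[0], losses[-1])
```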

The percentage of misclassified patterns was 0.14% on the training set (10 mistakes) and 5.0% on the test set (102 mistakes).
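As a quick arithmetic check, the reported counts can be turned back into the quoted figures. All inputs below are taken from the text. Note that 102 mistakes out of 2007 works out to roughly 5.08%, which the paper reports as 5.0%.

```python
TRAIN_SIZE, TEST_SIZE = 7291, 2007
PASSES = 23
TRAIN_MISTAKES, TEST_MISTAKES = 10, 102

# 23 full passes over the training set.
presentations = PASSES * TRAIN_SIZE

# Error rates as percentages of each set.
train_error_pct = 100.0 * TRAIN_MISTAKES / TRAIN_SIZE
test_error_pct = 100.0 * TEST_MISTAKES / TEST_SIZE

print(presentations)              # 167693
print(round(train_error_pct, 2))  # 0.14
print(round(test_error_pct, 2))   # 5.08
```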

[ … ]

§5.1 · Comparison with other work

This "constrained backpropagation" is the key to success of the present system: it not only builds in shift-invariance, but vastly reduces the entropy, the Vapnik-Chervonenkis dimensionality, and the number of free parameters.

§6 · Conclusion

We have successfully applied backpropagation learning to a large, real-world task. Our results appear to be at the state of the art in digit recognition.

The final network ran on a commercial AT&T DSP-32C signal processor at more than 10 classifications per second, camera to label. The full paper — with its architecture diagram, the synthesized kernels, and the error-versus-passes curves — runs to eleven pages and is available in full at the source below.

AT&T Bell Laboratories, Holmdel · 1989