Artificial Intelligence 1989

Backpropagation Applied to Handwritten Zip Code Recognition

Yann LeCun et al. (AT&T Bell Labs)

A net learns to read digits straight from pixels — and weight sharing is the trick.


In depth · the introduction

The U.S. mail had a problem: machines couldn't read the zip codes people scrawled on envelopes. This paper taught one to do it — by letting it learn from the pixels themselves.

The idea, unpacked

Earlier handwriting readers worked in two steps: a human expert hand-designed features to look for (a corner here, an edge there), and only then did a program classify the result. This paper threw out the hand-designed first step. It fed raw 16×16 images of digits straight into a single neural network and let the network discover, on its own, what to look for.

The clever part is how the network looks. Instead of giving every pixel its own private set of connections, it uses one small detector — a 5×5 patch of weights — and slides that same detector across the whole image. A vertical stroke looks the same whether it's on the left or the right, so why learn it twice? Reusing the same weights everywhere is called weight sharing, and it is the trick that made the whole thing work.
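A minimal sketch of that sliding detector, in plain Python. The kernel values, the helper names (`slide_detector`, `vertical_stroke`, `best_column`) and the stride-1, no-padding scan are illustrative assumptions, not the paper's learned weights or its exact layout. The point is only that one set of 25 weights finds the same stroke wherever it sits.

```python
def slide_detector(image, kernel):
    """Apply one shared kernel at every valid position of a 2-D image."""
    k = len(kernel)
    rows = len(image) - k + 1
    cols = len(image[0]) - k + 1
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # The same kernel[i][j] values are reused at every (r, c).
            out[r][c] = sum(
                kernel[i][j] * image[r + i][c + j]
                for i in range(k)
                for j in range(k)
            )
    return out


def vertical_stroke(col, size=16):
    """Hypothetical test image: a one-pixel-wide vertical line at `col`."""
    return [[1.0 if c == col else 0.0 for c in range(size)] for _ in range(size)]


# Hand-picked (not learned) 5x5 kernel that responds to a centred vertical line.
KERNEL = [[-1.0, -1.0, 4.0, -1.0, -1.0] for _ in range(5)]


def best_column(response):
    """Column where the detector fires most strongly in the first output row."""
    first = response[0]
    return first.index(max(first))


left = slide_detector(vertical_stroke(4), KERNEL)
right = slide_detector(vertical_stroke(11), KERNEL)
shift = best_column(right) - best_column(left)  # 7, same as the stroke's shift
```

Moving the stroke seven columns to the right moves the peak response seven columns as well, and the 25 weights never change.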

Where it came from

In the late 1980s, a young French researcher named Yann LeCun joined the legendary AT&T Bell Labs, where he and six colleagues tackled a concrete, money-on-the-table task handed over by the U.S. Postal Service: read the handwritten zip codes on real mail. The team had 9,298 digit images digitized from envelopes that passed through the Buffalo, New York post office: messy, slanted, smudged, written by many different hands.

Backpropagation — the method that lets a network learn from its mistakes — had been published just three years earlier by Rumelhart, Hinton and Williams (also in this Library). LeCun's team showed it could be pointed at a real industrial problem, not just a toy, and that it could train a network looking directly at pixels.

Why it mattered

By building knowledge about images into the architecture (local detectors, reused everywhere), the network needed far fewer numbers to learn (9,760) than a naive, fully connected design would. Fewer numbers to learn means less data needed and better guessing on examples it had never seen. The result was a 5% error rate on genuinely hard handwriting, from a network small enough to run on a commercial signal-processing chip and read real mail at more than ten digits a second. It was proof that a network could learn to see.

A rubber stamp that learns

Imagine looking for a particular shape — say, a short diagonal line — anywhere on a page. You could memorize every spot it might appear, or you could cut one small rubber stamp shaped like that line and press it everywhere, marking each place it fits. The convolutional network is the second way: it carves a handful of little stamps (detectors for edges, corners, strokes) and presses each one across the whole image. And unlike a real stamp, these are learned — the network files down their shape until they pick out exactly the features that tell a 2 from a 7.

Where it sits

This paper is an early link in a chain. Backpropagation (Rumelhart, Hinton & Williams, 1986, in this Library) gave it the way to learn; this 1989 paper made that learning see images; LeNet-5 (1998) refined it for bank checks; and AlexNet (2012, also here) blew the recipe up onto powerful graphics chips and a million internet photos, kicking off the modern wave of AI. Every time your phone groups your photos by face or your car spots a stop sign, you are watching a great-grandchild of this network.

The original document

Original source text

Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, L. D. Jackel · Neural Computation 1(4):541–551 (1989) · communicated by Dana Ballard

Abstract

The ability of learning networks to generalize can be greatly enhanced by providing constraints from the task domain. This paper demonstrates how such constraints can be integrated into a backpropagation network through the architecture of the network.

The abstract goes on: the approach "has been successfully applied to the recognition of handwritten zip code digits provided by the U.S. Postal Service. A single network learns the entire recognition operation, going from the normalized image of the character to the final classification."

§2 · The data base

The data base "consists of 9298 segmented numerals digitized from handwritten zip codes that appeared on U.S. mail passing through the Buffalo, NY post office." 7291 examples are used for training and 2007 for testing; each digit is normalized to a 16×16 grayscale image with gray levels scaled to fall within the range −1 to 1.
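The gray-level scaling can be sketched as a linear map. The 0 to 255 source range and the function name `scale_gray_levels` are assumptions for illustration; the excerpt says only that the final values fall between −1 and 1.

```python
def scale_gray_levels(image, lo=0.0, hi=255.0):
    """Linearly map gray levels from [lo, hi] onto [-1, 1]."""
    span = hi - lo
    return [[2.0 * (p - lo) / span - 1.0 for p in row] for row in image]


# Split sizes quoted above: training plus test should give the full data base.
TRAIN, TEST = 7291, 2007
total = TRAIN + TEST  # 9298

# Tiny hypothetical 2x2 "image" with assumed 8-bit gray levels.
sample = [[0, 255], [51, 204]]
scaled = scale_gray_levels(sample)
```

The darkest assumed level maps to −1, the brightest to 1, and everything else falls proportionally in between.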

§3 · Network design — feature maps and weight sharing

In our case, the first hidden layer is composed of several planes that we call feature maps. All units in a plane share the same set of weights, thereby detecting the same feature at different locations.

Three hidden layers (H1, H2, H3) feed a 10-unit output with place coding, one output unit per digit class. H1 holds 12 feature maps of 8×8 units, each unit reading a 5×5 neighborhood of the input through a shared kernel; because each map has half the input's resolution in each direction, the paper describes the operation as "a nonlinear subsampled convolution with a 5 by 5 kernel." In total the network has "1256 units, 64,660 connections, and 9760 independent parameters."
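The quoted totals can be reproduced with a little arithmetic. The excerpt states only the input size, the H1 layout and the 10 outputs. The H2 and H3 sizes, the "8 of 12" wiring between H1 and H2, and the unshared per-unit biases in the sketch below are assumptions drawn from the published paper, so treat this as a plausibility check rather than a specification. The function name `count_network` is hypothetical.

```python
def count_network():
    """Count units, connections and free parameters under the stated assumptions."""
    kernel = 5 * 5
    inputs = 16 * 16

    # H1: 12 maps of 8x8 units, one shared 5x5 kernel per map.
    h1_units = 12 * 8 * 8
    h1_conn = h1_units * (kernel + 1)    # 25 inputs + 1 bias per unit
    h1_params = 12 * kernel + h1_units   # shared kernels, ASSUMED unshared biases

    # H2 (ASSUMED): 12 maps of 4x4 units, each reading 5x5 patches in 8 H1 maps.
    h2_units = 12 * 4 * 4
    h2_fan_in = 8 * kernel
    h2_conn = h2_units * (h2_fan_in + 1)
    h2_params = 12 * h2_fan_in + h2_units

    # H3 (ASSUMED): 30 units fully connected to H2. Output: 10 units on H3.
    h3_units, out_units = 30, 10
    h3_conn = h3_units * (h2_units + 1)
    out_conn = out_units * (h3_units + 1)

    return {
        "units": inputs + h1_units + h2_units + h3_units + out_units,
        "connections": h1_conn + h2_conn + h3_conn + out_conn,
        # Fully connected layers have one free parameter per connection.
        "parameters": h1_params + h2_params + h3_conn + out_conn,
        "h1_shared": h1_params,
        # The same 768 units wired to all 256 pixels, plus a bias each.
        "h1_fully_connected": h1_units * (inputs + 1),
    }


counts = count_network()
```

Under these assumptions the totals come out to 1256 units, 64,660 connections and 9760 parameters, matching the quoted figures. In H1 alone, weight sharing leaves 1,068 free parameters where full connectivity would need 197,376.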

§4–5 · Training and results

Nodes use a scaled hyperbolic tangent; the cost is mean squared error; weights are updated after every pattern (the stochastic, or "on-line," gradient procedure) using a variant of Newton's method with a diagonal approximation of the Hessian. "The network was trained for 23 passes through the training set (167,693 pattern presentations)."
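A rough sketch of one on-line update for a single unit, under loud assumptions: the constants `A = 1.7159` and `S = 2/3` for the scaled tanh are not given in this excerpt, plain gradient descent stands in for the diagonal-Hessian Newton variant, and the two-input unit, learning rate and function names are hypothetical.

```python
import math

A, S = 1.7159, 2.0 / 3.0  # ASSUMED scaling constants, not stated in the excerpt


def scaled_tanh(a):
    """Scaled hyperbolic tangent activation."""
    return A * math.tanh(S * a)


def scaled_tanh_slope(a):
    """Derivative of scaled_tanh with respect to its input."""
    t = math.tanh(S * a)
    return A * S * (1.0 - t * t)


def online_step(weights, bias, x, target, lr=0.1):
    """One per-pattern update under squared error (plain SGD, not Newton)."""
    a = sum(w * xi for w, xi in zip(weights, x)) + bias
    err = scaled_tanh(a) - target
    delta = err * scaled_tanh_slope(a)  # dE/da for E = 0.5 * err**2
    new_weights = [w - lr * delta * xi for w, xi in zip(weights, x)]
    new_bias = bias - lr * delta
    return new_weights, new_bias, 0.5 * err * err


# Presentation count from the quoted figures: passes times training-set size.
PASSES, TRAIN_SIZE = 23, 7291
presentations = PASSES * TRAIN_SIZE  # 167693

# Repeatedly present one hypothetical pattern and record the loss each time.
weights, bias = [0.1, -0.2], 0.0
losses = []
for _ in range(20):
    weights, bias, loss = online_step(weights, bias, [1.0, -1.0], 1.0)
    losses.append(loss)
```

With the assumed constants the activation passes almost exactly through 1 at an input of 1, which suits targets of +1 and −1. The loss on the repeated pattern shrinks from step to step.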

The percentage of misclassified patterns was 0.14% on the training set (10 mistakes) and 5.0% on the test set (102 mistakes).

[ … ]

§5.1 · Comparison with other work

This "constrained backpropagation" is the key to success of the present system: it not only builds in shift-invariance, but vastly reduces the entropy, the Vapnik-Chervonenkis dimensionality, and the number of free parameters.

§6 · Conclusion

We have successfully applied backpropagation learning to a large, real-world task. Our results appear to be at the state of the art in digit recognition.

The final network ran on a commercial AT&T DSP-32C signal processor at more than 10 classifications per second, camera to label. The full paper — with its architecture diagram, the synthesized kernels, and the error-versus-passes curves — runs to eleven pages and is available in full at the source below.

AT&T Bell Laboratories, Holmdel · 1989