Backpropagation Applied to Handwritten Zip Code Recognition
A net learns to read digits straight from pixels — and weight sharing is the trick.
The U.S. mail had a problem: machines couldn't read the zip codes people scrawled on envelopes. This paper taught one to do it — by letting it learn from the pixels themselves.
The idea, unpacked
Earlier handwriting readers worked in two steps: a human expert hand-designed features to look for (a corner here, an edge there), and only then did a program classify the result. This paper threw out the hand-designed first step. It fed raw 16×16 images of digits straight into a single neural network and let the network discover, on its own, what to look for.
The clever part is how the network looks. Instead of giving every pixel its own private set of connections, it uses one small detector — a 5×5 patch of weights — and slides that same detector across the whole image. A vertical stroke looks the same whether it's on the left or the right, so why learn it twice? Reusing the same weights everywhere is called weight sharing, and it is the trick that made the whole thing work.
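To make the sliding detector concrete, here is a minimal sketch in plain Python. It is an illustration, not the paper's code: the detector is set by hand rather than learned, it moves one pixel at a time with no padding, and the names slide_detector, image_with_vertical_stroke and VERTICAL_KERNEL are made up for this example. The point is that one set of 25 weights responds to a vertical stroke wherever the stroke sits.

```python
def slide_detector(image, kernel):
    """Apply one shared kernel at every valid position (stride 1, no padding)."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = 0.0
            # The SAME kernel weights are used at every (r, c): weight sharing.
            for i in range(k):
                for j in range(k):
                    total += kernel[i][j] * image[r + i][c + j]
            row.append(total)
        output.append(row)
    return output


def image_with_vertical_stroke(column, size=16):
    """Blank size x size image with a one-pixel-wide vertical stroke."""
    return [[1.0 if c == column else 0.0 for c in range(size)]
            for _ in range(size)]


def peak_column(feature_map):
    """Column where the detector responds most strongly in the top row."""
    top_row = feature_map[0]
    return top_row.index(max(top_row))


# Hypothetical hand-set detector (assumption): a vertical line through
# the centre column of a 5x5 patch. A trained network would learn this.
VERTICAL_KERNEL = [[0.0, 0.0, 1.0, 0.0, 0.0] for _ in range(5)]

left = slide_detector(image_with_vertical_stroke(3), VERTICAL_KERNEL)
right = slide_detector(image_with_vertical_stroke(12), VERTICAL_KERNEL)

print(peak_column(left), peak_column(right))
```

Moving the stroke from column 3 to column 12 moves the peak response by the same nine columns, and no new weights were needed to detect it in the second place.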
Where it came from
In the late 1980s, a young French researcher named Yann LeCun joined the legendary AT&T Bell Labs, where he and six colleagues tackled a concrete, money-on-the-table task handed over by the U.S. Postal Service: read the handwritten zip codes on real mail. The team had nearly ten thousand digit images digitized from envelopes that passed through the Buffalo, New York post office. They were messy, slanted, smudged, and written by thousands of different hands.
Backpropagation — the method that lets a network learn from its mistakes — had been published just three years earlier by Rumelhart, Hinton and Williams (also in this Library). LeCun's team showed it could be pointed at a real industrial problem, not just a toy, and that it could train a network looking directly at pixels.
Why it mattered
By building knowledge about images into the architecture (local detectors, reused everywhere), the network needed far fewer numbers to learn, 9,760 in all, than a naive fully connected design would. Fewer numbers to learn means less data needed and better guessing on examples it had never seen. The result was a 5% error rate on genuinely hard handwriting, from a network small enough to run on cheap hardware and read real mail at more than ten digits a second. It was proof that a network could learn to see.
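The saving from weight sharing can be checked with a little arithmetic. The sketch below counts the numbers to learn for a single layer under three designs. The layer shape is an assumption taken from the paper's description of its first hidden layer rather than from this page: 12 feature maps of 8×8 units, each unit looking at a 5×5 patch and keeping its own bias. The function name count_parameters and the dictionary keys are made up for illustration.

```python
# Assumed layer shape (from the 1989 paper's first hidden layer, not this page).
FEATURE_MAPS = 12   # number of different detectors
MAP_SIDE = 8        # each feature map is 8 x 8 units
KERNEL_SIDE = 5     # each unit looks at a 5 x 5 patch
INPUT_SIDE = 16     # the input image is 16 x 16 pixels


def count_parameters():
    """Count learnable numbers for one layer under three designs."""
    units = FEATURE_MAPS * MAP_SIDE * MAP_SIDE
    weights_per_patch = KERNEL_SIDE * KERNEL_SIDE
    pixels = INPUT_SIDE * INPUT_SIDE
    return {
        # One 5x5 kernel per feature map, plus one bias per unit (assumed unshared).
        "shared": FEATURE_MAPS * weights_per_patch + units,
        # Same local 5x5 patches, but every unit keeps private weights.
        "local_unshared": units * (weights_per_patch + 1),
        # Every unit connected to every pixel, plus a bias.
        "fully_connected": units * (pixels + 1),
    }


counts = count_parameters()
for design, total in counts.items():
    print(f"{design}: {total}")
```

Under these assumptions the shared design learns roughly a twentieth of what the unshared local design does, and less than a hundredth of the fully connected one, for the same number of units.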
A rubber stamp that learns
Imagine looking for a particular shape — say, a short diagonal line — anywhere on a page. You could memorize every spot it might appear, or you could cut one small rubber stamp shaped like that line and press it everywhere, marking each place it fits. The convolutional network is the second way: it carves a handful of little stamps (detectors for edges, corners, strokes) and presses each one across the whole image. And unlike a real stamp, these are learned — the network files down their shape until they pick out exactly the features that tell a 2 from a 7.
Where it sits
This is an early link in a chain. Backpropagation (Rumelhart, Hinton & Williams, 1986, in this Library) gave it the way to learn; this 1989 paper made that learning see images; LeNet-5 (1998) refined it for bank checks; and AlexNet (2012, also here) blew the recipe up onto powerful graphics chips and a million internet photos, kicking off the modern wave of AI. Every time your phone groups your photos by face or your car spots a stop sign, you are watching a great-grandchild of this network.
The ability of learning networks to generalize can be greatly enhanced by providing constraints from the task domain. This paper demonstrates how such constraints can be integrated into a backpropagation network through the architecture of the network.
In our case, the first hidden layer is composed of several planes that we call feature maps. All units in a plane share the same set of weights, thereby detecting the same feature at different locations.
The percentage of misclassified patterns was 0.14% on the training set (10 mistakes) and 5.0% on the test set (102 mistakes).
This "constrained backpropagation" is the key to success of the present system: it not only builds in shift-invariance, but vastly reduces the entropy, the Vapnik-Chervonenkis dimensionality, and the number of free parameters.
We have successfully applied backpropagation learning to a large, real-world task. Our results appear to be at the state of the art in digit recognition.