Convolutional Networks: Seeing with Filters

The problem with treating an image as a flat list

You already know the multilayer perceptron: stack hidden layers of neurons, each one connected to every input. That works beautifully for a handful of features. But a modest photo is 224x224 pixels with 3 color channels — over 150,000 inputs. Connect that to even one layer of a few thousand neurons and you have hundreds of millions of parameters in the first layer alone.
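The arithmetic behind that claim is easy to check. The sketch below assumes a hidden layer of 2,000 neurons as a stand-in for "a few thousand"; that width is an illustrative choice, not a figure from the text.

```python
# Count the parameters of one fully connected layer on a flattened image.
height, width, channels = 224, 224, 3
inputs = height * width * channels      # 150,528 input values

hidden_units = 2000                     # assumed width ("a few thousand")
weights = inputs * hidden_units         # one weight per (input, neuron) pair
biases = hidden_units                   # one bias per neuron
fc_params = weights + biases

print(f"inputs:     {inputs:,}")        # inputs:     150,528
print(f"parameters: {fc_params:,}")     # parameters: 301,058,000
```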

Worse, that fully connected layer treats every pixel as unrelated to its neighbors. It would have to learn what a cat's ear looks like separately in the top-left corner, the middle, and the bottom-right — as if each location were a different world. A photo is not a flat list of independent numbers; it is a grid where nearby pixels belong together and where a pattern means the same thing wherever it appears. We want an architecture that knows this in advance.

Convolution: slide a tiny filter across the grid

Here is the core idea. Instead of one giant layer, take a tiny grid of weights, say 3x3, called a filter or kernel. Lay it over the top-left 3x3 patch of the image, multiply each filter weight by the pixel under it, and add everything up into a single number. Then slide the filter one step to the right and repeat, row after row, until it has visited every position. This sliding-and-summing operation is the convolution.

The crucial trick is weight sharing: the same 3x3 filter is reused at every position. So a filter that has learned to fire on a vertical edge will detect that edge anywhere in the image, with the same nine weights doing the work everywhere. (On a color image the filter spans all three channels, so it has 3x3x3 = 27 weights plus a bias.) This is how the architecture bakes in the assumption from the last section, and how it cuts the parameter count from hundreds of millions for a fully connected layer to a few dozen per filter.

image patch        filter (learned)
[10 10 80]         [-1  0  1]
[10 10 80]    *    [-1  0  1]   -> one output number
[10 10 80]         [-1  0  1]

(-1*10 + 0*10 + 1*80) x 3 rows = 210   <- big response: an edge is here

A 3x3 filter multiplies, sums, and reports a single number; a large value means the pattern it looks for is present.
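The worked example above can be reproduced in a few lines of NumPy. This is a minimal sketch under stated assumptions: the function name `slide_filter` is made up for illustration, it handles a single-channel image with stride 1 and no padding, and the 4x6 test image is invented. Like most deep-learning libraries, it does not flip the filter, so strictly speaking it computes a cross-correlation.

```python
import numpy as np

def slide_filter(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and collect every response."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            patch = image[r:r + kh, c:c + kw]
            out[r, c] = np.sum(patch * kernel)   # multiply element-wise, then add up
    return out

# The patch and vertical-edge filter from the example above.
patch = np.array([[10, 10, 80],
                  [10, 10, 80],
                  [10, 10, 80]])
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]])

single = np.sum(patch * kernel)
print(single)                  # 210

# A wider (invented) image: dark on the left, bright on the right.
image = np.array([[10, 10, 10, 80, 80, 80]] * 4)
responses = slide_filter(image, kernel)
print(responses.tolist())      # [[0.0, 210.0, 210.0, 0.0], [0.0, 210.0, 210.0, 0.0]]
```

The response is zero over the flat regions and large only where the window straddles the dark-to-bright boundary.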

Feature maps: a filter draws a map of where it fired

Slide one filter over the whole image and collect all its outputs back into a grid. That grid is a feature map — a picture of where, and how strongly, this particular filter responded. An edge-detecting filter produces a feature map that lights up along the edges. A layer does not use just one filter; it uses many (say 64) in parallel, so a single layer turns one image into a stack of 64 feature maps, each highlighting a different pattern.
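A layer with many filters can be sketched the same way. The code below is an illustration under stated assumptions: the helper `conv_layer` is hypothetical, the filters are random rather than learned, and a small 8x8 RGB image stands in for a real photo to keep the loops fast. Each filter spans all input channels, so a 3x3 filter on a color image carries 3x3x3 = 27 weights plus a bias.

```python
import numpy as np

def conv_layer(image, filters, biases):
    """image: (H, W, C); filters: (F, k, k, C). Returns maps of shape (H-k+1, W-k+1, F)."""
    n_filters, k, _, _ = filters.shape
    out_h = image.shape[0] - k + 1
    out_w = image.shape[1] - k + 1
    maps = np.zeros((out_h, out_w, n_filters))
    for f in range(n_filters):
        for r in range(out_h):
            for c in range(out_w):
                window = image[r:r + k, c:c + k, :]            # k x k x C window
                maps[r, c, f] = np.sum(window * filters[f]) + biases[f]
    return maps

rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))                   # toy RGB image (assumed size)
filters = rng.standard_normal((64, 3, 3, 3))    # 64 filters, random stand-ins for learned weights
biases = np.zeros(64)

feature_maps = conv_layer(image, filters, biases)
print(feature_maps.shape)                       # (6, 6, 64): one 6x6 map per filter

n_params = filters.size + biases.size
print(n_params)                                 # 1792, no matter how large the image is
```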

After each convolution we apply a nonlinearity, most commonly ReLU, which simply zeroes out negative responses. Then comes the real magic: stack these layers. The first layer's filters typically learn edges and color blobs; the second layer sees the first layer's feature maps and tends to learn corners and textures; deeper layers compose those into eyes, wheels, faces. Nobody hand-codes these stages; the network discovers them through backpropagation. This stacking is exactly the hierarchical representation learning this rung is about, now realized for images.
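ReLU itself is a one-liner. A minimal sketch, with made-up filter responses as input:

```python
import numpy as np

def relu(x):
    """Keep positive values; replace negatives with zero."""
    return np.maximum(x, 0)

responses = np.array([[-210, 0, 210],
                      [-35, 70, -5]])       # invented filter responses
activated = relu(responses)
print(activated.tolist())                   # [[0, 0, 210], [0, 70, 0]]
```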

Stride, padding, and pooling: controlling size and zooming out

Two knobs control how the filter sweeps the grid: stride and padding. Stride is how far the filter jumps each step: stride 1 visits every position; stride 2 visits every other one, halving the output's width and height. Padding adds a border of zeros around the image so the filter can be centered on the very edges. Without it, every convolution shrinks the map a little (a 3x3 filter trims one pixel from each side), and pixels near the border contribute to fewer outputs than pixels in the middle.
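The effect of both knobs on output size follows a standard formula: for input width n, filter width k, padding p on each side, and stride s, the output width is floor((n + 2p - k) / s) + 1. A small sketch, with the function name `conv_output_size` chosen here for illustration:

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Output width (or height) of a convolution over an input of width n."""
    return (n + 2 * padding - k) // stride + 1

print(conv_output_size(224, 3))                        # 222: no padding, the map shrinks by 2
print(conv_output_size(224, 3, padding=1))             # 224: a one-pixel border keeps the size
print(conv_output_size(224, 3, stride=2, padding=1))   # 112: stride 2 halves the size
```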

To zoom out deliberately we use pooling. Max pooling, the common kind, slides a small window (often 2x2, moving two pixels at a time) and keeps only the largest value in each window, throwing the rest away. This halves the width and height of each feature map and gives a small, useful tolerance: if the detected pattern shifts by a pixel or two, the pooled output often barely changes. Layer by layer, the spatial grid gets coarser while the number of feature maps grows: the network trades "where exactly" for "what, roughly."
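A minimal sketch of 2x2 max pooling and the shift tolerance it buys. The function name `max_pool_2x2` and the two toy feature maps are invented for illustration, and the reshape trick assumes the map's height and width are even.

```python
import numpy as np

def max_pool_2x2(fmap):
    """2x2 max pooling with stride 2 on a 2D map whose sides are even."""
    h, w = fmap.shape
    blocks = fmap.reshape(h // 2, 2, w // 2, 2)   # split into non-overlapping 2x2 blocks
    return blocks.max(axis=(1, 3))                # keep the largest value in each block

fmap = np.array([[0, 0, 0, 0],
                 [0, 9, 0, 0],
                 [0, 0, 0, 0],
                 [0, 0, 0, 5]])
shifted = np.array([[9, 0, 0, 0],                 # the 9 moved one pixel up and left
                    [0, 0, 0, 0],
                    [0, 0, 0, 0],
                    [0, 0, 0, 5]])

pooled = max_pool_2x2(fmap)
pooled_shifted = max_pool_2x2(shifted)
print(pooled.tolist())            # [[9, 0], [0, 5]]
print(pooled_shifted.tolist())    # [[9, 0], [0, 5]]
```

The tolerance is limited: the output is unchanged here only because the shifted value stays inside the same 2x2 window. A shift that crosses a window boundary does change the pooled map.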

Start: one 224x224 image with 3 color channels.
Convolve with 64 filters (stride 1, zero-padded) + ReLU -> 64 feature maps, each 224x224, each produced by one small filter shared across all positions.
Pool 2x2 -> still 64 maps but now 112x112: half the width and height, slightly shift-tolerant.
Repeat, deeper layers using more filters on coarser grids, until a final classifier reads off the answer.
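The walk-through can be traced as pure shape bookkeeping. The sketch below assumes 3x3 filters with stride 1 and one pixel of zero padding (which is what keeps the maps at 224x224) and 2x2 pooling with stride 2. The filter counts after the first layer (128 and 256) are illustrative assumptions, not values from the text.

```python
def conv_output_size(n, k=3, stride=1, padding=1):
    """Output width of a convolution; the defaults preserve the input size."""
    return (n + 2 * padding - k) // stride + 1

size, channels = 224, 3
trace = [(size, size, channels)]            # (height, width, number of maps)

for n_filters in (64, 128, 256):            # 128 and 256 are assumed for illustration
    size = conv_output_size(size)           # 3x3 conv, padding 1: spatial size unchanged
    channels = n_filters                    # one feature map per filter
    trace.append((size, size, channels))
    size = size // 2                        # 2x2 max pool, stride 2: halve width and height
    trace.append((size, size, channels))

for shape in trace:
    print(shape)
# (224, 224, 3)
# (224, 224, 64)
# (112, 112, 64)
# (112, 112, 128)
# (56, 56, 128)
# (56, 56, 256)
# (28, 28, 256)
```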

Why this fits images — and where the hype outruns the truth

Now the pieces fit. Keeping the data as a grid preserves the spatial layout of the image. Weight sharing matches the fact that a cat's whisker looks the same wherever it sits. Local filters respect that meaning lives in neighborhoods of pixels. Pooling grants a little tolerance to small shifts. Together these properties let a convolutional neural network reach strong image-classification accuracy with orders of magnitude fewer parameters than a comparable fully connected net, which is a large part of why CNNs ignited the deep-learning era in vision around 2012.

Be honest about the limits. "Seeing with filters" is a metaphor: a CNN matches statistical patterns, it does not understand a scene. It can be fooled by tiny, deliberately crafted pixel changes invisible to you, and it often latches onto background texture rather than the object you care about. Its shift-tolerance is mild and local — rotate or rescale the object substantially and it can fail, unless you trained it on such variation. And the convolutional inductive bias is a helpful default, not a law: vision transformers, which use attention instead of fixed local filters, now match or beat CNNs when data is plentiful.