Backpropagation Demystified

The problem: who gets the blame?

In the last guide you sent an input through the net — the forward pass — and out came a prediction. You compared it to the right answer and got a single number: the loss, a measure of how wrong the network was. Now comes the hard part. The network has thousands or millions of weights scattered across its layers. To improve, it needs to know, for *each one of them*, the same thing: if I made you a tiny bit bigger, would the loss go up or down, and by how much?

That "how much would the loss change" number is a gradient — strictly, a partial derivative of the loss with respect to that one weight. Collect the gradient for every weight and you have a giant arrow pointing in the direction that makes the loss worse fastest. Step the opposite way — that is gradient descent, which the next guide covers — and the net improves. Backpropagation is *not* the learning step. It is the thing that computes all those gradients, efficiently, before learning happens.

You could imagine a brute-force way to get these numbers: nudge one weight, run the whole forward pass again, see how the loss changed, repeat for all million weights. That works, but for a million weights it means a million full passes per single training example, and each result is only an approximation of the true slope. It is hopelessly slow. Backpropagation gets the *exact* gradients that nudging can only approximate, for roughly the cost of one extra pass, and that efficiency is the whole reason deep nets are trainable at all.
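To make the cost of the brute-force approach concrete, here is a minimal sketch of gradient-by-nudging. The tiny model (a weighted sum with a squared-error loss), the data, and the helper names `forward` and `nudge_gradients` are all made up for illustration; the point is only that the number of forward passes grows with the number of weights.

```python
# Brute-force gradients by nudging: one extra forward pass per weight.
# The model, data, and step size are illustrative assumptions.

def forward(weights, inputs, target):
    """Tiny 'network': a weighted sum followed by a squared-error loss."""
    prediction = sum(w * x for w, x in zip(weights, inputs))
    return (prediction - target) ** 2

def nudge_gradients(weights, inputs, target, eps=1e-6):
    """Estimate d(loss)/d(w_i) for every weight by nudging each one in turn."""
    base_loss = forward(weights, inputs, target)
    passes = 1  # the baseline forward pass
    grads = []
    for i in range(len(weights)):
        nudged = list(weights)
        nudged[i] += eps
        grads.append((forward(nudged, inputs, target) - base_loss) / eps)
        passes += 1  # one more full forward pass for this weight
    return grads, passes

weights = [0.5, -1.0, 2.0]
inputs = [1.0, 2.0, 3.0]
target = 1.0

approx_grads, forward_passes = nudge_gradients(weights, inputs, target)

# For this simple model the exact gradient has a closed form:
# d loss / d w_i = 2 * (prediction - target) * x_i
prediction = sum(w * x for w, x in zip(weights, inputs))
exact_grads = [2 * (prediction - target) * x for x in inputs]

print("forward passes used:", forward_passes)
print("nudged estimates:", approx_grads)
print("exact gradients: ", exact_grads)
```

Three weights already need four forward passes, and the estimates only come close to the exact values rather than matching them.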

One trick to rule them all: the chain rule

A neural network is a long chain of simple operations stacked on top of each other: multiply by weights, add a bias, squash through an activation, feed the result into the next layer, and so on until the loss. When functions are nested like that, calculus has a precise tool for finding how a change at the very start ripples to the very end: the chain rule. Its idea is almost insultingly simple — *multiply the local rates of change along the path.*

Think of a row of gears. If gear B turns three times for every turn of gear A, and gear C turns twice for every turn of gear B, then one turn of A produces six turns of C: you just multiply, 3 × 2. The chain rule says exactly this for functions: the sensitivity of the loss to an early weight is the product of the little sensitivities at every link between that weight and the loss. Each link is a *local* question ("how does my output change when my input changes?"), and local questions are easy. A multiply node, an add node, a ReLU all have dead-simple local derivatives.
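The multiply-along-the-path idea can be checked numerically. The sketch below chains three arbitrary functions (named `f`, `g`, and `h` purely for illustration), multiplies their local derivatives, and compares the product against a slope measured by nudging the input.

```python
import math

# Three links in a chain; the functions are arbitrary illustrative choices.
def f(x):
    return 3.0 * x        # local derivative: 3

def g(u):
    return u ** 2         # local derivative: 2u

def h(v):
    return math.sin(v)    # local derivative: cos(v)

x = 0.5
u = f(x)
v = g(u)
y = h(v)

# Chain rule: multiply the local rates of change along the path.
d_u_dx = 3.0
d_v_du = 2.0 * u
d_y_dv = math.cos(v)
chain_rule_slope = d_y_dv * d_v_du * d_u_dx

# Independent check: nudge x a little in both directions and measure the slope.
eps = 1e-6
numeric_slope = (h(g(f(x + eps))) - h(g(f(x - eps)))) / (2 * eps)

print("chain rule:", chain_rule_slope)
print("numeric:   ", numeric_slope)
```

The two numbers should agree to several decimal places; the small remaining gap comes from the nudging, not from the chain rule.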

The computational graph: the map of every calculation

To run the chain rule mechanically, we first draw the calculation as a computational graph: a diagram where each node is one small operation (a multiply, an add, an activation) and arrows show which result feeds into which. The forward pass is just walking this graph left to right, filling in a number at each node. Crucially, every node also remembers its inputs — it will need them on the way back.

Then we walk the graph the other way, right to left, and this is where the name *back*propagation comes from. We start at the loss with a gradient of 1 (the loss's sensitivity to itself is exactly 1) and push that signal backward. At each node we ask the local question, multiply the incoming gradient by the local derivative, and hand the product to the nodes feeding in. The signal that arrives at any weight, after all those multiplications, *is* its gradient. One backward sweep, every gradient at once.

When a node's output fans out to feed several places downstream, gradients flowing back from all those places simply *add up* at the node — blame from every path it influenced is summed. That single rule (multiply along a path, add across paths) is the entire algorithm. It is why a weight in an early hidden layer, which touches the loss through many downstream routes, correctly accumulates the blame from all of them.
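Here is a small sketch of the "add across paths" rule, worked by hand. A hidden value `h` feeds two outputs, so blame reaches it along two routes. All names and numbers are invented for illustration.

```python
# One hidden value h fans out to two outputs, each with its own target.
w, x = 0.5, 2.0          # weight and input feeding the hidden node
u1, u2 = 3.0, -1.0       # weights from h to the two outputs
t1, t2 = 1.0, 0.0        # targets for the two outputs

# Forward pass
h = w * x
out1 = u1 * h
out2 = u2 * h
loss = (out1 - t1) ** 2 + (out2 - t2) ** 2

# Backward pass
g_loss = 1.0
g_out1 = g_loss * 2 * (out1 - t1)   # d loss / d out1
g_out2 = g_loss * 2 * (out2 - t2)   # d loss / d out2

# h influenced the loss through both outputs: multiply along each path,
# then add the two contributions together.
g_h = g_out1 * u1 + g_out2 * u2
g_w = g_h * x                        # d h / d w = x

print("gradient for h:", g_h)
print("gradient for w:", g_w)
```

Dropping either term from `g_h` would give a gradient that ignores one of the routes by which `w` affects the loss.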

# Example values (made up for illustration)
w, x, b, target = 0.5, 2.0, 0.1, 3.0

def relu(z):
    return max(0.0, z)

# Forward: remember each node's inputs
z = w * x + b             # node sees w and x
a = relu(z)               # node sees z
loss = (a - target) ** 2  # node sees a

# Backward: start at 1, multiply local derivatives
g_loss = 1.0
g_a = g_loss * 2 * (a - target)      # d loss / d a
g_z = g_a * (1.0 if z > 0 else 0.0)  # d relu / d z
g_w = g_z * x                        # d z / d w  -> gradient for w!
g_b = g_z * 1.0                      # d z / d b  -> gradient for b!

One neuron, forward then backward. Each backward line is just "incoming gradient × local derivative."

Automatic differentiation: the machine does the calculus

Here is the liberating part: you never have to derive any of this by hand. Modern frameworks build the computational graph for you as the forward pass runs, then replay it backwards automatically. This is automatic differentiation (often "autodiff" or "autograd"). It is not the symbolic algebra you did in school, and it is not the shaky finite-difference nudging from the first section. It computes gradients that are exact up to floating-point rounding by composing the known local derivatives of each primitive operation.

Practically this means you write only the forward computation — the layers, the activation, the loss — in plain code, and call something like loss.backward(). The framework already knew the local derivative of every multiply, add, ReLU, and softmax it ran, so it walks the recorded graph backward and deposits a gradient onto every parameter. Backprop is the *specific* application of reverse-mode autodiff to a network's loss; autodiff is the general engine.
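To show what "record the graph, then replay it backward" can look like, here is a toy reverse-mode autodiff engine in plain Python. The `Value` class and its methods are hypothetical and written only for this illustration; they are not the API of any real framework, which would be far more general and efficient.

```python
class Value:
    """A number that remembers how it was computed (toy illustration only)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # nodes this value was computed from
        self._local_grads = local_grads  # d(self)/d(parent), one per parent

    @staticmethod
    def _wrap(other):
        return other if isinstance(other, Value) else Value(other)

    def __add__(self, other):
        other = Value._wrap(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __sub__(self, other):
        other = Value._wrap(other)
        return Value(self.data - other.data, (self, other), (1.0, -1.0))

    def __mul__(self, other):
        other = Value._wrap(other)
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def relu(self):
        slope = 1.0 if self.data > 0 else 0.0
        return Value(max(0.0, self.data), (self,), (slope,))

    def backward(self):
        # Order the graph so each node is handled after everything downstream.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0  # the loss's sensitivity to itself
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                # multiply along a path, add across paths
                parent.grad += node.grad * local


# Only the forward computation is written by hand.
w, x, b = Value(0.5), Value(2.0), Value(0.1)
a = (w * x + b).relu()
diff = a - 3.0
loss = diff * diff

loss.backward()
print("d loss / d w:", w.grad)
print("d loss / d b:", b.grad)
```

Note that `diff` is used twice in `diff * diff`, so its gradient arrives along two paths and is summed, which is how the engine recovers the familiar factor of 2 in the derivative of a square.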

When the gradient signal fades

That "multiply along the path" rule has a dark side. Send a gradient back through many layers and you are multiplying many numbers together. If those local derivatives are mostly smaller than 1, the product shrinks toward zero. Old activations like the sigmoid tend to produce exactly that: the sigmoid's slope never exceeds 0.25 and is close to zero outside a narrow band. By the time the signal reaches the earliest layers it is a whisper. Those layers barely update, and they learn painfully slowly. This is the famous vanishing gradient problem.

The mirror image also happens: if the local factors are larger than 1, the product can explode into huge numbers and training blows up. These are not exotic bugs — they are the direct, honest consequence of the chain rule's multiplication, and for years they made deep networks nearly impossible to train. Much of the progress that unlocked modern deep learning was about keeping this backward signal alive: activations like ReLU that don't squash the slope, careful weight initialization, and architectural tricks that give the gradient a clean shortcut home.
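A rough sketch of the arithmetic behind both failure modes follows. It deliberately simplifies: it multiplies one fixed local factor per layer and ignores the weight terms that also enter the product in a real network. The depth of 30 and the factor 1.5 are arbitrary illustrative choices.

```python
import math

def sigmoid_slope(z):
    """Derivative of the sigmoid at z; its maximum is 0.25, reached at z = 0."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def backward_signal(local_factor, depth):
    """Multiply the same local derivative once per layer, starting from 1."""
    signal = 1.0
    for _ in range(depth):
        signal *= local_factor
    return signal

depth = 30

best_case_sigmoid = backward_signal(sigmoid_slope(0.0), depth)  # 0.25 per layer
active_relu = backward_signal(1.0, depth)                       # slope 1 when z > 0
oversized = backward_signal(1.5, depth)                         # factors above 1

print("sigmoid, best case:", best_case_sigmoid)
print("active ReLU:       ", active_relu)
print("factor of 1.5:     ", oversized)
```

Even in the sigmoid's best case the signal collapses to almost nothing after 30 layers, while a per-layer factor only modestly above 1 grows into the hundreds of thousands.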

It is worth being honest about what backprop is and isn't. It is exact, efficient, and the workhorse behind essentially every net you have heard of. But it is not how brains learn — real neurons have no global backward pass shipping error signals down the exact wires they came up. And it is not intelligence; it is a slope-finding procedure. The remarkable thing is how far that humble procedure, repeated billions of times, can carry a pile of numbers.

Putting it together

So here is the full loop, the one that repeats for every batch of data, millions of times, as a network learns. Backpropagation is steps 2 and 3 — the part that figures out which direction is downhill for every weight at once.

1. Forward pass: feed an input through the layers, recording each operation in the computational graph, and compute the loss at the end.
2. Seed the backward pass: start at the loss with a gradient of 1.
3. Backpropagate: sweep right-to-left through the graph, at each node multiplying the incoming gradient by the local derivative and summing where paths merge, until every weight holds its own gradient.
4. Update (this is the optimizer, not backprop): nudge every weight a small step opposite its gradient. Then go back to step 1 with the next batch.
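The loop above can be sketched end to end for a single neuron. Everything specific here is an assumption made for illustration: the toy data (targets following 2x + 1), the starting weights, the learning rate, the number of passes over the data, and the choice to update after every example rather than after a batch.

```python
def relu(z):
    return max(0.0, z)

# Toy data: the targets follow 2*x + 1, which the neuron should discover.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

w, b = 0.5, 0.5          # arbitrary starting parameters
learning_rate = 0.05     # arbitrary small step size
epoch_losses = []

for epoch in range(200):
    total_loss = 0.0
    for x, target in data:
        # Forward pass, keeping the intermediate values.
        z = w * x + b
        a = relu(z)
        loss = (a - target) ** 2
        total_loss += loss

        # Seed the backward pass at the loss.
        g_loss = 1.0

        # Backpropagate: incoming gradient times local derivative.
        g_a = g_loss * 2 * (a - target)
        g_z = g_a * (1.0 if z > 0 else 0.0)
        g_w = g_z * x
        g_b = g_z * 1.0

        # Update (the optimizer's job): step opposite the gradient.
        w -= learning_rate * g_w
        b -= learning_rate * g_b
    epoch_losses.append(total_loss)

print("first epoch loss:", epoch_losses[0])
print("last epoch loss: ", epoch_losses[-1])
print("learned w, b:    ", w, b)
```

With these settings the learned `w` and `b` should end up close to 2 and 1. A real network differs in scale, not in kind: more nodes in the graph, the same four steps.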

That is the whole machine. The chain rule supplies the math, the computational graph supplies the organization, automatic differentiation supplies the labor, and gradient descent turns the resulting gradients into actual learning. In the next and final guide of this rung, we put a full net through this loop on real data and watch the loss curve fall — the moment the network truly teaches itself.