Gradient Descent, Step by Step

A surface made of mistakes

In the previous guide you met the loss function: a single number that says how wrong the model currently is on its training data. Now picture every adjustable parameter of the model — every weight and bias — as a dial you can turn. Sweep all those dials through all their settings and, for each combination, measure the loss. That gives you a landscape: a surface that rises where the model is wrong and dips where it is right. This is the loss landscape, and gradient descent is nothing more than a method for walking downhill on it.

The honest version is harder to picture, and worth saying out loud. A real network has millions or billions of dials, so the landscape is not a hill in three dimensions — it is a surface in a space of millions of dimensions. Nobody can see it. But the *local* rule for walking downhill is the same in a million dimensions as it is on a hillside: feel which way is downhill right where you stand, and take a step that way.

Which way is downhill? The gradient

Standing on a hillside in fog, you cannot see the valley, but you can still feel the slope under your feet. The mathematical version of that feeling is the gradient. For each dial, we compute a partial derivative, which answers one question: if I nudge *this one* dial up a hair and hold the rest still, does the loss go up or down, and how steeply? Collect those answers for every dial into one big arrow and you have the gradient, which points in the direction of *steepest uphill*. To go down, you simply step the opposite way.
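To make the "nudge one dial" idea concrete, here is a minimal sketch that estimates a gradient by finite differences. The two-dial `loss` function and the helper name `numerical_gradient` are made up for illustration; real training code does not compute gradients this way, because it needs two loss evaluations per dial.

```python
def loss(params):
    """Toy loss with two dials; its minimum is at w = 3, b = -1."""
    w, b = params
    return (w - 3.0) ** 2 + 2.0 * (b + 1.0) ** 2


def numerical_gradient(f, params, eps=1e-6):
    """Estimate each partial derivative by nudging one dial at a time."""
    grad = []
    for i in range(len(params)):
        up = list(params)
        down = list(params)
        up[i] += eps      # nudge only dial i up a hair
        down[i] -= eps    # and down a hair, holding the rest still
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


point = [0.0, 0.0]
grad = numerical_gradient(loss, point)
print(grad)  # approximately [-6.0, 4.0]
```

The uphill arrow at this point is roughly (-6, 4), so the downhill direction is (6, -4): increase w and decrease b, which is indeed the way toward the minimum at (3, -1).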

How do we actually get all those partial derivatives without re-running the model a million times, once per dial? That is the job of backpropagation, which you met earlier: it computes the gradient for *every* weight in a single backward sweep, by applying the chain rule layer by layer. Gradient descent is the policy — step against the gradient — and backpropagation is the efficient machinery that hands it the gradient to step against.
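As a rough illustration of that backward sweep, the sketch below works through the chain rule by hand for a deliberately tiny network: one input, one hidden unit with a tanh activation, and one output. The network shape, the squared-error loss, and every variable name here are assumptions chosen for the example, not anything prescribed by the guide.

```python
import math


def forward_backward(w1, w2, x, target):
    # Forward sweep: compute and remember each intermediate value.
    h = w1 * x
    a = math.tanh(h)
    y = w2 * a
    loss = (y - target) ** 2

    # Backward sweep: chain rule, from the loss back toward the weights.
    dloss_dy = 2.0 * (y - target)
    dloss_dw2 = dloss_dy * a              # because y = w2 * a
    dloss_da = dloss_dy * w2
    dloss_dh = dloss_da * (1.0 - a * a)   # derivative of tanh is 1 - tanh^2
    dloss_dw1 = dloss_dh * x              # because h = w1 * x
    return loss, dloss_dw1, dloss_dw2


w1, w2, x, target = 0.5, -1.5, 2.0, 1.0
loss, g1, g2 = forward_backward(w1, w2, x, target)

# Sanity check: compare against the slow nudge-one-dial estimate.
eps = 1e-6
n1 = (forward_backward(w1 + eps, w2, x, target)[0]
      - forward_backward(w1 - eps, w2, x, target)[0]) / (2 * eps)
n2 = (forward_backward(w1, w2 + eps, x, target)[0]
      - forward_backward(w1, w2 - eps, x, target)[0]) / (2 * eps)
print(abs(g1 - n1) < 1e-6, abs(g2 - n2) < 1e-6)
```

Notice that the backward sweep reuses values saved during the forward sweep, which is why one backward pass is enough for all the weights, while the nudging check needs two extra forward passes per weight.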

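# one step of gradient descent
# (params and grad map each parameter's name to a number; backprop stands in
#  for whatever routine computes the gradient)
grad = backprop(loss, params)      # arrow pointing uphill
for name in params:
    params[name] -= learning_rate * grad[name]  # step the opposite way
# repeat until the loss stops dropping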

The whole algorithm in a few lines: compute the uphill arrow, then move every parameter a little in the downhill direction.
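The same loop can be run end to end on a toy problem. This is a minimal sketch, assuming a made-up two-parameter loss whose gradient is simple enough to write by hand, so no backpropagation is needed; the function names and the default settings are illustrative only.

```python
def loss(params):
    """Toy loss with its minimum at w = 3, b = -1."""
    w, b = params
    return (w - 3.0) ** 2 + 2.0 * (b + 1.0) ** 2


def gradient(params):
    """Hand-derived gradient of the toy loss above."""
    w, b = params
    return [2.0 * (w - 3.0), 4.0 * (b + 1.0)]


def gradient_descent(params, learning_rate=0.1, steps=100):
    params = list(params)
    for _ in range(steps):
        grad = gradient(params)                  # arrow pointing uphill
        params = [p - learning_rate * g          # step the opposite way
                  for p, g in zip(params, grad)]
    return params


final = gradient_descent([0.0, 0.0])
print(final, loss(final))  # parameters close to [3.0, -1.0], loss close to 0
```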

How big a step? The learning rate

The gradient tells you which way to step, but not how far. That distance is set by a single, hugely important knob: the learning rate. It is a hyperparameter — a setting *you* choose before training, not something the model learns. Multiply the gradient by the learning rate, and that product is the size of your step. Get it right and the descent is brisk and smooth. Get it wrong and the whole thing fails in one of two opposite ways.

Too small, and you inch down the hill so timidly that training takes forever, and may stall on a flat shelf long before reaching the bottom. Too large, and you overshoot: you leap clean across the valley to a point *higher* than where you started, then overshoot back the other way, bouncing wider and wider until the loss diverges (in practice it overflows to infinity or NaN). The sweet spot is a step bold enough to make real progress but gentle enough not to jump the valley. Finding it is mostly trial and error, which is why people watch a learning curve, a plot of loss against training steps, to tell at a glance which failure they are in.
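Both failure modes show up even on the simplest possible valley. The sketch below runs gradient descent on the toy loss w squared with three learning rates; the specific values 0.001, 0.1 and 1.1 are picked only to make the contrast obvious and mean nothing outside this example.

```python
def run(learning_rate, steps=20, start=1.0):
    """Gradient descent on loss(w) = w**2, whose gradient is 2*w."""
    w = start
    history = [w * w]                      # loss before any step
    for _ in range(steps):
        w = w - learning_rate * 2.0 * w
        history.append(w * w)
    return history


for lr in (0.001, 0.1, 1.1):
    losses = run(lr)
    print(f"lr={lr}: loss after 20 steps = {losses[-1]:.3g}")
```

With the smallest rate the loss has barely moved from its starting value of 1; with the middle rate it has fallen close to zero; with the largest rate every step lands higher than the last and the loss grows past 1000.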

Walking on a sample, not the whole map

There is a practical catch. To compute the *true* gradient you would measure the loss over the entire dataset before taking a single step. With millions of examples, that is one painfully slow step. So in practice we cheat — productively. Stochastic gradient descent estimates the slope from just a few examples at a time, a mini-batch, and steps on that estimate. Each step is a little noisy, like reading the slope through fog, but you get thousands of steps in the time one true step would take.
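To see how close a mini-batch estimate gets, the sketch below compares the full-data gradient with the gradient from 32 random examples. Everything here is an assumption for illustration: a synthetic dataset where y is roughly 2x, a one-weight model, and a mean-squared-error loss.

```python
import random

random.seed(0)
# Hypothetical toy dataset: y is roughly 2 * x plus a little noise.
xs = [random.uniform(-1.0, 1.0) for _ in range(10_000)]
data = [(x, 2.0 * x + random.gauss(0.0, 0.1)) for x in xs]


def mse_gradient(w, examples):
    """Gradient of mean squared error for the model y_hat = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in examples) / len(examples)


w = 0.0
full = mse_gradient(w, data)                      # uses all 10,000 examples
batch = mse_gradient(w, random.sample(data, 32))  # uses only 32 of them
print(full, batch)
```

The two numbers differ, but they share a sign and a rough size, so a step taken on the cheap estimate still heads mostly downhill while touching only a small fraction of the data.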

One full pass through all the training data is called an epoch; training usually runs for many epochs, taking many noisy mini-batch steps in each. Surprisingly, the noise is often a *feature*, not a bug: the random jitter can knock the walker out of shallow dips it would otherwise get stuck in. Modern optimizers refine this further — momentum lets steps build up speed in a consistent downhill direction, and the Adam optimizer adapts the step size per dial automatically — but every one of them is still, at heart, this same downhill walk.
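Put together, epochs, mini-batches and momentum might look like the sketch below. It assumes the classical momentum update (keep a running velocity, then move by it), a noise-free toy dataset with y = 2x, and hyperparameter values chosen only for this example; library optimizers differ in their details.

```python
import random

random.seed(0)
# Hypothetical toy dataset: y = 2 * x exactly.
data = [(x, 2.0 * x) for x in (random.uniform(-1.0, 1.0) for _ in range(1000))]


def batch_gradient(w, batch):
    """Gradient of mean squared error for y_hat = w * x on one mini-batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def train(epochs=10, batch_size=32, learning_rate=0.1, momentum=0.9):
    w, velocity = 0.0, 0.0
    for _ in range(epochs):                  # one epoch = one full pass
        random.shuffle(data)                 # new batch order each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            grad = batch_gradient(w, batch)  # noisy slope estimate
            velocity = momentum * velocity - learning_rate * grad
            w += velocity                    # steps build up speed
    return w


w = train()
print(w)  # close to 2.0
```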

When does it stop — and where?

As you near a valley floor the ground flattens, so the gradient shrinks, so your steps naturally shorten. When the loss stops dropping in any meaningful way, we say the training has reached convergence. In practice nobody waits for the gradient to hit exactly zero; you stop when progress has stalled, or earlier — early stopping halts the moment performance on held-out data quits improving, which also guards against overfitting (the subject of a later guide in this rung).
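One common way to implement early stopping is with a "patience" counter, sketched below. The patience setting is an assumption added here (the rule described above corresponds roughly to a patience of 1), and the validation losses are made-up numbers.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch at which training would stop.

    Training stops once the held-out loss has failed to improve on its
    best value for `patience` epochs in a row.
    """
    best = float("inf")
    bad_epochs = 0
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best:
            best = val_loss
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: ran to the end


# Made-up held-out losses: improving at first, then creeping back up.
history = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.60]
stop_at = early_stopping_epoch(history)
print(stop_at)  # 6
```

A little patience keeps one unlucky noisy epoch from ending training too soon; in practice you would also keep a copy of the parameters from the best epoch (index 3 here) rather than the last one.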

But *where* does it stop? Here is the honest catch that textbooks once dwelt on. A downhill walk can only find a low point *near where it started*, a local minimum, and there is no guarantee that valley is the deepest one on the whole landscape (the global minimum). Worse, training can crawl to a near-halt on a saddle point: a spot that slopes down in some directions but up in others, like the seat of a saddle, where the ground is nearly flat, the gradient is close to zero, and the walker dawdles. Together, these two hazards are known as the problem of local minima and saddle points.
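The dawdling at a saddle is easy to reproduce. The sketch below uses the textbook saddle shape f(x, y) = x squared minus y squared, which is an assumption for illustration only (it has no bottom at all, so it is not a realistic loss), and starts the walker almost exactly on the ridge line.

```python
import math


def gradient(x, y):
    """Gradient of the saddle-shaped toy surface f(x, y) = x**2 - y**2."""
    return 2.0 * x, -2.0 * y


def descend(x, y, learning_rate=0.1, steps=60):
    norms = []
    for _ in range(steps):
        gx, gy = gradient(x, y)
        norms.append(math.hypot(gx, gy))  # how steep the ground feels
        x -= learning_rate * gx
        y -= learning_rate * gy
    return x, y, norms


# Start on the uphill x-direction, with only a tiny offset in y.
x, y, norms = descend(1.0, 1e-6)
print(norms[0], min(norms), norms[-1])
```

The slope starts at 2, shrinks to a tiny value as the walker slides toward the saddle, and only then grows again as the small y offset finally carries it off down the other direction. During the flat stretch, each step moves the parameters hardly at all.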

Now the genuinely surprising part, and a place where intuition from low dimensions misleads. In the million-dimensional landscapes of big networks, being trapped in a *bad* local minimum turns out to be rare: for a point to be a true minimum, the ground must curve upward in *every single* direction at once, which is wildly unlikely when there are millions of directions. Most flat spots are saddles, which momentum can roll through, and most reachable valleys are about equally good. This is why gradient descent works so much better in practice than the worried theory once predicted — not because the dangers are gone, but because high-dimensional geometry is kinder than a 3-D hill suggests.
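A back-of-envelope calculation shows why "every direction at once" is so demanding. The sketch below rests on a deliberately crude assumption that is not from the guide: at a flat spot, each direction independently curves upward with probability one half. Real loss surfaces do not behave this simply (curvatures are correlated, and flat spots with low loss tend to have more upward directions), so treat the numbers as intuition only.

```python
def chance_all_curve_up(num_directions, p_up=0.5):
    """Toy model: each direction independently curves up with chance p_up."""
    return p_up ** num_directions


for n in (2, 10, 100, 1000):
    print(n, chance_all_curve_up(n))
```

With 2 directions the chance is one in four; by 100 directions it is already smaller than one in a billion billion billion, and a real network has far more directions than that.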

What gradient descent really promises

Strip away the mystique and the method is humble: it is a hill-climber run in reverse, with no map and no memory of the wider terrain, taking small steps in whatever direction feels most downhill right now. It cannot guarantee the best possible model, only a locally good one, and it needs the loss surface to be differentiable (at least almost everywhere), so you can read a slope wherever you stand. That is a real limit, but a famously workable one. Almost every modern model, from linear regression to a large language model, can be trained by some flavour of this same downhill walk.

So when someone says a model is "learning," you can now picture exactly what is happening underneath the word. There is no understanding, no insight — just a number called loss, a slope called the gradient, and millions of tiny steps downhill. The wonder is not that each step is clever; it is that so many dumb, careful steps, taken in the right direction, add up to something that can translate languages and recognise faces. The next guides in this rung open up the optimizers that steer those steps, and the overfitting that punishes a walk taken too far.