Momentum, Adam & Learning Rates

Why Plain Gradient Descent Isn't Enough

By now you know the core move: stochastic gradient descent reads a mini-batch, computes the gradient of the loss, and nudges every parameter a small step downhill. Repeat millions of times and, in principle, you reach a good solution. In practice, plain SGD is like walking down a foggy mountain by always stepping in the locally steepest direction — and that simple rule has two annoying failure modes that waste enormous amounts of training time.

First, ravines. Many loss surfaces are shaped like a long, narrow valley: steep across the width, nearly flat along the length. The steepest direction points across the valley, not down it, so SGD bounces wall-to-wall and creeps forward agonizingly slowly. Second, noise: because each step uses a different mini-batch, the gradient jitters, and the path zig-zags even on simple slopes. Both problems share a cure — somehow average the recent steps so the consistent direction reinforces and the random jitter cancels out.

Momentum: Giving the Ball Some Weight

Momentum borrows a metaphor from physics. Instead of a weightless point that teleports along each gradient, imagine a heavy ball rolling downhill. The ball keeps a velocity — a running memory of where it has been heading — and the gradient only nudges that velocity rather than dictating the whole step. Consistent downhill pushes accumulate into real speed, while side-to-side jitters in a ravine largely cancel because they keep flipping sign.

# v starts at zero. beta is the momentum (e.g. 0.9)
v = beta * v + (1 - beta) * grad   # update the running velocity
param = param - lr * v             # step along velocity, not raw gradient

Momentum keeps an exponential moving average of gradients. With beta = 0.9 the average has an effective window of about 1 / (1 - beta) = 10 steps, so each update reflects roughly the last ten gradients.
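To make the ravine picture concrete, here is a minimal sketch on a toy loss, 0.5 * (x**2 + 100 * y**2), which is steep in y and shallow in x. The loss, the starting point, the learning rates, the step counts, and the helper names `run_plain` and `run_momentum` are all assumptions chosen for illustration; only the update rule comes from the snippet above.

```python
def grad(x, y):
    # Gradient of the toy ravine loss 0.5 * (x**2 + 100 * y**2).
    return x, 100.0 * y

def loss(x, y):
    return 0.5 * (x * x + 100.0 * y * y)

def run_plain(lr, steps):
    # Plain gradient descent: step along the raw gradient.
    x, y = 1.0, 1.0
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
    return loss(x, y)

def run_momentum(lr, steps, beta=0.9):
    # Same update as the snippet above, applied to each coordinate.
    x, y = 1.0, 1.0
    vx, vy = 0.0, 0.0
    for _ in range(steps):
        gx, gy = grad(x, y)
        vx = beta * vx + (1 - beta) * gx
        vy = beta * vy + (1 - beta) * gy
        x, y = x - lr * vx, y - lr * vy
    return loss(x, y)

plain_loss = run_plain(lr=0.018, steps=200)       # near the largest stable rate
momentum_loss = run_momentum(lr=0.18, steps=200)  # ten times larger rate
plain_diverged = run_plain(lr=0.18, steps=20)     # plain descent at that rate
print(plain_loss, momentum_loss, plain_diverged)
```

In this setup plain gradient descent is only stable below a learning rate of 0.02, because the steep y direction overshoots otherwise. The averaged velocity damps that cross-valley bounce, so the momentum run tolerates a ten-times-larger rate and ends with a far lower loss in the same number of steps.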

The single knob `beta` (often 0.9) controls how much past matters: higher means a heavier ball with more inertia and smoother motion, but also more overshoot — it can sail past the bottom of a valley before turning around. That overshoot is usually a feature, not a bug, because it helps the ball coast across small bumps and shallow local dips that would trap a timid optimizer. A common refinement called Nesterov momentum peeks one step ahead before committing, which damps the overshoot a little.
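The overshoot and the Nesterov look-ahead can be sketched on a one-dimensional bowl, 0.5 * x**2. Note the assumptions: this uses the classical velocity form `v = beta * v - lr * grad` (the averaged form with the `(1 - beta)` factor folded into the learning rate), and the function names, learning rate, and step count are illustrative rather than taken from any particular library.

```python
def grad(x):
    # Gradient of the toy loss 0.5 * x**2, whose minimum is at x = 0.
    return x

def heavy_ball(lr=0.1, beta=0.9, steps=100):
    x, v, path = 1.0, 0.0, [1.0]
    for _ in range(steps):
        v = beta * v - lr * grad(x)             # gradient at the current point
        x = x + v
        path.append(x)
    return path

def nesterov(lr=0.1, beta=0.9, steps=100):
    x, v, path = 1.0, 0.0, [1.0]
    for _ in range(steps):
        v = beta * v - lr * grad(x + beta * v)  # gradient at the look-ahead point
        x = x + v
        path.append(x)
    return path

heavy_path = heavy_ball()
nesterov_path = nesterov()

# How far each run swings past the minimum at zero before turning back.
heavy_overshoot = -min(heavy_path)
nesterov_overshoot = -min(nesterov_path)
print(heavy_overshoot, nesterov_overshoot)
```

With these settings both runs swing past the minimum, but the look-ahead version swings less far before it turns around.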

Adaptive Optimizers: A Different Step Size Per Parameter

Momentum fixes the *direction* problem. Adaptive optimizers attack a different one: a single global step size is wrong for a model where some parameters see huge gradients and others see tiny ones. AdaGrad was the first popular answer. It keeps a running sum of each parameter's squared gradients and divides that parameter's step by the square root of the sum, so weights that keep seeing large gradients get small, careful steps, and weights that are rarely updated get comparatively large ones. That is wonderful for sparse data, but the sum only grows, so every step size shrinks toward zero and learning eventually stalls.
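A minimal single-parameter sketch of the AdaGrad rule described above. The function name `adagrad_step`, the default `lr` and `eps` values, and the artificial constant gradient of 1.0 are assumptions made to isolate the shrinking step size.

```python
import math

def adagrad_step(param, grad, accum, lr=0.1, eps=1e-8):
    accum = accum + grad * grad                  # running SUM of squared gradients
    step = lr * grad / (math.sqrt(accum) + eps)  # step scaled by the root of the sum
    return param - step, accum, step

param, accum = 0.0, 0.0
adagrad_steps = []
for _ in range(5):
    # Feed the same gradient every time so only the accumulator changes.
    param, accum, step = adagrad_step(param, grad=1.0, accum=accum)
    adagrad_steps.append(step)

print(adagrad_steps)  # each step is smaller than the one before
```

Even though the gradient never changes, the step falls off like 1 / sqrt(t), which is the stall the paragraph above describes.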

RMSprop fixes the stall with a tiny twist: instead of summing all past squared gradients forever, it keeps an *exponential moving average* of them. Old gradients fade away, so the per-parameter step size can grow or shrink as training moves into new terrain, and it no longer decays toward zero just because training has run for a long time. RMSprop is essentially AdaGrad with a short, sliding memory, and it remains a solid choice, especially for recurrent networks.
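The same idea as a single-parameter sketch. The name `rmsprop_step`, the defaults (`lr=0.01`, `rho=0.9`, `eps=1e-8`), and the two artificial gradient phases are assumptions for illustration, not the settings of any specific library.

```python
import math

def rmsprop_step(param, grad, avg_sq, lr=0.01, rho=0.9, eps=1e-8):
    # Exponential moving AVERAGE of squared gradients (old values fade out).
    avg_sq = rho * avg_sq + (1 - rho) * grad * grad
    step = lr * grad / (math.sqrt(avg_sq) + eps)
    return param - step, avg_sq, step

param, avg_sq = 0.0, 0.0

# Phase 1: gradients of size 1.0.
for _ in range(200):
    param, avg_sq, step_large = rmsprop_step(param, 1.0, avg_sq)

# Phase 2: "new terrain" where gradients are 100x smaller.
for _ in range(200):
    param, avg_sq, step_small = rmsprop_step(param, 0.01, avg_sq)

print(step_large, step_small)  # both settle close to lr
```

Because the average forgets the old, larger gradients, the step settles near `lr` in both phases instead of shrinking indefinitely.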

Adam — short for *Adaptive Moment Estimation* — is the workhorse you will see in almost every modern recipe. Its trick is simply to combine the two ideas above: it keeps momentum's moving average of the gradient (the "first moment", giving direction and smoothing) *and* RMSprop's moving average of squared gradients (the "second moment", giving per-parameter scaling). A small bias-correction term keeps the early steps honest before the averages have warmed up. The result is an optimizer that mostly just works out of the box, which is exactly why it took over.
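Put together, one Adam update for a single parameter might look like the sketch below. The function name `adam_step` and its argument layout are assumptions; the update follows the description above, with `t` as a step counter that starts at 1.

```python
import math

def adam_step(param, grad, m, v, t, lr=3e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # first moment: average gradient
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment: average squared gradient
    m_hat = m / (1 - beta1 ** t)                # bias correction: both averages
    v_hat = v / (1 - beta2 ** t)                # start at zero and read too small early on
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# First step from the same starting point with two very different gradient scales.
after_big, _, _ = adam_step(1.0, grad=5.0, m=0.0, v=0.0, t=1)
after_small, _, _ = adam_step(1.0, grad=0.05, m=0.0, v=0.0, t=1)
print(1.0 - after_big, 1.0 - after_small)  # both moves are close to lr
```

The bias correction is what makes the very first step come out at roughly `lr` regardless of gradient scale; without it, the zero-initialized averages would distort the early updates.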

Learning-Rate Schedules and Warmup

Even a great optimizer needs the right step size, and the best step size changes over training. The learning rate is the single most important number you set — too large and the loss explodes or oscillates, too small and training crawls. The deep insight is that no *constant* value is ideal: early on you want big strides to cover ground, but near a minimum you want tiny, careful steps so you don't bounce around it forever. A learning-rate schedule simply changes the learning rate as training proceeds.

The most common shapes are *step decay* (cut the rate by 10x at a few milestones), *cosine decay* (smoothly glide the rate down to near zero following a cosine curve), and *exponential decay* (multiply the rate by a fixed factor slightly below 1 at every step or epoch). Cosine decay is the modern favorite for large models; its gentle taper avoids abrupt drops and is widely reported to land in flatter, better-generalizing regions, though that is an empirical observation rather than a guarantee. Whichever you pick, the principle is the same: start bold, end gentle.
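The three shapes can be written as small functions of the step index. The function names, argument names, and default values here are assumptions for illustration; frameworks ship their own scheduler classes with different interfaces.

```python
import math

def step_decay(base_lr, step, milestones, factor=0.1):
    # Multiply by `factor` once for every milestone already passed.
    drops = sum(1 for m in milestones if step >= m)
    return base_lr * factor ** drops

def cosine_decay(base_lr, step, total_steps, min_lr=0.0):
    # Half a cosine wave from base_lr (step 0) down to min_lr (total_steps).
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

def exponential_decay(base_lr, step, gamma=0.999):
    # Multiply by gamma at every step.
    return base_lr * gamma ** step

for s in (0, 30, 60, 100):
    print(s,
          step_decay(0.1, s, milestones=[30, 60]),
          cosine_decay(0.1, s, total_steps=100),
          exponential_decay(0.1, s))
```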

Warmup is the counterintuitive opposite at the very start: for the first few hundred or few thousand steps, you *ramp the learning rate up* from near zero to its peak, and only then begin the decay. Why? At initialization the model is random and Adam's second-moment (variance) estimates rest on only a handful of gradients, so they are unreliable; a full-size step can blow the weights apart and the loss may never recover. Warmup lets the optimizer's running averages stabilize before you trust them with big steps. For large transformers trained with Adam, warmup-then-cosine-decay is close to a universal default.
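A sketch of the combined schedule, assuming a linear ramp followed by cosine decay. The function name `warmup_cosine`, the linear ramp shape, and the example numbers (1000 steps, 50 warmup steps, peak 3e-4) are illustrative assumptions.

```python
import math

def warmup_cosine(step, total_steps, peak_lr, warmup_steps, min_lr=0.0):
    if step < warmup_steps:
        # Linear ramp: reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

schedule = [warmup_cosine(s, total_steps=1000, peak_lr=3e-4, warmup_steps=50)
            for s in range(1000)]
print(schedule[0], schedule[49], schedule[500], schedule[-1])
```

In a training loop you would compute this value once per step and assign it to the optimizer's learning rate before the update.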

Practical Tuning: What to Actually Do

Theory aside, tuning is mostly disciplined trial. The good news is that the search is far smaller than it looks: get the learning rate roughly right and most other knobs barely matter. Here is a reliable order of operations that wastes the least compute.

1. Start with a known-good default: Adam (or AdamW) with learning rate 3e-4, betas 0.9/0.999. This alone trains a huge fraction of models acceptably.
2. Find the learning rate first and alone. Sweep over powers of ten (say 1e-2, 1e-3, 1e-4) for a short run each. Pick the largest rate at which the loss still falls smoothly without spiking.
3. Watch the loss curve, not just the final number. A loss that explodes or NaNs means the rate is too high; a loss that descends in a near-flat line means it is too low.
4. Add a schedule once the peak rate is set: warmup for the first ~2–5% of steps, then cosine decay to near zero. This usually buys a free accuracy bump.
5. Only then touch batch size, weight decay, or betas, and remember the rough rule that a larger batch lets you raise the learning rate.
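The learning-rate sweep from the steps above can be sketched on a toy problem. Everything here is a stand-in: `short_run` replaces a real short training run with plain gradient descent on the loss 0.5 * (x**2 + 100 * y**2), and `falls_smoothly` is a hypothetical check that is stricter than you would want on noisy real loss curves, where you would smooth the curve first.

```python
import math

def short_run(lr, steps=50):
    # Stand-in for a brief training run; returns the loss history.
    x, y = 1.0, 1.0
    history = [0.5 * (x * x + 100.0 * y * y)]
    for _ in range(steps):
        x, y = x - lr * x, y - lr * 100.0 * y
        history.append(0.5 * (x * x + 100.0 * y * y))
    return history

def falls_smoothly(history):
    # Accept a run only if the loss stays finite, never rises, and ends lower.
    finite = all(math.isfinite(value) for value in history)
    no_spikes = all(b <= a for a, b in zip(history, history[1:]))
    return finite and no_spikes and history[-1] < history[0]

candidates = [1e-1, 1e-2, 1e-3, 1e-4]
good = [lr for lr in candidates if falls_smoothly(short_run(lr))]
best_lr = max(good)  # largest rate whose loss still falls smoothly
print(best_lr)
```

On this toy loss the largest candidate makes the loss blow up, so the sweep settles on the next power of ten down.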

A few honest warnings. The famous "3e-4 is the best learning rate for Adam" is a meme, not a theorem — it is a fine *starting* point, nothing more. Beware that a learning rate which looks perfect for a tiny test run is often too high once you scale up the model or data. And convergence of the training loss does not mean you are done: a model can converge beautifully to something that overfits, which is exactly the fight the next guide in this rung takes up.