SGD, Mini-batches & Epochs

The problem with reading everything first

In the last guide you learned plain gradient descent: stand somewhere on the loss surface, measure which way is downhill, take a step, repeat. But there was a quiet assumption hiding in 'measure which way is downhill'. To compute the true gradient of the loss, you have to evaluate the model on every training example, add up all the errors, and only then move. That is called full-batch gradient descent.

With a tidy textbook dataset of a few hundred rows, that is fine. With a real one — a million images, billions of words — it is a disaster. You would burn through the entire dataset just to take one step downhill, and a model needs thousands of steps. Worse, the whole dataset often will not even fit in memory at once. The honest verdict: full-batch descent is mathematically clean and practically unusable at scale.

Here is the key escape. The true gradient is just an average of the per-example gradients, and to estimate an average you do not need every member: you can poll a sample. Ask a hundred random examples 'which way is downhill?' and their average answer points roughly the same direction as the full crowd's. On a million-example dataset, that estimate costs about ten thousand times less to compute. That single idea is what makes training large models practical.
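To make the polling idea concrete, here is a minimal sketch on a made-up problem. The toy dataset (y = 3x plus a little noise), the one-weight model, and the helper names `per_example_gradient` and `mean_gradient` are all assumptions for illustration, not part of any library. With a single weight, 'direction' just means the sign of the gradient.

```python
import random

def per_example_gradient(w, x, y):
    """Gradient of the squared error 0.5 * (w*x - y)**2 with respect to w."""
    return (w * x - y) * x

def mean_gradient(w, examples):
    """Average of the per-example gradients over the given examples."""
    return sum(per_example_gradient(w, x, y) for x, y in examples) / len(examples)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Toy dataset (illustrative assumption): y = 3x plus a little noise.
data = []
for _ in range(100_000):
    x = rng.uniform(-1.0, 1.0)
    data.append((x, 3.0 * x + rng.gauss(0.0, 0.1)))

w = 0.0  # current guess for the single weight
full = mean_gradient(w, data)                     # touches all 100,000 examples
sample = mean_gradient(w, rng.sample(data, 100))  # touches only 100 of them

print(f"full-batch gradient:  {full:.3f}")
print(f"100-example estimate: {sample:.3f}")
```

Both numbers should come out negative (downhill means increasing w here) and reasonably close, even though the second one looked at a thousandth of the data.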

Stochastic gradient descent: step often, see little

The most extreme version of this sampling idea is [[stochastic-gradient-descent|stochastic gradient descent]] (SGD). Instead of one giant averaged step over the whole dataset, you grab a single random example, compute its gradient, and step immediately. Then grab another, step again. The word 'stochastic' just means 'random' — each step is based on one randomly chosen data point rather than the full picture.

Think of the full-batch gradient as a careful survey of the entire population before you move, and SGD as asking one random passer-by and immediately acting on their answer. Any single answer can be wrong, even point uphill. But over many quick steps the errors largely cancel, and you make far more progress per minute because each step is so cheap. You trade the quality of each step for the sheer number of steps.

The path SGD traces is not a smooth glide down a valley — it zig-zags, wobbles, sometimes briefly climbs. That jitter is not a bug to be eliminated; as we will see, it can actually help. The learning rate you met before still controls step size, and it matters even more here: with such noisy directions, too large a rate makes the wobble explode.
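The wobble is easy to see on a toy problem. The sketch below runs one-example-at-a-time SGD on an invented dataset (y = 3x plus noise) and counts how many steps actually moved the weight away from the true slope. The function name `sgd_single_example` and all the numbers are assumptions chosen for illustration.

```python
import random

def sgd_single_example(data, learning_rate, steps, seed=0):
    """Plain SGD on the loss 0.5 * (w*x - y)**2, one random example per step."""
    rng = random.Random(seed)
    w = 0.0
    history = []
    for _ in range(steps):
        x, y = rng.choice(data)   # ask one random passer-by
        g = (w * x - y) * x       # gradient from that single example
        w -= learning_rate * g    # step immediately
        history.append(w)
    return w, history

# Toy dataset (illustrative assumption): true slope 3.0, fairly noisy labels.
data_rng = random.Random(1)
data = []
for _ in range(1000):
    x = data_rng.uniform(-1.0, 1.0)
    data.append((x, 3.0 * x + data_rng.gauss(0.0, 0.5)))

w, history = sgd_single_example(data, learning_rate=0.1, steps=5000)

# Count steps that moved w farther from the true slope of 3.0.
wrong_way = sum(
    1
    for before, after in zip([0.0] + history, history)
    if abs(after - 3.0) > abs(before - 3.0)
)
print(f"final w = {w:.2f}; steps that moved away from 3.0: {wrong_way} of {len(history)}")
```

Plenty of individual steps go the wrong way, yet the weight still ends up hovering near 3. It never settles exactly, because with a fixed learning rate the noise keeps jostling it.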

The mini-batch: the sweet spot in the middle

Pure one-at-a-time SGD has its own problem: it is wasteful on modern hardware. A GPU is built to do thousands of multiplications in parallel, and feeding it a single example at a time leaves almost all of that power idle. So in practice a batch of exactly one is rare. Instead we compromise: process a small group of examples together. That group is a [[mini-batch|mini-batch]].

You compute the gradient averaged over, say, 32 or 256 examples, then take one step. This is the version almost everyone actually means when they say 'SGD' today. It sits between the two extremes: more stable than single-example SGD, because averaging over 32 independent examples shrinks the typical gradient noise by a factor of about √32 ≈ 5.7, yet still vastly cheaper and faster than full-batch. The whole game is choosing where on that dial to sit.
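You can measure how much averaging calms the gradient down. The sketch below draws many random batches of each size from an invented dataset and reports the spread (standard deviation) of the resulting gradient estimates. The helper names `batch_gradient` and `gradient_spread` and the toy data are assumptions for illustration. For independent examples, the spread should shrink roughly with the square root of the batch size.

```python
import random
import statistics

def batch_gradient(w, batch):
    """Gradient of 0.5 * (w*x - y)**2, averaged over one mini-batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

def gradient_spread(data, batch_size, trials, rng):
    """Standard deviation of the mini-batch gradient across many random batches."""
    estimates = [
        batch_gradient(0.0, rng.sample(data, batch_size)) for _ in range(trials)
    ]
    return statistics.stdev(estimates)

rng = random.Random(2)  # fixed seed so the run is repeatable

# Toy dataset (illustrative assumption): y = 3x plus noise.
data = []
for _ in range(10_000):
    x = rng.uniform(-1.0, 1.0)
    data.append((x, 3.0 * x + rng.gauss(0.0, 0.5)))

spreads = {b: gradient_spread(data, b, trials=1000, rng=rng) for b in (1, 32, 256)}
for b, s in spreads.items():
    print(f"batch size {b:>3}: spread of gradient estimate = {s:.3f}")
```

Going from 1 to 32 examples should cut the spread by something close to √32, and going from 32 to 256 by something close to √8. Each extra example buys less noise reduction than the one before it.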

Epoch vs iteration: counting the work

Two words constantly trip people up, so let us nail them down. An [[epoch-vs-iteration|iteration]] (also called a step or an update) is one mini-batch processed and one step taken. An epoch is one full sweep through the entire training set — every example seen exactly once. They count different things: iterations count steps, epochs count passes over the data.

The link between them is just division. If you have 10,000 training examples and a batch size of 100, then one epoch is 10,000 / 100 = 100 iterations. Train for 20 epochs and you have taken 2,000 steps total. Notice what this means: shrinking the batch size gives you more steps per epoch — more chances to learn from the same data — which is part of why small batches often learn faster per epoch.
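The arithmetic above fits in a few lines. This is a small sketch with hypothetical helper names. The `drop_last` flag is also an assumption: it covers the case where the batch size does not divide the dataset evenly and you must decide whether to keep the short final batch.

```python
def iterations_per_epoch(dataset_size, batch_size, drop_last=False):
    """Number of update steps in one full pass over the data."""
    if drop_last:
        return dataset_size // batch_size              # discard the short final batch
    return (dataset_size + batch_size - 1) // batch_size  # round up to keep it

def total_iterations(dataset_size, batch_size, num_epochs, drop_last=False):
    """Number of update steps over the whole training run."""
    return num_epochs * iterations_per_epoch(dataset_size, batch_size, drop_last)

print(iterations_per_epoch(10_000, 100))   # → 100
print(total_iterations(10_000, 100, 20))   # → 2000
print(iterations_per_epoch(10_000, 32))    # → 313 (the last batch has only 16 examples)
```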

for epoch in range(num_epochs):
    shuffle(training_data)                      # reshuffle at the start of each epoch
    for batch in split(training_data, batch_size):
        g = average_gradient(loss, batch)       # gradient averaged over one mini-batch
        weights = weights - learning_rate * g   # one iteration (one update step)
# iterations_per_epoch = ceil(dataset_size / batch_size)

The core training loop. The outer loop counts epochs; each inner step is one iteration. Reshuffling every epoch keeps batches fresh and the noise unbiased.
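For readers who want to run the loop, here is one possible concrete version in plain Python. It is a sketch under stated assumptions: the model is a single weight fitted to invented data (y = 3x plus noise), the loss is squared error, and `train_minibatch_sgd` is a hypothetical name. A real project would use a framework's data loader and automatic differentiation instead.

```python
import random

def train_minibatch_sgd(data, batch_size, learning_rate, num_epochs, seed=0):
    """Mini-batch SGD for a one-weight model w*x with loss 0.5 * (w*x - y)**2."""
    rng = random.Random(seed)
    data = list(data)  # copy so the caller's list is not reordered
    w = 0.0
    iterations = 0
    for epoch in range(num_epochs):
        rng.shuffle(data)  # reshuffle at the start of every epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient averaged over this mini-batch.
            g = sum((w * x - y) * x for x, y in batch) / len(batch)
            w -= learning_rate * g  # one iteration
            iterations += 1
    return w, iterations

# Toy dataset (illustrative assumption): true slope 3.0 with small label noise.
data_rng = random.Random(3)
data = []
for _ in range(10_000):
    x = data_rng.uniform(-1.0, 1.0)
    data.append((x, 3.0 * x + data_rng.gauss(0.0, 0.1)))

w, iterations = train_minibatch_sgd(
    data, batch_size=100, learning_rate=0.5, num_epochs=20
)
print(f"learned w = {w:.3f} after {iterations} iterations")
```

With 10,000 examples and a batch size of 100, the run takes 100 iterations per epoch and 2,000 in total, matching the division worked out earlier.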

Why the noise actually helps

It feels backwards that a noisier, sloppier gradient could ever beat a precise one. But the loss surface of a deep network is full of traps — shallow dips and flat saddle regions where a perfectly smooth full-batch descent can get stuck, because at such a point the exact gradient is nearly zero and there is nowhere obvious to go. SGD does not get stuck so easily.

The randomness acts like a gentle, constant nudge. When mini-batch descent lands in a shallow trap, the next noisy gradient is likely to be just wrong enough to kick it back out and let it keep searching for a deeper, better valley. The noise is a built-in source of exploration. Empirically, it also tends to steer training toward 'flat' minima, the wide basins that usually generalize to new data better than narrow, brittle ones.

Be careful, though: this is a tendency, not a guarantee, and the field still argues about exactly why flat minima generalize. Do not over-read 'noise helps' as 'more noise is always better' — too much (a tiny batch with a high learning rate) just makes training thrash and never settle. The noise is a useful seasoning, not the main dish.

The speed–stability tradeoff: choosing a batch size

Now the central dial. A small batch means noisy, jittery gradients, but cheap fast steps and lots of helpful exploration — though it underuses the GPU and may need a smaller learning rate to stay sane. A large batch means smooth, accurate gradients and full hardware utilization, but each step costs more, you take fewer steps, and you risk sliding straight into a sharp minimum that generalizes poorly. Neither end is free.

There is also a hard ceiling that has nothing to do with theory: a mini-batch must fit in GPU memory along with the model and its intermediate activations. That memory wall, not math, is what caps batch size in much of real practice. A common rule of thumb: when you increase the batch size, you can usually increase the learning rate roughly in proportion, because a less noisy gradient lets you step more boldly. Treat that as a starting point to check against your own learning curve, since the proportional rule tends to break down at very large batch sizes.
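The proportional rule of thumb is simple enough to write down. The helper below is a hypothetical sketch of that heuristic only. It is not a guarantee, and the base values in the example are invented.

```python
def scaled_learning_rate(base_lr, base_batch_size, new_batch_size):
    """Heuristic: scale the learning rate in proportion to the batch size."""
    return base_lr * new_batch_size / base_batch_size

# Example (invented numbers): a rate of 0.1 tuned at batch size 256,
# carried over to batch size 1024.
print(scaled_learning_rate(0.1, 256, 1024))  # → 0.4
```

Use the result as a first guess and then watch the learning curve. If training thrashes, the scaled rate is too bold for your problem.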

So how do you choose? Honestly, batch size is a hyperparameter: there is no universally correct value, and values from 32 to 512 are common starting points. Pick something your hardware likes, watch the learning curve, and adjust. And remember the boundary of this guide: plain SGD only gives you a noisy downhill *direction* and a single fixed learning rate. The clever modern tricks for *how big a step to take, and when* (momentum, Adam, schedules) are the subject of the next guide, where this engine really comes alive.