Hardware: GPUs, TPUs & Scaling

Why not just a CPU?

By now you know that a neural network is, underneath, a long chain of matrix multiplications glued together by activation functions. Both the forward pass and backpropagation are, in arithmetic terms, the same handful of operations repeated billions of times on big grids of numbers. A regular CPU is a brilliant generalist: a few very fast cores that can do almost anything, one complicated thing after another. That is exactly the wrong shape for this job. We don't need cleverness per step; we need to do *one simple step* — multiply and add — an absurd number of times, all at once.

A GPU (graphics processing unit) was born to paint millions of pixels at once, so it is built the opposite way: thousands of small, simple cores running the *same* instruction over different data in lockstep. That happens to be exactly what a matrix multiply is. A TPU (tensor processing unit) goes further: it is a chip designed by Google specifically for the tensor math of deep learning, with circuitry that streams numbers through a grid of multiply-add units so that intermediate results rarely have to make slow trips to memory. Both are accelerators, and the headline reason they matter is parallelism: not faster steps, but vastly more steps happening in the same instant.
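To make "the same instruction over different data" concrete, here is a minimal sketch of a matrix multiply in plain Python. The function name `matmul` and the tiny 2x2 inputs are illustrative choices, not taken from any library. The point is that each output cell is an independent chain of multiply-adds, which is what a parallel chip can spread across its cores.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            # Cell (i, j) depends only on row i of a and column j of b,
            # so all rows * cols cells could be computed at the same time.
            acc = 0.0
            for k in range(inner):
                acc += a[i][k] * b[k][j]  # one multiply-add
            out[i][j] = acc
    return out


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
product = matmul(a, b)

# Total multiply-adds: one per (row, column, inner index) triple.
multiply_adds = len(a) * len(b[0]) * len(b)
print(product, multiply_adds)
```

A CPU walks these loops a few cells at a time; an accelerator is built to run very many of the inner multiply-adds simultaneously.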

When one chip is not enough

A single modern accelerator is extraordinary, yet a foundation model or a large language model outgrows it in two separate ways, and it is worth keeping them apart. The first is time: even a fast chip would take years to grind through a training run on trillions of tokens. The second is memory: the model's parameters, plus the gradients and optimizer state needed to update them, simply do not fit in one chip's memory. Either problem forces you onto distributed training — splitting the work across many chips, often many machines, wired together with high-speed links.
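The memory problem is easy to estimate on the back of an envelope. The sketch below assumes 32-bit weights and gradients plus two 32-bit values of optimizer state per parameter (roughly what an Adam-style optimizer keeps), and it ignores activations entirely. The 7-billion-parameter model and the 80 GB chip are hypothetical figures chosen only for illustration.

```python
def training_memory_gb(num_params, bytes_weight=4, bytes_grad=4, bytes_optimizer=8):
    """Rough memory for weights + gradients + optimizer state, in GB (1e9 bytes).

    Assumed defaults: 32-bit weights and gradients, plus two 32-bit values
    per parameter of optimizer state. Activations are ignored, so real
    usage is higher than this estimate.
    """
    bytes_per_param = bytes_weight + bytes_grad + bytes_optimizer
    return num_params * bytes_per_param / 1e9


model_params = 7_000_000_000  # hypothetical 7-billion-parameter model
chip_memory_gb = 80           # hypothetical accelerator memory

needed_gb = training_memory_gb(model_params)
fits_on_one_chip = needed_gb <= chip_memory_gb
print(f"needs about {needed_gb:.0f} GB, chip has {chip_memory_gb} GB")
print("fits on one chip:", fits_on_one_chip)
```

Under these assumptions the training state alone exceeds the chip's memory before a single activation is stored, which is the second of the two limits described above.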

Here is the catch that defines the whole field: chips that work together must *talk* to each other, and talking is slow compared to computing. Every time two GPUs need to agree on a number, bytes have to travel down a cable, and that cable is far slower than the math inside the chip. So distributed training is never a clean speed-up — adding a second machine never quite doubles your throughput. The entire art is arranging the work so the chips spend their time *computing* and as little as possible *waiting to communicate*. Keep that tension in mind; it explains every design choice that follows.
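A toy model shows why the speed-up is never clean. It assumes, as a deliberate simplification, that the math divides perfectly across chips while a fixed communication cost is paid once per step and cannot be overlapped with compute. The timings are hypothetical.

```python
def step_time(compute_seconds, num_chips, comm_seconds):
    """Time for one training step: compute is divided, communication is not."""
    if num_chips == 1:
        return compute_seconds  # a single chip has nobody to talk to
    return compute_seconds / num_chips + comm_seconds


def speedup(compute_seconds, num_chips, comm_seconds):
    """How many times faster a step is on num_chips than on one chip."""
    single = step_time(compute_seconds, 1, comm_seconds)
    return single / step_time(compute_seconds, num_chips, comm_seconds)


# Hypothetical numbers: 1.0 s of math per step, 0.05 s of synchronization.
for chips in (1, 2, 8, 64):
    print(chips, "chips ->", round(speedup(1.0, chips, 0.05), 2), "x")
```

In this model two chips give less than a 2x speed-up, and no number of chips can beat `compute_seconds / comm_seconds`. Real systems hide part of the communication behind compute, so treat this as intuition rather than a prediction.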

Three ways to split the work

There are three honest answers to "what, exactly, do we split?", and real systems blend all three. They are collected under the name data, model and pipeline parallelism. The simplest is data parallelism: every chip holds a *full copy* of the model but sees a *different slice* of the mini-batch. Each computes gradients on its own slice, then all the chips average their gradients together so every copy stays identical. This is easy to reason about and scales well — until the model itself is too big to fit on one chip, at which point copying it everywhere is impossible.
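The gradient-averaging step can be checked in a few lines. This sketch simulates three "chips" inside one Python process, using a one-weight model `y = w * x` with mean squared error. The helper `all_reduce_mean` is a hypothetical stand-in for the collective operation a real framework would run over the network.

```python
def gradient(w, xs, ys):
    """Gradient of mean squared error for the one-weight model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def all_reduce_mean(values):
    """Stand-in for the collective that averages gradients across chips."""
    return sum(values) / len(values)


w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]

num_chips = 3
shard = len(xs) // num_chips  # equal-sized slices of the mini-batch

# Each chip holds the same w but computes a gradient on its own slice.
local_grads = [
    gradient(w, xs[c * shard:(c + 1) * shard], ys[c * shard:(c + 1) * shard])
    for c in range(num_chips)
]

synced_grad = all_reduce_mean(local_grads)  # what every chip ends up with
full_grad = gradient(w, xs, ys)             # what one big chip would compute
print(local_grads, synced_grad, full_grad)
```

With equal-sized slices, the averaged gradient matches the full-batch gradient, so every copy of the model takes the same update and stays identical.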

When the model won't fit, you split the *model* instead. Model (or tensor) parallelism cuts a single huge layer across chips — chip A holds the left half of a weight matrix, chip B the right half — so one matrix multiply is computed jointly, with chips exchanging partial results mid-operation. Pipeline parallelism instead puts *different layers* on different chips, like stations on an assembly line: chip 1 does the first few layers and hands its output to chip 2, and so on. The danger here is the "bubble" — chip 2 sits idle while it waits for chip 1's first output — so engineers feed in many micro-batches at once to keep every station busy, the way a real assembly line never stops at the first worker.
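The size of the bubble can be estimated with a simple count. The sketch assumes a fill-then-drain schedule in which every stage takes one equal time slot per micro-batch, and it counts only a single pass through the pipeline. Real schedules interleave forward and backward passes and are more involved.

```python
def bubble_fraction(num_stages, num_micro_batches):
    """Idle share of each stage in a simple fill-then-drain pipeline.

    Assumes every stage takes one equal time slot per micro-batch.
    The last micro-batch leaves the last stage after
    num_micro_batches + num_stages - 1 slots, and each stage is busy
    for exactly num_micro_batches of them.
    """
    total_slots = num_micro_batches + num_stages - 1
    busy_slots = num_micro_batches
    return (total_slots - busy_slots) / total_slots


# Three stages, as in a three-chip pipeline; vary the micro-batch count.
for micro_batches in (1, 4, 16, 64):
    print(micro_batches, "micro-batches ->", round(bubble_fraction(3, micro_batches), 3))
```

With one micro-batch, each of the three stages idles two thirds of the time; with many micro-batches the idle share shrinks toward zero, which is why the batch is chopped up.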

DATA PARALLEL        each chip = full model, different data slice
   chip0 [MODEL] <- batch[0:8]   \
   chip1 [MODEL] <- batch[8:16]   >  average gradients --> sync
   chip2 [MODEL] <- batch[16:24] /

MODEL PARALLEL       one big layer split across chips
   chip0 [layer L: left half ]
   chip1 [layer L: right half]   exchange partials mid-multiply

PIPELINE PARALLEL    different layers on different chips
   chip0 [layers 1-8] -> chip1 [layers 9-16] -> chip2 [layers 17-24]

The three splits at a glance. Real training runs of large models stack all three at once ("3D parallelism"): pipeline across machines, tensor split within a machine, data parallel over the whole cluster.
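The model-parallel split can be verified on a toy layer. The sketch below uses one possible scheme, a column-wise split of the weight matrix, and simulates the two chips as plain Python lists. The helper name `vec_times_matrix` and the numbers are illustrative assumptions.

```python
def vec_times_matrix(x, weight):
    """Compute x @ weight for a vector x and a matrix stored as a list of rows."""
    cols = len(weight[0])
    return [sum(x[k] * weight[k][j] for k in range(len(x))) for j in range(cols)]


x = [1.0, 2.0]
weight = [[1.0, 2.0, 3.0, 4.0],
          [5.0, 6.0, 7.0, 8.0]]

# Column-wise split: each chip stores only half of the layer's weights.
half = len(weight[0]) // 2
left = [row[:half] for row in weight]    # lives on chip0
right = [row[half:] for row in weight]   # lives on chip1

out_chip0 = vec_times_matrix(x, left)    # computed on chip0
out_chip1 = vec_times_matrix(x, right)   # computed on chip1

# The exchange step: gather both partial outputs into the full result.
combined = out_chip0 + out_chip1
full = vec_times_matrix(x, weight)       # what one chip would have computed
print(combined, full)
```

Both chips need the same input `x`, and the outputs must be gathered afterward. That exchange is the communication cost this kind of split pays on every layer.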

Mixed precision: half the bits, nearly all the accuracy

There is a second lever, orthogonal to splitting work, that buys speed and memory almost for free. By default, every number in a network is stored as a 32-bit float, which is generous precision but expensive in both memory and the energy to move it. Mixed-precision training does most of the math in 16-bit (or even 8-bit) floats instead, which roughly halves the memory and bandwidth those values need and lets the chip's specialized units run faster, while keeping a 32-bit master copy of the weights for the parts that genuinely need the extra digits, such as accumulating many small updates. The surprise is how little accuracy you lose: training is already noisy, and networks are robust enough that shaving precision off most operations barely moves the final result.
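Why keep a 32-bit master copy at all? A small experiment with Python's standard `struct` module, whose `"e"` and `"f"` formats round to IEEE 754 half and single precision, shows what happens to tiny updates. The starting weight, update size, and step count are illustrative assumptions.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]


def to_fp32(value):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]


update = 1e-4  # a small weight update, applied 100 times

weight_fp16_only = to_fp16(1.0)  # weight stored only in 16 bits
master_fp32 = to_fp32(1.0)       # 32-bit master copy of the same weight

for _ in range(100):
    # Near 1.0, adjacent fp16 values are about 0.001 apart, so adding
    # 0.0001 rounds straight back to the old value.
    weight_fp16_only = to_fp16(weight_fp16_only + update)
    # The fp32 master copy is fine-grained enough to keep each update.
    master_fp32 = to_fp32(master_fp32 + update)

# The 16-bit working copy used for the fast math is re-derived from the master.
working_copy = to_fp16(master_fp32)
print(weight_fp16_only, master_fp32, working_copy)
```

The 16-bit-only weight never moves, while the master copy accumulates the updates and hands a correctly moved 16-bit working copy back to the fast units.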

The real cost of compute

All this hardware is expensive in ways that go well past the rental bill. There is money — training a frontier model can cost millions of dollars in chip-hours. There is power and the environmental cost of training — large runs draw megawatts, and the water and carbon footprint of the data center are real. And there is a subtler cost: engineering time, because a cluster that should run for weeks will hit failed chips, network hiccups, and stalls, and someone has to babysit it. It helps to separate two regimes. Training is a one-time, brutal expense; serving the model afterward — its inference cost — is small per request but paid forever, on every single query from every user.
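The two regimes can be compared with one division. Every figure below is hypothetical and chosen only to show the shape of the calculation; none is a measured cost.

```python
def break_even_requests(training_cost, cost_per_request):
    """Requests after which cumulative serving cost equals the training cost."""
    return training_cost / cost_per_request


training_cost_usd = 5_000_000   # hypothetical one-time training bill
cost_per_request_usd = 0.002    # hypothetical inference cost per query
requests_per_day = 10_000_000   # hypothetical traffic

requests = break_even_requests(training_cost_usd, cost_per_request_usd)
days = requests / requests_per_day
print(f"serving matches training after {requests:,.0f} requests (~{days:.0f} days)")
```

With these made-up numbers, serving overtakes training in well under a year, which is why a popular model's lifetime bill can be dominated by inference rather than by the training run.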

This cost structure is also why the "bitter lesson" — that throwing more compute and data at general methods tends to beat hand-crafted cleverness — is true but easy to misread. "Just scale it up" works only if you can afford the bill, control the engineering, and accept diminishing returns. The scaling laws researchers measured are real and smooth, but they are *power laws*: each new increment of capability costs disproportionately more compute than the last. Scaling is a strategy, not a free escalator to ever-greater capability — and certainly not an automatic path to general intelligence. The honest framing is engineering economics: you are buying a known amount of improvement at a steeply rising price.
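The price curve of a power law is easy to work out. The sketch assumes loss follows `loss = a * compute ** (-exponent)` and uses a hypothetical exponent of 0.1. Exponents reported in practice depend on the model family and on what is being measured.

```python
def compute_multiplier(loss_ratio, exponent):
    """Factor by which compute must grow for loss to fall to loss_ratio of its value.

    Assumes loss = a * compute ** (-exponent), so
    loss_ratio = multiplier ** (-exponent), which gives
    multiplier = loss_ratio ** (-1 / exponent).
    """
    return loss_ratio ** (-1.0 / exponent)


exponent = 0.1  # hypothetical scaling exponent

for ratio in (0.9, 0.8, 0.5):
    factor = compute_multiplier(ratio, exponent)
    print(f"loss x{ratio}: compute x{factor:,.1f}")
```

Under this assumption, trimming loss by 10% costs about 2.9 times the compute, while halving it costs about 1,024 times. The curve is smooth and predictable, and so is the steeply rising bill.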

That is the encouraging close to this guide: hardware is a lever, not a destiny. The next rungs are largely about *spending less*: shrinking models so they serve cheaply, batching requests so each chip does more, and watching the trade-off between latency and throughput. You don't need a thousand GPUs to do excellent work. You need to understand what the hardware is good at, where the costs hide, and when to scale up versus when to scale *smart*.