When the general machine runs out of road
A modern CPU is a marvel of generality. It speculates past branches, reorders instructions, prefetches data, and keeps several arithmetic units fed — all so that *whatever* program you throw at it runs respectably well. But generality has a price. To run an unknown instruction stream, the CPU must spend most of its transistors and most of its energy on *control*: fetching, decoding, predicting, scheduling. The actual arithmetic — the part you paid for — is a thin sliver of the silicon and a thin sliver of the power budget.
Now suppose you *do* know the workload. Deep-learning inference, scientific simulation, graphics shading — these are not arbitrary spaghetti. They are the same handful of operations, mostly a matrix multiply or a stencil, repeated across oceans of data with almost no branching. For a workload that regular, a general CPU is a tragic mismatch: you are paying for branch predictors you never exercise and out-of-order machinery you never need. The escape hatch is a domain-specific architecture (DSA) — a chip co-designed with one class of algorithm, where nearly every transistor does useful arithmetic.
The GPU: thousands of threads marching in step
The first step away from the CPU is not to specialise the *arithmetic* but to specialise the *control*. A GPU makes a brutal bet: most of my work is the same instruction applied to vast amounts of different data. So instead of one elaborate control unit steering one thread, build one modest control unit steering thirty-two threads at once. That bundle of thirty-two is called a *warp* (NVIDIA) or *wavefront* (AMD), and the execution model is SIMT — Single Instruction, Multiple Threads.
Picture a rowing crew. One coxswain calls the stroke; thirty-two rowers pull together. You amortise the cost of *one* fetch-and-decode across thirty-two lanes of arithmetic, so the control overhead per useful operation collapses. A single high-end GPU streaming multiprocessor can keep dozens of warps *resident* simultaneously, and here is the trick that hides memory latency: when one warp stalls waiting on data, the scheduler instantly swaps in another ready warp. With enough threads in flight, the arithmetic units never idle. The GPU does not *avoid* the long trip to memory the way a cache tries to — it *covers* the trip by always having other work to do.
There is a catch that every GPU programmer learns the hard way. Because thirty-two threads share one program counter, a branch that sends *some* lanes one way and the rest another forces the hardware to run both paths, masking off the idle lanes each time. This is *warp divergence*, and it can halve or quarter your throughput. SIMT loves code that is wide, regular, and branch-free — which is exactly why it is so good at linear algebra and so awkward at, say, parsing.
Naming the real enemy: the memory wall
Before we build something even more specialised, we have to be honest about *why* matrix-multiply hardware looks the way it does. The bottleneck is almost never the multipliers. A multiply-accumulate (MAC) in a modern process costs a fraction of a picojoule. Moving the operands costs far more. Reading one number from off-chip DRAM can cost a hundred times the energy of the multiply that consumes it — and take hundreds of cycles. This gulf between cheap arithmetic and expensive data movement is the memory wall, and it is *the* defining constraint of modern accelerator design.
Rough energy per operation (45 nm-class, order-of-magnitude): 32-bit integer ADD ............. 0.1 pJ <- arithmetic is nearly free 32-bit float MULTIPLY .......... 4 pJ read 32 bits from on-chip SRAM . 5 pJ <- the register/cache file read 32 bits from on-chip cache . ~50 pJ read 32 bits from off-chip DRAM . 640 pJ <- ~100x a multiply! Lesson: the joules are in the WIRES, not the math. A datum you fetch once and reuse 100 times is 100x cheaper per result than a datum you fetch fresh every multiply.
Two consequences follow, and they shape everything in this rung. First, you fight the wall with bandwidth: stacked high-bandwidth memory (HBM) sits inches — millimetres, really — from the compute die and delivers terabytes per second through thousands of wires, instead of the narrow bus to ordinary DRAM. Second, and more cleverly, you fight the wall with reuse: design the datapath so that once a number is on-chip, it is multiplied many times before it leaves. The systolic array is the most elegant machine ever built around that second idea.
The systolic array: arithmetic as a heartbeat
H. T. Kung named the systolic array after the human heart: data is pumped, rhythmically, through a grid of tiny identical cells, each one beating on the same clock. There is no shared register file and almost no control logic. Each cell does exactly one thing every cycle — multiply two incoming numbers, add the product to a running total, and pass the inputs along to its neighbours. The genius is that operands ripple from cell to cell, so a value loaded once at the edge of the array is reused by an entire row or column of multipliers before it ever needs to be re-fetched from memory.
This is a dataflow architecture in its purest form. A CPU is *control-driven*: a program counter decides what happens next. A systolic array is *data-driven*: computation happens wherever and whenever the data arrives. There is no instruction to fetch, no branch to predict, no operand to name — the wiring *is* the program. That is why a 256×256 array of MACs, like the one in Google's first TPU, can perform 65,536 multiply-adds every single clock while reading only a couple of values per cycle from memory.
One processing element (PE) of a weight-stationary array.
The weight W is loaded ONCE and stays put; activations flow through.
a_in (activation, flows left -> right)
|
psum v a_out -> (to PE on the right)
_in -> [ * + ] -------->
| ^
| W (weight: loaded once, held for the whole matmul)
v
psum_out (partial sum, flows top -> bottom)
each clock: psum_out = psum_in + (a_in * W)
a_out = a_in // hand activation to neighbourWorked example: streaming a 2×2 matrix multiply
Let us make the heartbeat concrete on the smallest array that shows the trick: a 2×2 grid computing C = A · B. The four cells will each accumulate one entry of C. The weights B sit *stationary* in the cells; the rows of A flow in from the left, skewed in time so that the right operands meet at the right cell on the right cycle. That deliberate diagonal skew is the secret choreography of every systolic array.
Compute C = A . B with stationary weights B in a 2x2 array.
A = | a11 a12 | B = | b11 b12 |
| a21 a22 | | b21 b22 |
Weights pinned in cells: PE[r][c] holds B[r][c]
PE00<-b11 PE01<-b12
PE10<-b21 PE11<-b22
Activations a-row enters from the LEFT, skewed by one cycle per row
so column 2 arrives one beat after column 1:
row1 feed: a11, a12 (start t=1)
row2 feed: a21, a22 (start t=2, one cycle later)
Each cell: psum += a_in * W , then forward a_in downward/rightward.
After the wave passes, the partial sums have summed exactly:
c11 = a11*b11 + a12*b21
c12 = a11*b12 + a12*b22
c21 = a21*b11 + a22*b21
c22 = a21*b12 + a22*b22 <-- a full matrix multiply, no register file- Load the weights. Each B entry is streamed in once and latched into its cell. From here on, B never moves — that is the reuse we are buying.
- Feed A, skewed. Push row 1 of A in cycle 1, row 2 in cycle 2. The one-cycle stagger lines each activation up with the partial sum it must join.
- Beat the clock. Every cycle, each cell multiplies its incoming activation by its stored weight, adds to its accumulator, and passes the activation to its neighbour. No cell is ever idle once the wave reaches it.
- Drain the results. After the activation wave has swept through, each accumulator holds one finished entry of C. Read them out at the array's edge while the next tile is already streaming in.
Count the memory traffic, because that is the whole point. To produce four results we loaded four weights and eight activations — twelve values — and performed eight multiply-adds. Scale that 2×2 toy up to a 256×256 array and the arithmetic grows as N², 65,536 MACs per cycle, while the operands you feed grow only as N. Each weight you load is reused across an entire column; each activation across an entire row. The array turns one trip to memory into hundreds of multiplies — which is exactly the cure the memory wall demanded.
Assembling a real accelerator — and knowing its limits
A shipping NPU or TPU is the systolic array plus the scaffolding that keeps it fed. Wrapped around the MAC grid you will find a large on-chip SRAM scratchpad (a software-managed cousin of the cache, explicitly loaded rather than guessed at), a stack of HBM for the model's weights, and special-function units for the things a matrix multiply can't do — the activation functions, normalisation, and pooling that sit between layers. The whole chip is a pipeline at the macro scale: stream a tile in, multiply, apply the activation, write the result back, repeat.
Notice how the three machines form a ladder of specialisation. A multicore CPU gives you a few dozen fully independent, branch-happy threads. A GPU gives you tens of thousands of lock-stepped threads and keeps general-purpose flexibility. The systolic array throws away flexibility entirely and hard-wires the dataflow of one operation. Each rung up buys efficiency by spending generality — the domain-specific bargain again, now visible as a spectrum rather than a single choice.
And that is the applied climax of the architecture half of this track. We began with a single instruction crawling through a CPU and end with a wavefront of arithmetic flowing through silicon at terabytes per second. The thread connecting them is not cleverer math — it is the relentless, physical truth that computation is cheap and data movement is dear, and that the winning designs are the ones that rearrange the wires so a number, once fetched, earns its keep many times over.