The memory wall
Every guide on this rung so far has been about making compute *faster* — smaller transistors, FinFET and GAA channels, more cores, specialized engines. But a processor that can do a trillion multiplications a second is useless if it spends most of its time *waiting for the numbers to multiply*. For decades, logic speed and memory speed grew at very different rates: processor throughput climbed steeply, while the speed at which main memory could deliver data crawled. The gap between them widened year after year, and that widening gap has a name — the memory wall.
It helps to see *why* the two diverged. Logic rode Moore's law hard: pack more, switch faster, repeat. Main memory — DRAM — is built for density first, cheap bits by the billion, and the physics of reading a bit out of a far-away storage cell and shipping it across a circuit board simply did not speed up the same way. So the chip kept getting hungrier while the kitchen kept serving at the same pace. The result is a processor that is, much of the time, *idle* — not because it ran out of work, but because the data it needs has not arrived yet.
Bandwidth, not just capacity
Here is the distinction that trips most people up. Memory has two very different numbers. Capacity is *how much* it can hold — gigabytes, the size of the warehouse. Bandwidth is *how fast* you can move data in and out — gigabytes per second, the width of the loading dock. People instinctively worry about capacity ("is my model too big to fit?"), but on a modern accelerator the thing that actually runs out, again and again, is bandwidth.
A kitchen analogy makes it concrete. Capacity is the size of your pantry; bandwidth is how many ingredients you can carry from the pantry to the stove per minute. You can own a warehouse of food and still cook slowly if there's a single narrow doorway between the two. Better still, you can measure the problem: if your processor needs, say, 100 bytes of data for every arithmetic operation, then no matter how fast the arithmetic unit is, the doorway sets your real speed. Engineers track this as arithmetic intensity: operations done per byte fetched. When a workload's intensity is lower than the ratio of the chip's peak compute rate to its memory bandwidth, you are *memory-bound*, and a faster compute engine buys you nothing at all.
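The doorway argument can be sketched as a small calculation, in the spirit of what is usually called a roofline estimate. The machine numbers below (`PEAK`, `BANDWIDTH`) are hypothetical and chosen only to make the arithmetic easy; the function names are illustrative, not from any library.

```python
def attainable_ops_per_s(peak_ops_per_s: float,
                         bandwidth_bytes_per_s: float,
                         intensity_ops_per_byte: float) -> float:
    """Upper bound on speed: the lower of the compute limit and the memory limit."""
    memory_limit = bandwidth_bytes_per_s * intensity_ops_per_byte
    return min(peak_ops_per_s, memory_limit)


def is_memory_bound(peak_ops_per_s: float,
                    bandwidth_bytes_per_s: float,
                    intensity_ops_per_byte: float) -> bool:
    """True when memory, not the arithmetic unit, sets the speed."""
    # Ops per byte needed to keep the compute engine fully busy
    ridge = peak_ops_per_s / bandwidth_bytes_per_s
    return intensity_ops_per_byte < ridge


# Hypothetical machine (assumed numbers, not a real product)
PEAK = 1e12        # 1 trillion operations per second
BANDWIDTH = 1e11   # 100 GB/s

# The example from the text: 100 bytes fetched per operation = 0.01 ops/byte
low = attainable_ops_per_s(PEAK, BANDWIDTH, 1 / 100)
# Same workload on a compute engine twice as fast
low_faster_chip = attainable_ops_per_s(2 * PEAK, BANDWIDTH, 1 / 100)
# A workload that reuses each byte heavily: 50 ops/byte
high = attainable_ops_per_s(PEAK, BANDWIDTH, 50.0)

print(f"0.01 ops/byte:              {low:.3g} ops/s")
print(f"0.01 ops/byte, 2x compute:  {low_faster_chip:.3g} ops/s")
print(f"50 ops/byte:                {high:.3g} ops/s")
```

Under these assumptions the low-intensity workload runs at roughly a thousandth of peak, and doubling the compute engine leaves that figure unchanged. Only the high-intensity workload reaches the compute limit.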
HBM: stacking DRAM
So how do you widen the loading dock? One brute-force answer: instead of laying memory chips out *flat* on a board and reaching them through a narrow set of pins, stack them vertically into a tall tower and drill connections straight down through the silicon. That is High Bandwidth Memory (HBM): a stack of ordinary DRAM dies — typically 8, 12, or more — sitting one on top of another, bonded into a single cube.
The magic is in the vertical wiring. Each die in the stack is pierced by thousands of through-silicon vias (TSVs) — tiny copper-filled holes that run *through* the silicon, connecting one die directly to the die above and below. Instead of signals going out to the edge of a chip, across a board, and back, they take a short elevator ride straight up the stack. Short connections mean you can afford a *lot* of them, and a lot of parallel connections is exactly what bandwidth is made of: a 1024-bit-wide path per stack, versus the 32- or 64-bit paths of conventional memory.
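To see why width matters so much, here is a rough back-of-envelope sketch: peak bandwidth is the number of data wires times the rate of each wire. The per-pin rate `PIN_RATE` is an assumed, illustrative figure, deliberately set the same for both cases so that only the bus width differs; real parts run their pins at different speeds.

```python
def peak_bandwidth_gb_per_s(bus_width_bits: int, gbit_per_s_per_pin: float) -> float:
    """Peak bandwidth = data wires x rate per wire, converted from bits to bytes."""
    return bus_width_bits * gbit_per_s_per_pin / 8


# Assumed per-pin data rate (illustrative), identical for both buses
PIN_RATE = 6.4  # Gbit/s per data pin

conventional = peak_bandwidth_gb_per_s(64, PIN_RATE)    # one 64-bit path
hbm_stack = peak_bandwidth_gb_per_s(1024, PIN_RATE)     # one 1024-bit stack

print(f"64-bit path:     {conventional:.1f} GB/s")
print(f"1024-bit stack:  {hbm_stack:.1f} GB/s")
print(f"ratio:           {hbm_stack / conventional:.0f}x")
```

With the pin rate held equal, 16 times the width gives 16 times the bandwidth: 51.2 GB/s versus 819.2 GB/s in this sketch.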

    ordinary DRAM (flat, narrow bus)          HBM stack (tall, very wide bus)

    [DRAM] [DRAM] [DRAM]                      +----------+
       |      |      |                        |  DRAM 8  |
       +------+------+  <- ~64-bit bus,       +----------+
              |            long board traces  |  DRAM 7  |  } each die
          [ logic ]                           +----------+    pierced by
                                              |   ...    |    1000s of
    narrow = few wires, far apart             +----------+    TSVs (||||)
                                              |  DRAM 1  |
                                              +----------+
                                              | base die |  <- 1024-bit bus
                                              +----------+
                                                  ||||      TSVs straight up

HBM on the interposer
A 1024-bit bus is wonderful, but it creates a new problem: a thousand wires have to get from the memory cube into the logic chip, and an ordinary package can't route that many fine connections side by side. The solution is to seat *both* the HBM stack and the logic die on a shared slab of silicon called an interposer — essentially a tiny, ultra-dense circuit board, but made of silicon so it can carry wires as fine as the chips themselves.
Because the interposer is silicon, it can carry thousands of microscopic traces, packed side by side, across the few millimetres that separate the memory from the logic. The HBM stack ends up *right next to* the processor, not across a board but practically touching it, joined by a bus far too wide to route any other way. This arrangement is a flagship example of heterogeneous integration and advanced packaging: the memory and the logic are made separately, on the processes each does best, then married on the interposer. It is the philosophy from the chiplet guide (build the right piece on the right process, then stitch), applied to memory.

                     short, very wide bus (1000s of wires)
       +-----------+   <==============================>    +-----------+
       | HBM stack |                                       |   LOGIC   |
       | (DRAM x8  |                                       | die (GPU/ |
       |  + TSVs)  |                                       | AI accel) |
       +-----------+                                       +-----------+
    ===|===========|=======================================|===========|===
       |     silicon INTERPOSER (fine wiring, both dies sit on it)     |
    =======================================================================
       |                       package substrate                       |
       +---------------------------------------------------------------+
         o     o     o     o     o     o     o     o     o     o     o  <- solder balls to board

Near- & in-memory compute
HBM widens the loading dock dramatically — but notice it's still the *same idea*: fetch the data, ship it to the compute engine, do the math there. Every byte still makes a round trip, and moving a byte costs energy and time. So a more radical question follows: what if we stopped hauling the data to the compute, and instead moved a little of the compute to the data?
That is the family of ideas called near-memory and in-memory compute. *Near-memory* puts simple processing logic right beside the memory (for instance, on the base die underneath an HBM stack) so that reductions and filtering happen before the data ever leaves. *In-memory* (or compute-in-memory) goes further still, performing the operation *inside* the memory array itself: because many cells in a memory grid share the same wires, the array can be coaxed into doing a multiply-and-add at very little extra cost, and that is exactly the operation AI leans on most. The motivation is blunt: if the wire is the bottleneck, the cheapest data movement is the one you never make.
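A toy model makes the near-memory saving concrete. It counts only the bytes that cross the memory bus when a filter keeps a small fraction of records. The record count, record size, and keep fraction are made-up illustrative values, and the function is a hypothetical helper rather than a model of any real device.

```python
def bytes_over_bus(num_records: int, record_bytes: int,
                   keep_fraction: float, filter_near_memory: bool) -> int:
    """Bytes that cross the memory bus to deliver the records that pass a filter."""
    if filter_near_memory:
        # Only the surviving records leave the memory stack
        return round(num_records * keep_fraction) * record_bytes
    # Every record is shipped to the processor, which then discards most of them
    return num_records * record_bytes


# Illustrative workload (assumed values)
RECORDS = 1_000_000_000   # one billion records
RECORD_BYTES = 64
KEEP = 0.01               # the filter keeps 1% of records

conventional = bytes_over_bus(RECORDS, RECORD_BYTES, KEEP, filter_near_memory=False)
near_memory = bytes_over_bus(RECORDS, RECORD_BYTES, KEEP, filter_near_memory=True)

print(f"filter at the processor: {conventional / 1e9:.2f} GB moved")
print(f"filter near the memory:  {near_memory / 1e9:.2f} GB moved")
```

In this sketch the bus traffic drops from 64 GB to 0.64 GB. The model ignores the cost of the near-memory logic itself, so it shows the shape of the benefit rather than a measured result.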
The AI bandwidth hunger
Every workload feels the memory wall, but AI slams into it hardest, and it's worth understanding why. Training and running a large neural network is, at its core, multiplying enormous matrices of numbers — billions of weights. The arithmetic per number is tiny (a multiply and an add), but you must stream *every weight* through the compute engine, often for every batch of inputs. That is the low-arithmetic-intensity, memory-bound case from earlier, at planetary scale: the model is so large it cannot live in fast on-chip memory, so it has to be fetched, continuously, from HBM.
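The weight-streaming argument can be put into rough numbers. The sketch below assumes a hypothetical model of 70 billion weights stored at 2 bytes each and a hypothetical 3 TB/s of memory bandwidth; none of these figures comes from the text. It counts only weight traffic and ignores activations and caches.

```python
def min_time_per_pass_s(num_weights: float, bytes_per_weight: float,
                        bandwidth_bytes_per_s: float) -> float:
    """Lower bound on one pass over the model: every weight is read once."""
    return num_weights * bytes_per_weight / bandwidth_bytes_per_s


def ops_per_byte(batch_size: int, bytes_per_weight: float) -> float:
    """Each fetched weight does one multiply and one add per input in the batch."""
    return 2 * batch_size / bytes_per_weight


# Hypothetical model and accelerator (assumed values)
WEIGHTS = 70e9           # 70 billion weights
BYTES_PER_WEIGHT = 2     # 16-bit weights
BANDWIDTH = 3e12         # 3 TB/s

t = min_time_per_pass_s(WEIGHTS, BYTES_PER_WEIGHT, BANDWIDTH)
print(f"at least {t * 1e3:.1f} ms per pass, whatever the compute speed")

for batch in (1, 8, 64):
    print(f"batch {batch:>2}: {ops_per_byte(batch, BYTES_PER_WEIGHT):.0f} ops per byte fetched")
```

Under these assumptions one pass takes at least about 47 ms regardless of how fast the multipliers are. Processing more inputs per fetched weight raises the arithmetic intensity, which is one reason batching helps.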
This is exactly why the AI accelerators we'll meet in the final guide — GPUs, TPUs, NPUs — are wrapped in stacks of HBM, and why each new chip generation brags about *memory bandwidth* as loudly as about compute. A domain-specific architecture for AI isn't just a faster multiplier; it's a faster multiplier fed by the widest possible pipe, with the data laid out so the pipe never runs dry. The capstone synthesis that closes this track ties the threads together: specialization, packaging, and bandwidth are not separate tricks but one strategy — when you can no longer make the transistor better, you make the *whole system* better, and feeding the beast is half the battle.