The race the processor keeps losing
Imagine a master chef who can chop, sear, and plate a dish in one second flat — but the only pantry is a warehouse a ten-minute walk away. It doesn't matter how fast her knife is; she spends her whole shift walking. That is exactly the situation a modern CPU is in. The core's arithmetic units can retire several instructions per nanosecond, yet a trip to main memory to fetch one operand takes 60 to 120 nanoseconds. In the time it takes DRAM to answer once, the core could have executed hundreds of instructions. The beast is starving while it waits for food.
This is the [[memory-wall|memory wall]], and it grew out of a simple divergence. For decades processor clock speeds climbed steeply — partly thanks to Dennard scaling, which let each smaller transistor switch faster at the same power density. DRAM, by contrast, is built around a tiny capacitor that must be charged and sensed; its raw access latency has barely improved in thirty years. So the two curves separated. CPUs got ~1000× faster; DRAM latency improved maybe 10×. The gap is the wall.
Why a staircase works: locality
If memory accesses were truly random — every byte equally likely at every moment — no trick could help; you'd pay the DRAM penalty almost every time. We are saved by a deep empirical truth about real programs called locality, and it comes in two flavours.
- Temporal locality — if you touched an address now, you'll probably touch it again soon. Loop counters, the top of the stack, a hot lookup table: programs return to the same handful of locations over and over.
- Spatial locality — if you touched an address, you'll probably touch its neighbours soon. Walking through an array, executing instructions in sequence, reading a struct field-by-field: accesses cluster in nearby addresses.
Locality is what makes a memory hierarchy pay off. We build a staircase of stores, each step bigger and slower than the one above it. The topmost steps are tiny and blisteringly fast; the bottom is huge and slow. We keep the recently- and soon-to-be-used data near the top. Because of temporal locality, the data we just used is still up there when we need it again. Because of spatial locality, we fetch data in *blocks* (a whole 64-byte line at once), so grabbing one byte cheaply pre-loads its neighbours.
Level Typical size Latency (cycles) Made from ----- ------------ ---------------- --------- Registers ~ 1 KB total 0 flip-flops in datapath L1 cache 32-64 KB ~4 fast 6T SRAM, split I/D L2 cache 256 KB-2 MB ~12-15 SRAM, per-core L3 cache 8-64 MB ~40 SRAM, shared by all cores Main memory 8-128 GB ~200-400 DRAM (1T1C cells), off-chip SSD / NVMe 0.5-8 TB ~ 100,000+ flash, non-volatile
A worked example: hit, miss, and the cost of a stride
Let's make the numbers concrete. A cache line is the unit of transfer — say 64 bytes, which holds 16 four-byte integers. When the core asks for an address, the cache checks whether the line containing it is already present. If yes, it's a hit (fast); if no, a miss — the cache fetches the whole 64-byte line from the next level down and pays the penalty.
// Sum a 1,000,000-element int array (4 bytes each, 64-byte lines)
// => 16 ints per line.
long sum = 0;
for (int i = 0; i < N; i++) sum += a[i];
Access pattern, sequential (stride 1):
a[0] MISS -> fetch line, brings a[0..15]
a[1] hit a[2] hit ... a[15] hit
a[16] MISS -> fetch next line, brings a[16..31]
...
=> 1 miss per 16 accesses -> miss rate = 1/16 = 6.25%
Effective access time (AMAT):
AMAT = hit_time + miss_rate * miss_penalty
= 4 cyc + 0.0625 * 200 cyc
= 4 + 12.5 = 16.5 cycles (average per element)Now break the locality. Suppose you walk the same array but with stride 16 — touching `a[0], a[16], a[32], …` — perhaps because you're reading one column of a matrix stored row-major. Now *every* access lands on a fresh line, so every access misses:
Stride-16 access: a[0] MISS a[16] MISS a[32] MISS ... every one a miss => miss rate ~ 100% AMAT = 4 + 1.00 * 200 = 204 cycles per element Slowdown vs. sequential: 204 / 16.5 ~= 12x SLOWER Same instructions. Same arithmetic. Same data. Only the *order of access* changed.
How a cache decides where things go: associativity
A cache is much smaller than memory, so many memory addresses must compete for the same storage slots. *How* an address maps to a slot is the question of associativity, and it's a classic speed-vs-flexibility trade-off. Think of it as a coat-check.
- Direct-mapped — every coat has exactly one assigned hook, decided by its number. Lookup is trivially fast (check one place), but two popular coats assigned to the same hook keep evicting each other even when the rest of the rack is empty. This is a *conflict miss*.
- Fully associative — any coat can hang on any hook. Zero conflict misses, perfect flexibility — but to find a coat you must check *every* hook in parallel, which is expensive in power and area. Only tiny structures (like a TLB) afford this.
- Set-associative — the sweet spot. The rack is divided into small sets (say 8 hooks each); a coat's number picks its set, then it may use *any* hook within that set. An '8-way set-associative' cache checks 8 candidates per lookup. Real L1/L2/L3 caches live here, typically 4-, 8-, or 16-way.
When a set is full and a new line arrives, the cache must evict someone — this is the replacement policy. The intuitive choice is LRU (Least Recently Used): kick out whoever hasn't been touched for longest, betting on temporal locality. True LRU is costly to track for high associativity, so real designs use cheap approximations (pseudo-LRU trees, re-reference prediction). The principle stays: evict the line least likely to be needed again soon.
Address breakdown for a 32 KB, 8-way, 64-byte-line L1 (32-bit addr):
line size = 64 B -> offset = log2(64) = 6 bits
num lines = 32768 / 64 = 512
num sets = 512 / 8 (ways) = 64
-> index = log2(64) = 6 bits
tag = 32 - 6 - 6 = 20 bits
+---------------------+--------+--------+
| TAG (20) | INDEX | OFFSET |
| | (6) | (6) |
+---------------------+--------+--------+
Lookup: INDEX picks 1 of 64 sets; compare TAG against all
8 ways in parallel; OFFSET picks the byte within the line.Bandwidth, not just latency — and HBM
Caches attack latency: they hide the long wait for any single value. But there's a second wall: bandwidth — how many bytes per second you can actually pump in. When a workload's working set is far too big to cache (think a neural network with billions of weights, or a graph that spills DRAM), caches stop helping and the machine streams data continuously. Now the limiter isn't 'how long is one trip' but 'how wide is the highway'.
The classic fix for bandwidth is wide, fast interfaces — but a normal DRAM chip on a circuit board can only have so many pins, and driving signals across centimetres of board burns power and limits speed. The modern answer is [[high-bandwidth-memory|High-Bandwidth Memory (HBM)]]: stack several DRAM dies vertically, connect them with thousands of vertical wires (through-silicon vias), and place the whole stack *millimetres* from the processor on a silicon interposer. Instead of a 64-bit-wide bus, you get a 1024-bit-wide interface per stack, running at modest clocks but moving terabytes per second.
This is the punchline that echoes through the rest of this track. Once compute became cheap and abundant — thousands of multiply units on one die — the scarce resource became *moving data to them*. That single realization is why a systolic array is shaped to reuse each loaded value across many operations, why AI accelerators are rated by memory bandwidth as much as by FLOPS, and why architects now measure designs in *energy per byte moved*, not just operations per second. The memory hierarchy isn't a footnote to the processor — increasingly, it *is* the processor's design.
Putting it together: the pipeline meets the hierarchy
Recall the instruction pipeline from the previous rung: a steady stream of instructions flows through fetch, decode, execute, memory, writeback. The whole illusion of one-instruction-per-cycle throughput collapses the instant a load misses and the pipeline has to stall for 200 cycles waiting on DRAM. Caches and the hierarchy exist to keep that stall rare — to feed the pipeline fast enough that the execution units stay busy.
Modern cores go further to fight the wall: prefetchers watch your access pattern and fetch the next line before you ask; out-of-order execution lets the core run later independent instructions while one load is stuck waiting; multiple outstanding misses let several DRAM trips overlap so their latencies hide behind each other. Each is a different way of saying the same thing: never let the beast sit idle waiting for memory.