Feeding the Beast: Caches and the Memory Hierarchy

The race the processor keeps losing

Imagine a master chef who can chop, sear, and plate a dish in one second flat — but the only pantry is a warehouse a ten-minute walk away. It doesn't matter how fast her knife is; she spends her whole shift walking. That is exactly the situation a modern CPU is in. The core's arithmetic units can retire several instructions per nanosecond, yet a trip to main memory to fetch one operand takes 60 to 120 nanoseconds. In the time it takes DRAM to answer once, the core could have executed hundreds of instructions. The beast is starving while it waits for food.

This is the [[memory-wall|memory wall]], and it grew out of a simple divergence. For decades processor clock speeds climbed steeply — partly thanks to Dennard scaling, which let each smaller transistor switch faster at the same power density. DRAM, by contrast, is built around a tiny capacitor that must be charged and sensed; its raw access latency has barely improved in thirty years. So the two curves separated. CPUs got ~1000× faster; DRAM latency improved maybe 10×. The gap is the wall.

Why a staircase works: locality

If memory accesses were truly random — every byte equally likely at every moment — no trick could help; you'd pay the DRAM penalty almost every time. We are saved by a deep empirical truth about real programs called locality, and it comes in two flavours.

Temporal locality — if you touched an address now, you'll probably touch it again soon. Loop counters, the top of the stack, a hot lookup table: programs return to the same handful of locations over and over.
Spatial locality — if you touched an address, you'll probably touch its neighbours soon. Walking through an array, executing instructions in sequence, reading a struct field-by-field: accesses cluster in nearby addresses.

Locality is what makes a memory hierarchy pay off. We build a staircase of stores, each step bigger and slower than the one above it. The topmost steps are tiny and blisteringly fast; the bottom is huge and slow. We keep the recently- and soon-to-be-used data near the top. Because of temporal locality, the data we just used is still up there when we need it again. Because of spatial locality, we fetch data in *blocks* (a whole 64-byte line at once), so grabbing one byte cheaply pre-loads its neighbours.

Level        Typical size     Latency (cycles)   Made from
-----        ------------     ----------------   ---------
Registers    ~ 1 KB total     0                  flip-flops in datapath
L1 cache     32-64 KB         ~4                 fast 6T SRAM, split I/D
L2 cache     256 KB-2 MB      ~12-15             SRAM, per-core
L3 cache     8-64 MB          ~40                SRAM, shared by all cores
Main memory  8-128 GB         ~200-400           DRAM (1T1C cells), off-chip
SSD / NVMe   0.5-8 TB         ~ 100,000+         flash, non-volatile

A representative hierarchy for a modern desktop/server core. Each step down is roughly 3–10× slower and ~10× larger. L1 uses an [[ic-sram-bitcell|SRAM bitcell]]; main memory uses a [[ic-dram-cell|DRAM cell]].

A worked example: hit, miss, and the cost of a stride

Let's make the numbers concrete. A cache line is the unit of transfer — say 64 bytes, which holds 16 four-byte integers. When the core asks for an address, the cache checks whether the line containing it is already present. If yes, it's a hit (fast); if no, a miss — the cache fetches the whole 64-byte line from the next level down and pays the penalty.

// Sum a 1,000,000-element int array (4 bytes each, 64-byte lines)
// => 16 ints per line.
long sum = 0;
for (int i = 0; i < N; i++) sum += a[i];

Access pattern, sequential (stride 1):
  a[0]  MISS  -> fetch line, brings a[0..15]
  a[1]  hit   a[2]  hit   ...   a[15]  hit
  a[16] MISS  -> fetch next line, brings a[16..31]
  ...
  => 1 miss per 16 accesses  ->  miss rate = 1/16 = 6.25%

Effective access time (AMAT):
  AMAT = hit_time + miss_rate * miss_penalty
       = 4 cyc  + 0.0625 * 200 cyc
       = 4 + 12.5  =  16.5 cycles  (average per element)

Sequential access turns one expensive miss into 16 cheap hits — spatial locality in action. The Average Memory Access Time (AMAT) formula is the lens you'll use for every caching decision.

Now break the locality. Suppose you walk the same array but with stride 16 — touching `a[0], a[16], a[32], …` — perhaps because you're reading one column of a matrix stored row-major. Now *every* access lands on a fresh line, so every access misses:

Stride-16 access:
  a[0]  MISS   a[16] MISS   a[32] MISS  ...  every one a miss
  => miss rate ~ 100%
  AMAT = 4 + 1.00 * 200 = 204 cycles per element

  Slowdown vs. sequential:  204 / 16.5  ~=  12x  SLOWER

Same instructions. Same arithmetic. Same data.
Only the *order of access* changed.

The exact same loop runs ~12× slower purely because the access pattern defeated spatial locality. This is why data layout (row-major vs column-major, array-of-structs vs struct-of-arrays) can dominate real performance.

How a cache decides where things go: associativity

A cache is much smaller than memory, so many memory addresses must compete for the same storage slots. *How* an address maps to a slot is the question of associativity, and it's a classic speed-vs-flexibility trade-off. Think of it as a coat-check.

Direct-mapped — every coat has exactly one assigned hook, decided by its number. Lookup is trivially fast (check one place), but two popular coats assigned to the same hook keep evicting each other even when the rest of the rack is empty. This is a *conflict miss*.
Fully associative — any coat can hang on any hook. Zero conflict misses, perfect flexibility — but to find a coat you must check *every* hook in parallel, which is expensive in power and area. Only tiny structures (like a TLB) afford this.
Set-associative — the sweet spot. The rack is divided into small sets (say 8 hooks each); a coat's number picks its set, then it may use *any* hook within that set. An '8-way set-associative' cache checks 8 candidates per lookup. Real L1/L2/L3 caches live here, typically 4-, 8-, or 16-way.

When a set is full and a new line arrives, the cache must evict someone — this is the replacement policy. The intuitive choice is LRU (Least Recently Used): kick out whoever hasn't been touched for longest, betting on temporal locality. True LRU is costly to track for high associativity, so real designs use cheap approximations (pseudo-LRU trees, re-reference prediction). The principle stays: evict the line least likely to be needed again soon.

Address breakdown for a 32 KB, 8-way, 64-byte-line L1 (32-bit addr):

  line size  = 64 B            -> offset = log2(64)  = 6 bits
  num lines  = 32768 / 64 = 512
  num sets   = 512 / 8 (ways) = 64
                               -> index  = log2(64)  = 6 bits
  tag        = 32 - 6 - 6                = 20 bits

  +---------------------+--------+--------+
  |        TAG (20)      | INDEX  | OFFSET |
  |                     |  (6)   |  (6)   |
  +---------------------+--------+--------+

  Lookup: INDEX picks 1 of 64 sets; compare TAG against all
  8 ways in parallel; OFFSET picks the byte within the line.

How an address is sliced to find a line. The index selects a set; tags of all ways in that set are compared at once; the offset picks the byte. This compare-in-parallel is exactly why higher associativity costs more power and area.

Bandwidth, not just latency — and HBM

Caches attack latency: they hide the long wait for any single value. But there's a second wall: bandwidth — how many bytes per second you can actually pump in. When a workload's working set is far too big to cache (think a neural network with billions of weights, or a graph that spills DRAM), caches stop helping and the machine streams data continuously. Now the limiter isn't 'how long is one trip' but 'how wide is the highway'.

The classic fix for bandwidth is wide, fast interfaces — but a normal DRAM chip on a circuit board can only have so many pins, and driving signals across centimetres of board burns power and limits speed. The modern answer is [[high-bandwidth-memory|High-Bandwidth Memory (HBM)]]: stack several DRAM dies vertically, connect them with thousands of vertical wires (through-silicon vias), and place the whole stack *millimetres* from the processor on a silicon interposer. Instead of a 64-bit-wide bus, you get a 1024-bit-wide interface per stack, running at modest clocks but moving terabytes per second.

This is the punchline that echoes through the rest of this track. Once compute became cheap and abundant — thousands of multiply units on one die — the scarce resource became *moving data to them*. That single realization is why a systolic array is shaped to reuse each loaded value across many operations, why AI accelerators are rated by memory bandwidth as much as by FLOPS, and why architects now measure designs in *energy per byte moved*, not just operations per second. The memory hierarchy isn't a footnote to the processor — increasingly, it *is* the processor's design.

Putting it together: the pipeline meets the hierarchy

Recall the instruction pipeline from the previous rung: a steady stream of instructions flows through fetch, decode, execute, memory, writeback. The whole illusion of one-instruction-per-cycle throughput collapses the instant a load misses and the pipeline has to stall for 200 cycles waiting on DRAM. Caches and the hierarchy exist to keep that stall rare — to feed the pipeline fast enough that the execution units stay busy.

Modern cores go further to fight the wall: prefetchers watch your access pattern and fetch the next line before you ask; out-of-order execution lets the core run later independent instructions while one load is stuck waiting; multiple outstanding misses let several DRAM trips overlap so their latencies hide behind each other. Each is a different way of saying the same thing: never let the beast sit idle waiting for memory.