Going Wide: ILP, Multicore and the End of Free Lunch

The free lunch that lasted thirty years

Imagine writing a program in 1985 that took ten seconds to run. You do nothing, change not a single line, and five years later it finishes in two seconds. Wait another five years and it's under a second. This actually happened, for decades, and programmers came to expect it the way you expect the sun to rise. It even had a name, coined by Herb Sutter in 2005: the free lunch. The engine behind it was two intertwined trends. Moore's law doubled the number of transistors on a chip roughly every two years, and Dennard scaling said that as transistors shrank, their power density stayed constant, so you could clock them faster without melting the chip.

Higher clock frequency was the most direct lever: a 100 MHz chip becoming a 3 GHz chip is, all else equal, thirty times faster on the same program. But architects had a second, subtler trick. Even when the clock couldn't go faster, they could make the processor do more work per clock cycle — and that is the story of instruction-level parallelism.

Squeezing parallelism out of one instruction stream

Back in rung 2 we met the pipeline: an assembly line where each instruction marches through fetch, decode, execute, memory, and write-back stages. A pipeline already overlaps several instructions, but it still only *finishes* one per cycle in the best case. Instruction-level parallelism (ILP) goes further by noticing that nearby instructions are often independent — they don't depend on each other's results — so why not start more than one at the same time?

A superscalar processor does exactly this. It has multiple execution units (say two integer ALUs, a load/store unit, and a floating-point unit), and it fetches and issues several instructions every cycle. A modern high-performance core is 4 to 8 wide, meaning it can begin between four and eight instructions per clock, depending on the design. The CPI (cycles per instruction), instead of being stuck at 1 at best, can drop below 1 (we usually flip it and quote IPC, instructions per cycle).
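
To put numbers on CPI versus IPC, here is a minimal Python sketch of the standard execution-time relation (time = instructions x CPI / clock rate). The function name, the 3-billion-instruction workload, the 3 GHz clock, and the sustained IPC of 2.5 are all illustrative assumptions, not measurements of any real chip.

```python
def run_time_seconds(instructions, ipc, clock_hz):
    """Execution time = instructions * CPI / clock rate, with CPI = 1 / IPC."""
    cpi = 1.0 / ipc
    return instructions * cpi / clock_hz


# Hypothetical workload: 3 billion instructions on a 3 GHz core.
scalar_time = run_time_seconds(3e9, ipc=1.0, clock_hz=3e9)  # ideal scalar pipeline
wide_time = run_time_seconds(3e9, ipc=2.5, clock_hz=3e9)    # assumed sustained IPC

print(f"IPC 1.0: {scalar_time:.2f} s")
print(f"IPC 2.5: {wide_time:.2f} s")
```

Note that the sustained IPC plugged in here is deliberately lower than the issue width: on real code, dependences and cache misses leave many issue slots empty.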

But programs aren't written as neat independent bundles. Instruction 5 often needs the result of instruction 3. If the processor issues strictly in program order, one stalled instruction blocks everyone behind it, like a single slow car holding up a whole lane. The fix is out-of-order execution: decoded instructions wait in a pool (the reservation stations), the core runs whichever ones have their inputs ready, and a reorder buffer quietly retires the results in the original program order so the programmer never sees the reshuffling.

; Program order — instr 3 stalls on a cache miss
  I1: add  r1, r2, r3
  I2: mul  r4, r1, r5
  I3: load r6, [r7]      ; <-- misses cache, ~100+ cycles
  I4: sub  r8, r9, r10   ; independent of I3 — why wait?
  I5: xor  r11, r8, r12  ; depends on I4, not I3

In-order core:   I3 stalls -> I4, I5 stall too (lane blocked)
Out-of-order:    I4, I5 execute WHILE I3's load is in flight
                 results retire in order 1,2,3,4,5 -> program sees no change

Out-of-order execution lets independent work (I4, I5) proceed during a long [[ic-cache-memory|cache]] miss on I3, hiding latency the in-order core cannot.
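
The same five instructions can be pushed through a toy timing model. This is only a sketch under loud assumptions: the latencies (1 cycle for simple ALU ops, 3 for the multiply, 100 for the missing load) are made up for illustration, the in-order core is modelled crudely as fully blocking (each instruction starts only when the previous one has finished), and the out-of-order core is given unlimited issue width so an instruction starts the moment its producers finish. The names `Instr`, `PROGRAM`, and both functions are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    latency: int          # assumed cycles from start to result
    deps: tuple = ()      # names of earlier instructions whose results are needed


PROGRAM = [
    Instr("I1", 1),               # add
    Instr("I2", 3, ("I1",)),      # mul, needs I1
    Instr("I3", 100),             # load that misses the cache
    Instr("I4", 1),               # sub, independent of I3
    Instr("I5", 1, ("I4",)),      # xor, needs I4
]


def finish_times_in_order(program):
    """Blocking in-order model: each instruction waits for the previous one."""
    finish, clock = {}, 0
    for ins in program:
        clock += ins.latency      # earlier instructions (and so all deps) are done
        finish[ins.name] = clock
    return finish


def finish_times_out_of_order(program):
    """Dataflow model: an instruction starts as soon as its producers finish."""
    finish = {}
    for ins in program:
        start = max((finish[d] for d in ins.deps), default=0)
        finish[ins.name] = start + ins.latency
    return finish


in_order = finish_times_in_order(PROGRAM)
out_of_order = finish_times_out_of_order(PROGRAM)
for ins in PROGRAM:
    print(ins.name, in_order[ins.name], out_of_order[ins.name])
```

In this model the load still takes its 100 cycles either way; what changes is that I4 and I5 finish within the first couple of cycles instead of queueing behind it. On a longer program, that is the work that fills the miss shadow.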

When the physics ran out: Dennard's end and dark silicon

Around 2005 the free lunch was abruptly cancelled, and the culprit was power. Dennard scaling had promised that shrinking a transistor let you lower its voltage proportionally, keeping power per unit area flat. But you can only lower voltage so far before transistors stop switching cleanly and leakage current — current that flows even when the transistor is 'off' — explodes. Below roughly the 90 nm node, threshold voltage stopped scaling, leakage soared, and Dennard scaling broke. Suddenly, cranking the clock higher meant more power and more heat with nowhere to go.

Power ~= C * V^2 * f   (dynamic switching power)

  Dennard era:  shrink -> V down, C down  -> power/area constant
                so f could rise 'for free' each generation

  Post-2005:    V stuck (~0.7-1.0 V), leakage rises
                -> pushing f up just burns more watts as heat

  Thermal wall: ~100-150 W/cm^2 is about all you can cool
                in a cheap package -> clocks freeze near 3-4 GHz

Dynamic power scales with the square of voltage. Once voltage stopped shrinking, frequency hit a thermal wall.
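
The scaling argument can be checked with a few lines of arithmetic. The sketch below normalises every baseline quantity to 1 and applies the classic scaling rules for a linear shrink factor k: capacitance and (in the Dennard case) voltage fall by 1/k, frequency rises by k, and area per transistor falls by 1/k^2. The shrink factor of 1.4 per generation and the function name are assumptions for illustration, and leakage power is ignored entirely.

```python
import math


def scaled_power_density(k, voltage_scales):
    """Relative dynamic power per unit area after a linear shrink by factor k.

    Uses P = C * V^2 * f per transistor, with all baseline values equal to 1.
    """
    capacitance = 1.0 / k                          # smaller gate, less charge to move
    voltage = 1.0 / k if voltage_scales else 1.0   # post-Dennard: voltage is stuck
    frequency = k                                  # shorter gates switch faster
    area = 1.0 / k ** 2                            # each transistor occupies less area
    power_per_transistor = capacitance * voltage ** 2 * frequency
    return power_per_transistor / area


k = 1.4  # assumed shrink per process generation
dennard = scaled_power_density(k, voltage_scales=True)
stuck = scaled_power_density(k, voltage_scales=False)

print(f"Dennard era power density: {dennard:.2f}x")
print(f"Voltage stuck:             {stuck:.2f}x")
```

With voltage scaling, the factors cancel and power density stays flat. With voltage stuck, the same shrink and clock bump multiplies power density by k^2, which is the thermal wall in miniature.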

Moore's law, however, did not stop in 2005: transistors kept shrinking, and transistor counts kept doubling, for roughly another decade. This created a strange new gap. You could fit more transistors than ever onto a die, but you could no longer afford to *power them all at once* within the thermal budget. The result is dark silicon: at any given moment, a growing fraction of the chip must sit idle or run at low voltage, because lighting it all up would exceed the cooling limit.

The pivot: many cores instead of one fast core

If you can't make one core much faster, the obvious move is to put several of them on the same die. A multicore processor does just that: two, four, eight, or dozens of complete cores in one package, usually sharing a last-level cache and a memory controller. Why does this dodge the power wall? Because at the high end power scales roughly with the cube of frequency: dynamic power goes as V^2 * f, and you must raise voltage roughly in step with frequency to hit higher clocks. Two cores at 2 GHz can match the aggregate throughput of one core at 4 GHz for far *less* total power (about a quarter, under the cubic rule), provided your work can be split into two parallel pieces.
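
Here is the cube rule as back-of-envelope arithmetic. It is a sketch, not a power model: it assumes power is exactly proportional to frequency cubed, that both designs have the same IPC, and that the work splits perfectly across cores. Both helper functions are hypothetical names.

```python
def relative_power(freq_ghz, cores=1, exponent=3):
    """Relative power of `cores` identical cores, assuming power ~ f**exponent."""
    return cores * freq_ghz ** exponent


def peak_throughput(freq_ghz, cores=1):
    """Aggregate cycles per nanosecond, assuming equal IPC and perfectly parallel work."""
    return cores * freq_ghz


one_fast_power = relative_power(4.0)
two_slow_power = relative_power(2.0, cores=2)

print("throughput:", peak_throughput(4.0), "vs", peak_throughput(2.0, cores=2))
print("power ratio (two slow / one fast):", two_slow_power / one_fast_power)
```

Under these assumptions the two designs offer the same peak cycle rate, but the two-core version draws a quarter of the power. The single fast core still wins on any work that cannot be split.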

That last clause is the catch, and it is brutal. Doubling the cores does *not* double your speed, because almost no real program is perfectly parallel. The part that must run sequentially — reading input, the bookkeeping between parallel phases, a final merge — caps everything. This is Amdahl's law, and it deserves a worked number.

Amdahl's law:   Speedup(N) = 1 / ( (1 - p) + p/N )
   p = fraction of work that is parallelisable
   N = number of cores

Suppose p = 0.95 (95% parallel, 5% stubbornly serial):

   N = 2     Speedup = 1 / (0.05 + 0.95/2)   = 1.90x
   N = 8     Speedup = 1 / (0.05 + 0.95/8)   = 5.93x
   N = 32    Speedup = 1 / (0.05 + 0.95/32)  = 12.5x
   N = 1000  Speedup = 1 / (0.05 + 0.95/1000) = 19.6x
   N = inf   Speedup = 1 / 0.05               = 20x   <-- HARD CEILING

The 5% serial part alone caps you at 20x, no matter how many cores.

Even a 95%-parallel program can never exceed a 20× speedup — the serial remainder dominates as N grows.
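
If you want to try other values of p and N, the formula is one line of Python. This is a direct transcription of Amdahl's law as written above; only the function name `amdahl_speedup` is an invention for this sketch.

```python
def amdahl_speedup(p, n):
    """Speedup on n cores when a fraction p of the work is parallelisable."""
    return 1.0 / ((1.0 - p) + p / n)


for n in (2, 8, 32, 1000):
    print(f"N = {n:<5} speedup = {amdahl_speedup(0.95, n):.2f}x")

# As N grows without bound, p/N vanishes and only the serial fraction remains.
print(f"ceiling   speedup = {1.0 / (1.0 - 0.95):.2f}x")
```

Try p = 0.99 and p = 0.50 as well: the ceiling moves to 100x and 2x respectively, which shows how sensitive the result is to the serial fraction.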

The new headache: keeping caches honest across cores

Multicore solved a power problem but created a correctness problem. In rung 3 you learned that each core keeps a private L1 cache — a fast local copy of recently used memory. Now picture two cores, each with its own cache, both holding a copy of the same variable `x = 5`. Core 0 writes `x = 6` into *its* cache. Core 1, reading from *its* stale cache, still sees `5`. They now disagree about reality. If software can't trust that all cores see the same memory, parallel programming becomes impossible.

The hardware solution is a cache-coherence protocol, most commonly a flavour of MESI. Every cache line carries a state (Modified, Exclusive, Shared, or Invalid), and the caches coordinate either by snooping a shared bus or by consulting a directory that tracks which caches hold each line. When Core 0 wants to write `x`, it must first acquire exclusive ownership, which invalidates every other core's copy. Core 1's next read then misses, forcing it to fetch the fresh value. Coherence is automatic and invisible to software, but it is not free: it costs bus traffic, energy, and latency.

MESI in action — two cores share variable x (starts: both Shared, x=5)

  Core0                       Core1
  ----                        ----
  read  x   -> S, x=5         read  x  -> S, x=5
  WRITE x=6 -> needs M
     |-- broadcast invalidate -->| line goes I (invalid)
     -> Core0 now M, x=6        | (copy thrown away)
                                read x -> MISS
     |<-- snoop, supply x=6 ----|  Core0 line -> S
                                -> Core1 S, x=6   (now consistent)

The invalidate + refetch is the hidden tax of every shared write.

A write to a shared line forces an invalidate broadcast; the other core's next read misses and refetches. This 'coherence traffic' is why heavily shared data can make more cores run *slower*.
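
The trace above can be replayed with a toy state machine. The sketch below is a heavy simplification, assuming a single cache line, a snooping bus, and write-back of dirty data to memory on a remote read; real protocols add transient states, cache-to-cache transfers, and often extra states such as Owned or Forward. The `Cache` and `SnoopBus` classes are hypothetical names, not any real simulator's API.

```python
class Cache:
    """One core's private cache, reduced to a single line that holds x."""

    def __init__(self, name):
        self.name = name
        self.state = "I"      # one of "M", "E", "S", "I"
        self.value = None


class SnoopBus:
    """Toy MESI model: one cache line, a few caches, one shared bus."""

    def __init__(self, caches, memory_value):
        self.caches = caches
        self.memory = memory_value
        self.misses = 0
        self.invalidations = 0

    def _other_holders(self, cache):
        return [c for c in self.caches if c is not cache and c.state != "I"]

    def read(self, cache):
        if cache.state == "I":                  # miss: must go to the bus
            self.misses += 1
            holders = self._other_holders(cache)
            for other in holders:
                if other.state == "M":          # dirty copy is written back first
                    self.memory = other.value
                other.state = "S"               # M, E and S holders all become Shared
            cache.value = self.memory
            cache.state = "S" if holders else "E"
        return cache.value

    def write(self, cache, value):
        if cache.state == "I":
            self.misses += 1                    # write miss also needs the bus
        if cache.state in ("I", "S"):           # need ownership: invalidate other copies
            for other in self._other_holders(cache):
                if other.state == "M":
                    self.memory = other.value
                other.state, other.value = "I", None
                self.invalidations += 1
        cache.state, cache.value = "M", value   # E -> M upgrade is silent


core0, core1 = Cache("Core0"), Cache("Core1")
bus = SnoopBus([core0, core1], memory_value=5)

bus.read(core0)           # Core0: I -> E (only holder)
bus.read(core1)           # Core0: E -> S, Core1: I -> S
bus.write(core0, 6)       # Core0: S -> M, Core1: S -> I
seen = bus.read(core1)    # Core1 misses; Core0: M -> S, Core1: I -> S
print(seen, core0.state, core1.state, bus.misses, bus.invalidations)
```

One detail the diagram glosses over: in this model the very first read leaves Core0 in Exclusive, and it only drops to Shared when Core1 reads too. Every later write to a Shared line adds to the `invalidations` counter, which is the hidden tax made countable.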

Why this rung matters: the road to specialisation

Step back and watch the strategy shift across three eras. First, architects bought speed with frequency — until the thermal wall stopped them. Then they bought it with width and out-of-order cleverness (ILP) — until the parallelism in ordinary code ran dry. Then they bought it with many cores (multicore) — until dark silicon and Amdahl's law made identical general-purpose cores a losing bet. Each lever, in turn, hit a wall.

Frequency era (~1990–2005): ride Dennard scaling, raise the clock. Ends at the power/thermal wall.
ILP era: go wider and out-of-order to shrink CPI. Ends when ordinary code exposes only a few independent instructions.
Multicore era (~2005 onward): replicate cores for throughput. Ends against Amdahl's law and dark silicon.
Specialisation era (now): fill the dark silicon with domain-specific accelerators that do one job at far better performance-per-watt.

That fourth era is where the next rungs live. When you can't power every transistor and you can't find more parallelism in general code, the winning move is to stop being general. A GPU, a tensor engine, a video codec block — each abandons flexibility to chase enormous efficiency on a narrow task. The history you just read is precisely why the industry pivoted from scaling one kind of brain to building a zoo of domain-specific architectures. Understanding the walls that ILP and multicore hit is the prerequisite for understanding why specialised silicon is the dominant idea of the coming decade.