When Pipelines Stall: Hazards, Forwarding and Branch Prediction

The assembly line that lies a little

In rung 1 we cut instruction execution into five stages — fetch, decode, execute, memory, write-back — and let five instructions ride the conveyor belt at once. The headline number was glorious: an instruction pipeline with depth five promises a throughput of one instruction per clock cycle (1 IPC), even though any single instruction still takes five cycles to crawl through. That is the same magic as a car factory: each car takes a day to build, yet a finished car rolls off the line every few minutes.

But the factory analogy hides a lie. A car door does not need to know what colour the *next* car's engine will be. Instructions are not so polite. Instruction B often needs a number that instruction A is still computing; a branch instruction decides whether C, D and E should even exist. When the line is forced to wait, we call the event a pipeline hazard, and every hazard drags the real IPC below that pretty 1.0.

Three ways a pipeline gets stuck

Architects sort hazards into three families, and the names tell you exactly what is being fought over.

Structural hazard — two instructions want the *same hardware* in the same cycle. Imagine fetch and memory-access both reaching for a single memory port. The clean fix is to not share: split instruction and data memory (the classic Harvard-style L1 caches), or add a second read port to the register file. Mostly an architect's problem, solved once at design time.
Data hazard — an instruction needs a *value* that an earlier, not-yet-finished instruction is producing. This is the everyday hazard, and the rest of this guide mostly hunts it.
Control hazard — the pipeline does not yet know *which instruction comes next* because a branch has not resolved. Fetch is greedy and never wants to stop, so it must guess. Guess wrong and you throw work away.

A data hazard, traced cycle by cycle

Let's make it painfully concrete. Take two RISC-style instructions where the second reads what the first just wrote:

  ADD  x3, x1, x2     # x3 = x1 + x2
  SUB  x5, x3, x4     # x5 = x3 - x4   <-- needs the new x3

SUB reads x3, which ADD has not finished computing.

Now lay them on the pipeline grid. Each instruction marches IF → ID → EX → MEM → WB, one stage per cycle, driven by the clock. The ADD does not actually *store* its new x3 until its WB stage in cycle 5. But the SUB tries to *read* x3 during its ID stage in cycle 3 — two cycles too early.

cycle:     1    2    3    4    5    6
ADD x3:   IF   ID   EX   MEM  WB
SUB x5:        IF   ID   EX   MEM  WB
                    ^         ^
                    |         x3 finally written here (WB)
                    SUB reads x3 here (ID) -- STALE!

The classic read-after-write (RAW) hazard: the reader catches up to the writer.

The naive fix is to simply stall: freeze SUB (and everything behind it) until ADD's value is safely written, then let it go. Stalls are inserted as bubbles — empty stages that do nothing. Correct, but each bubble is a wasted cycle, and in this case we would waste two of them. Do this on every dependent pair and your 1.0 IPC collapses toward 0.5 or worse.

Forwarding: ship the answer early

Here is the clever observation. The ADD has *already computed* x3 by the end of its EX stage in cycle 3 — the number is sitting right there in the ALU output latch. The only reason SUB has to wait is the bookkeeping rule that says "values live in the register file." So break the rule. Run a private wire — a bypass — straight from the ALU output back to the ALU input, and feed SUB the fresh x3 directly, skipping the register file entirely. This trick is called forwarding (or bypassing).

cycle:     1    2    3    4    5
ADD x3:   IF   ID   EX---MEM  WB
SUB x5:        IF   ID   EX   MEM  WB
                        ^
                        x3 forwarded EX->EX, no stall!

With a bypass path, SUB grabs x3 in cycle 4 with zero bubbles.

A small forwarding unit watches the pipeline: "Does the instruction in EX need a source register that an instruction in MEM or WB is about to produce? If so, mux that fresh value in instead of the stale register read." It is a tiny controller of comparators and multiplexers, and it rescues the overwhelming majority of data hazards for free.

Branches: gambling on the future

Data hazards are local squabbles. Control hazards threaten the whole pipeline's right to exist. When the machine hits a conditional branch — `if (x > 0) goto L` — it cannot know which way to go until the comparison resolves, deep into EX. But fetch will not idle; it has to load *something* into IF on the very next cycle. Which address? The branch target, or the fall-through? It is genuinely undecided.

Stalling until the branch resolves is correct but brutal: in a 5-stage pipe that is roughly 2–3 dead cycles on *every* branch, and real code branches once every five or six instructions. So instead the CPU predicts the outcome and speculatively fetches down the guessed path. This is branch prediction, and on modern cores it is astonishingly good.

Static prediction — a fixed rule baked in, e.g. "backward branches are taken" (loops jump back, so this is right most of the time). Cheap, no learning.
Dynamic prediction — a small table of saturating counters indexed by branch address. Each entry, often a 2-bit state machine (strongly/weakly taken/not-taken), remembers how that branch behaved last time and updates after the real outcome.
Modern predictors — TAGE, perceptron-based, and neural-style designs correlate long histories of recent branches and routinely exceed 95–99% accuracy, which is what keeps deep pipelines alive.

The price of a wrong guess

Prediction is a gamble, and gambling has a downside. When the branch finally resolves in EX and the guess was wrong, every instruction fetched after the branch is garbage — work done on a road that does not exist. The pipeline performs a flush: it squashes those speculative instructions into bubbles, redirects fetch to the correct address, and starts over. The wasted cycles are the misprediction penalty.

Branch resolves in EX (stage 3). On a miss, IF and ID work is trashed:

BR:    IF   ID   EX*  ...        (* = outcome known, was MISpredicted)
wrong: --   IF   ID  <-- FLUSHED (squashed to bubbles)
wrong: --   --   IF  <-- FLUSHED
correct target:       IF   ID   EX ...   (refetch starts here)

  Penalty ~= pipeline stages before resolve  (here ~2 cycles)
  Deep CPU (15-20 stages): 15-20 wasted cycles per miss!

A flush turns mispredicted work into bubbles; deeper pipes pay far more.

This is why pipeline depth is a double-edged sword, and it pins down a real design tension. A simple 5-stage core loses only a couple of cycles per miss. A high-frequency, 15-to-20-stage core can lose 15–20 cycles every time, so even a 2% miss rate quietly bleeds performance. We can express the damage as an effective cost on the instruction stream:

Average CPI  =  1.0 (ideal)
             +  branch_frequency x mispredict_rate x penalty

Example (deep core):
   branch_frequency = 0.20   (1 in 5 instructions)
   mispredict_rate  = 0.05   (95% accurate predictor)
   penalty          = 18 cycles

   added CPI = 0.20 x 0.05 x 18 = 0.18
   -> ~18% slower than the 1.0-CPI dream, from branches alone.

Even a 'good' 95% predictor leaves real money on the table in a deep pipe.