From Instructions to Pipelines: How a Processor Actually Runs Code

The tower of abstraction: from a transistor to a thought

Hold a modern phone in your hand and you are holding one of the most layered objects humans have ever built. At the very bottom is the transistor: a microscopic switch, just a voltage flipping a channel on or off. Wire a handful of them together in the right pattern and you get a logic gate that computes a single bit of Boolean logic: AND, OR, NOT. Stack gates into adders and registers, stack those into a processor, etch tens of billions of transistors onto one integrated circuit, and somewhere up that tower the raw physics of charge starts to behave like *arithmetic*, then like *code*, then like a video call with your grandmother. Computer architecture is the study of one crucial floor in that tower.

Why does that one floor deserve its own discipline? Because it is the floor where two enormous worlds have to meet. Below it live the circuit and chip engineers, who think in voltages, gate delays, and silicon area. Above it live the programmers, who think in variables, loops, and functions. Neither side wants to know the other's details — a JavaScript developer should never have to care how a transistor switches, and a chip designer should be free to rebuild the silicon completely without breaking a single existing program. Architecture is the diplomatic layer that lets both be true at once.

The ISA: a contract no one is allowed to break

How do you let two worlds meet without forcing either to know the other? You write a contract. In computing that contract is the instruction set architecture, or ISA. It is a precise, published promise that says: *here is the complete list of instructions this processor understands, exactly what each one does, what registers exist, and how numbers are laid out in memory.* x86, ARM, and RISC-V are the famous ones. The ISA is deliberately a thin description of behaviour, never of construction — it tells you *what* the machine does, and pointedly refuses to say *how*.

That refusal is the whole genius. Because the ISA freezes only the behaviour, Intel and AMD can build wildly different chips that both run x86; Apple can redesign its ARM cores generation after generation, raising performance each time, while your apps, compiled years ago, keep running untouched. The contract is the only thing both sides agree on, and it lets the hardware underneath be reinvented endlessly. An instruction itself is just a binary number: a pattern of bits that the chip's decode logic, itself a small piece of Boolean machinery, splits into an operation and its operands.

  C source           one ISA instruction        the actual bits
  ----------         -------------------        ---------------
  z = x + y;   --->   ADD  R3, R1, R2     --->   0000000 00010 00001 000 00011 0110011
                      "add R1 and R2,            (a 32-bit RISC-V word the
                       put result in R3"          decoder will pick apart)

  The ISA is the middle column: a fixed vocabulary the compiler
  targets and the hardware promises to obey. Change the silicon
  freely -- just keep honouring this column.

The ISA sits between the language you write and the bits the chip runs. The compiler turns your code into ISA instructions; the hardware turns those into voltages — and the ISA is the only fixed point.
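The right-hand column of the diagram can be reproduced in a few lines. Here is a minimal Python sketch that packs the six fields of a RISC-V R-type instruction into one 32-bit word. The helper names `encode_r_type` and `format_fields` are invented for illustration, and the sketch assumes R1, R2, and R3 stand for RISC-V registers x1, x2, and x3.

```python
def encode_r_type(funct7, rs2, rs1, funct3, rd, opcode):
    """Pack the six R-type fields into one 32-bit word (hypothetical helper)."""
    return (
        (funct7 << 25)
        | (rs2 << 20)
        | (rs1 << 15)
        | (funct3 << 12)
        | (rd << 7)
        | opcode
    )


def format_fields(word):
    """Render a 32-bit word as space-separated fields of width 7-5-5-3-5-7."""
    bits = f"{word:032b}"
    widths = [7, 5, 5, 3, 5, 7]
    fields, start = [], 0
    for width in widths:
        fields.append(bits[start:start + width])
        start += width
    return " ".join(fields)


# ADD R3, R1, R2 -> funct7=0, rs2=2, rs1=1, funct3=0, rd=3, opcode=0110011
add_word = encode_r_type(0b0000000, 2, 1, 0b000, 3, 0b0110011)
print(format_fields(add_word))  # 0000000 00010 00001 000 00011 0110011
print(hex(add_word))            # 0x2081b3
```

Note that the second source register sits to the left of the first in the bit layout, which is why the fields read 00010 then 00001 even though the assembly lists R1 before R2.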

Fetch, decode, execute: the heartbeat of every CPU

Strip a processor down to its soul and you find a loop so simple a child could follow it, repeated billions of times a second. The CPU keeps one special pointer, the program counter, that holds the memory address of the next instruction. Then forever it does three things: fetch the instruction at that address, decode it to figure out what it commands, and execute it, whether that means doing arithmetic, touching memory, or jumping elsewhere. The program counter then moves on to the next instruction, or to the jump's target if the instruction was a taken jump, and the loop runs again. That is the fetch-decode-execute cycle, and it is the literal heartbeat of computing.
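The loop is short enough to write out. What follows is a toy sketch, not a model of any real ISA: instructions are Python tuples rather than binary words, and the three operations (`add`, `jump`, `halt`) and the function name `run` are assumptions made up for this example.

```python
def run(program, registers):
    """Toy fetch-decode-execute loop over tuple 'instructions' (illustrative only)."""
    pc = 0  # program counter: index of the next instruction to fetch
    while True:
        instr = program[pc]      # fetch
        op, *args = instr        # decode: split operation from operands
        pc += 1                  # by default, advance to the following instruction
        if op == "add":          # execute
            dst, a, b = args
            registers[dst] = registers[a] + registers[b]
        elif op == "jump":
            pc = args[0]         # a jump overrides the default advance
        elif op == "halt":
            return registers
        else:
            raise ValueError(f"unknown op: {op}")


program = [
    ("add", "R3", "R1", "R2"),   # R3 = R1 + R2
    ("jump", 3),                 # skip the next instruction
    ("add", "R3", "R3", "R3"),   # never executed
    ("halt",),
]
result = run(program, {"R1": 5, "R2": 7, "R3": 0})
print(result["R3"])  # 12
```

The only state that steers the loop is `pc`. Every control-flow feature of a real processor, from loops to function calls, comes down to writing something other than "next" into that one pointer.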

Let's make it physical with our ADD. The instruction lives in memory as a binary word; the processor's datapath is the network of wires, gates, and storage that ferries it through those three steps. The fast scratchpad it computes on is the register file — a small bank of registers, each holding one word, far quicker to reach than main memory. A classic teaching design splits the cycle into five neat stages, and following ADD through them is the clearest way to see a CPU think.

IF — Instruction Fetch. The program counter points at our ADD; the processor reads that 32-bit word out of instruction memory and increments the counter to point at the next instruction.
ID — Instruction Decode. Decode logic reads the bit pattern as Boolean fields: this is an ADD, its sources are registers R1 and R2, its destination is R3. Those two source registers are read out of the register file at the same time.
EX — Execute. The two values flow into the arithmetic logic unit (ALU) — a block of adders built from gates — which adds them and produces the sum. This is where the actual computing happens.
MEM — Memory access. A plain ADD doesn't touch memory, so it idles through this stage. (A load or store instruction would read or write data memory here — which is exactly why the stage exists.)
WB — Write Back. The sum is written back into destination register R3 in the register file. The instruction is complete; the result is now visible to whatever instruction needs R3 next.
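The five stages above can be sketched as five small functions. This is a simplified model under stated assumptions: it understands only the R-type ADD, it treats instruction memory as a dictionary keyed by byte address, and it ignores details such as RISC-V's hard-wired zero register. All function names are hypothetical.

```python
def fetch(imem, pc):
    """IF: read the word at pc and compute the next pc (4 bytes per instruction)."""
    return imem[pc], pc + 4


def decode(word, regs):
    """ID: slice the R-type fields out of the word and read both source registers."""
    opcode = word & 0x7F
    rd = (word >> 7) & 0x1F
    funct3 = (word >> 12) & 0x7
    rs1 = (word >> 15) & 0x1F
    rs2 = (word >> 20) & 0x1F
    funct7 = (word >> 25) & 0x7F
    if (opcode, funct3, funct7) != (0b0110011, 0, 0):
        raise NotImplementedError("this sketch only understands ADD")
    return rd, regs[rs1], regs[rs2]


def execute(a, b):
    """EX: the ALU adds, wrapping to 32 bits as a 32-bit register would."""
    return (a + b) & 0xFFFFFFFF


def memory_access(value):
    """MEM: ADD has no memory work, so the value passes through unchanged."""
    return value


def write_back(regs, rd, value):
    """WB: store the result in the destination register."""
    regs[rd] = value


imem = {0: 0x002081B3}       # the ADD R3, R1, R2 word, placed at address 0
regs = [0] * 32
regs[1], regs[2] = 40, 2     # sample values for R1 and R2

word, next_pc = fetch(imem, 0)
rd, a, b = decode(word, regs)
total = memory_access(execute(a, b))
write_back(regs, rd, total)
print(rd, regs[3], next_pc)  # 3 42 4
```

Each function hands its result to the next and keeps nothing for itself. That clean hand-off is what makes it possible to run the stages side by side.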

The pipeline: a laundromat for instructions

Run instructions one at a time and you waste almost everything. While the ALU is busy in the EX stage, the instruction-fetch hardware sits idle; while WB writes a result, the ALU naps. It's like doing laundry by washing one load, drying it completely, folding it, and only *then* starting the next load — the washer stands cold the whole time the dryer runs. Nobody does laundry that way. You start washing load two the instant load one moves to the dryer. Every machine stays busy, and a stack of laundry that would take all day in sequence is done in a fraction of the time.

That is exactly the instruction pipeline. Split the datapath into the five stages, drop a tiny register between each pair to hold the in-progress instruction, and let all five stages work at once — each on a *different* instruction. As ADD enters EX, the instruction behind it enters ID, and a fresh one enters IF. The instructions march through the stages in lockstep, one step every clock tick, like cars down an assembly line. No single ADD finishes faster — it still takes five stages — but a finished instruction now *rolls off the end every single cycle* instead of every five.

  Cycle:      1     2     3     4     5     6     7     8     9
            +-----------------------------------------------------+
  ADD   R3  | IF | ID | EX |MEM | WB |                            |
  SUB   R6  |    | IF | ID | EX |MEM | WB |                       |
  AND   R8  |    |    | IF | ID | EX |MEM | WB |                  |
  OR    R9  |    |    |    | IF | ID | EX |MEM | WB |             |
  XOR  R10  |    |    |    |    | IF | ID | EX |MEM | WB |        |
            +-----------------------------------------------------+
            ^                 ^
          fill-up         steady state: ONE instruction finishes
          (latency)       every clock cycle (throughput)

  5 instructions, no pipeline:  5 x 5 = 25 cycles
  5 instructions, pipelined:    9 cycles   (~2.8x faster, and
                                 the gap only grows with more)

Five instructions overlapping in a 5-stage pipeline. After the pipe fills, a result emerges every cycle — the same hardware, the same per-instruction latency, but up to ~5x the throughput.
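The cycle counts in the diagram follow from one small formula: an ideal pipeline with k stages finishes n instructions in k + n - 1 cycles, against k x n without overlap. The Python sketch below rebuilds the schedule and the counts. It assumes a perfect pipeline with no stalls, and the function names are made up for illustration.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_schedule(instructions, stages=STAGES):
    """Return {instruction: {cycle: stage}} for an ideal pipeline with no stalls."""
    schedule = {}
    for i, name in enumerate(instructions):
        # Instruction i enters the first stage in cycle i + 1 (cycles are 1-based).
        schedule[name] = {i + 1 + s: stage for s, stage in enumerate(stages)}
    return schedule


def total_cycles(n_instructions, n_stages, pipelined):
    """Cycles to finish n instructions, with or without overlap."""
    if pipelined:
        return n_stages + n_instructions - 1  # fill the pipe once, then one per cycle
    return n_stages * n_instructions


instrs = ["ADD", "SUB", "AND", "OR", "XOR"]
sched = pipeline_schedule(instrs)
for name in instrs:
    row = [sched[name].get(cycle, "--") for cycle in range(1, 10)]
    print(f"{name:4}", " ".join(f"{cell:3}" for cell in row))

for n in (5, 100, 1_000_000):
    serial = total_cycles(n, 5, pipelined=False)
    piped = total_cycles(n, 5, pipelined=True)
    print(n, serial, piped, round(serial / piped, 2))
```

For five instructions the ratio is 25 / 9, about 2.8. As n grows, the fixed fill-up cost of k - 1 cycles matters less and the ratio approaches the stage count, 5.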

Why speed is two numbers, not one: frequency × IPC

Ask a beginner why one chip is faster than another and the answer is usually a single number: gigahertz. That's half the truth, and the missing half is where modern design lives. The whole pipeline marches to one drumbeat — the clock signal, a square wave ticking at a fixed frequency. Each tick, every instruction advances one stage. A 3 GHz clock ticks three billion times a second, so its period is just one-third of a nanosecond. Faster clock, more ticks per second, more instructions pushed through. So far, so gigahertz.

But there's a second lever, just as powerful: how many instructions you finish *per* tick. Our simple pipeline manages, at best, one per cycle, an IPC (instructions per cycle) of 1. A superscalar processor fetches and executes several instructions every cycle, and a wide one can reach an IPC of 4 or more on code with enough independent work. Throughput is the product of the two, frequency times IPC, and a famous identity nails down the running time: program time equals instruction count, times cycles per instruction, times the time per cycle. All three terms can be pushed on: the ISA and the compiler shape the instruction count, the pipeline design sets cycles per instruction, and the circuits set the clock period.

      Execution time  =  N_instructions  x  CPI  x  T_clock
                      =  N_instructions  x  CPI  / f_clock

      where  CPI = cycles per instruction (= 1 / IPC)
             f_clock = clock frequency

  Worked example -- a program of 1 billion instructions:

    Chip A: f = 4 GHz, IPC = 1     -> time = 1e9 x 1   / 4e9 = 0.250 s
    Chip B: f = 3 GHz, IPC = 2     -> time = 1e9 x 0.5 / 3e9 = 0.167 s

  Chip B is SLOWER in gigahertz yet 1.5x FASTER overall --
  because it does more work per tick. Frequency is only half
  the story; IPC is the other half.

The performance equation. The lower-frequency chip wins because higher IPC more than compensates — which is exactly why the gigahertz race quietly gave way to the IPC race.
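The worked example is easy to check by hand, and just as easy to check in code. This small Python sketch evaluates the same identity for both chips; the function names are invented for illustration, and IPC is treated as a fixed average over the whole program, which real workloads only approximate.

```python
def execution_time(n_instructions, ipc, f_clock_hz):
    """Execution time in seconds: N x CPI / f, with CPI = 1 / IPC."""
    cpi = 1.0 / ipc
    return n_instructions * cpi / f_clock_hz


def clock_period_ns(f_clock_hz):
    """Clock period in nanoseconds for a given frequency in hertz."""
    return 1e9 / f_clock_hz


chip_a = execution_time(1e9, ipc=1, f_clock_hz=4e9)
chip_b = execution_time(1e9, ipc=2, f_clock_hz=3e9)

print(f"Chip A: {chip_a:.3f} s")                        # Chip A: 0.250 s
print(f"Chip B: {chip_b:.3f} s")                        # Chip B: 0.167 s
print(f"Speedup of B over A: {chip_a / chip_b:.2f}x")   # Speedup of B over A: 1.50x
print(f"3 GHz period: {clock_period_ns(3e9):.3f} ns")   # 3 GHz period: 0.333 ns
```

Try changing Chip A's IPC to 1.5: the two chips then tie at 0.167 s, which shows how directly the two levers trade against each other.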

This is why, around 2005, clock speeds stopped climbing the way they had for decades. Cranking the frequency higher drew power, and produced heat, faster than chips could shed it: a wall every later rung will return to. So designers pivoted to the other lever: more work per cycle, more cores, smarter pipelines, deeper instruction-level parallelism. Every one of those tricks is, at bottom, a way to get more instructions finished per tick without melting the chip. The base machine you've now met (ISA, cycle, pipeline, frequency, IPC) is the thing all of them optimize.

The base machine, and the road ahead

Step back and see what you now hold. You can trace a line of code down through the tower: from a statement, to an ISA instruction, to a binary word, to electrons flipping transistors in gates. You can walk that instruction through fetch-decode-execute, watch it ripple across five datapath stages, read and write the register file, and understand why overlapping those stages in a pipeline multiplies throughput. And you can reason about speed with the right two-handed grip: clock frequency and IPC together. That is the base machine — the simple, honest CPU that every spectacular optimization is bolted onto.