Why Chips Need Testing: Defects, Yield & DFT

A perfect design is not a perfect chip

Imagine you have spent two years designing a processor. Every line of Verilog is verified, every timing path closes, the layout passes every check. You send the design to the foundry — this is called tapeout — and weeks later the wafers come back. You would like to believe that because the design is correct, every chip on those wafers is correct too. That belief is wrong, and understanding why it is wrong is the reason this entire track exists.

A modern chip is built by stacking and patterning dozens of microscopically thin layers of semiconductor, metal and insulator onto a silicon wafer. The smallest features are a few nanometres wide — far thinner than a virus. Building something that small, billions of times over, in a process with hundreds of steps, is a manufacturing miracle. But miracles are not perfect. Sometimes a step goes slightly wrong, and the result is a physical flaw on the silicon called a defect.

What goes physically wrong on the wafer

To feel why testing is unavoidable, picture the four most common ways silicon goes bad. None of them is a logic mistake; all of them are physics.

A particle. A speck of dust just a fraction of a micron across lands on the wafer during one of the patterning steps. Where it sits, the pattern is ruined — like a hair landing on a photo before it is printed. One particle can kill one die.
An open. A wire that should carry a signal is broken — a hairline crack in a metal line, or a via that didn't connect between two layers. Now part of the chip is electrically dead, stuck and unable to change.
A short. Two wires that should be separate are accidentally joined — a sliver of leftover metal bridges them. Now they fight each other, and a node that should be free to swing between 0 and 1 is jammed.
Process variation. Even with no obvious flaw, no two transistors come out identical. Thicknesses, doping and widths vary slightly across the wafer. Most variation is harmless, but at the edges a transistor can be just slow enough — or leaky enough — that the chip fails to meet spec.
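To make the effect of an open or a short concrete, here is a minimal Python sketch. It assumes a hypothetical three-input circuit (`circuit`, not taken from any real design) and models the defect as an internal node jammed at a fixed value via an illustrative `stuck` parameter. It then lists the input patterns for which the damaged chip answers differently from a good one.

```python
from itertools import product

def circuit(a, b, c, stuck=None):
    """Tiny illustrative circuit: out = (a AND b) OR c.

    `stuck` optionally forces the internal node n1 (the AND output)
    to 0 or 1, mimicking an open or a short that jams the node.
    """
    n1 = a & b
    if stuck is not None:
        n1 = stuck  # the defect overrides whatever the logic computed
    return n1 | c

# Input patterns where the defective chip disagrees with a good chip.
detecting = [
    (a, b, c)
    for a, b, c in product((0, 1), repeat=3)
    if circuit(a, b, c) != circuit(a, b, c, stuck=0)
]
print(detecting)  # [(1, 1, 0)]
```

Only one of the eight possible patterns exposes this particular defect: the inputs must both drive the damaged node to the opposite value and let the difference reach the output. Every other pattern makes the bad chip look healthy.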

Yield: the chip's quiet economic engine

Because defects are random and unavoidable, only a fraction of the dies on each wafer actually work. That fraction is called yield. Yield is not a side detail — it is the number that decides whether a chip is profitable. A wafer costs the same to manufacture whether 95% of its dies are good or 40% are good, so every dead die is pure loss spread across the survivors.

300 mm wafer, 10 mm x 10 mm dies        ->  roughly 700 dies/wafer (before edge loss)
Wafer cost (advanced node)              ~  $15,000

If yield = 90%:   630 good dies   ->  $15000 / 630  =  ~$24 per good die
If yield = 50%:   350 good dies   ->  $15000 / 350  =  ~$43 per good die
If yield = 20%:   140 good dies   ->  $15000 / 140  =  ~$107 per good die

Same silicon, same design. Yield alone moves the cost ~4.5x.

Yield is leverage: the cost of a good chip is the wafer cost divided by how many dies survive. This is why foundries obsess over defect density.
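The arithmetic above can be reproduced with a short Python sketch. The wafer cost, die size and the rounded figure of 700 dies per wafer come from the example; the function names are illustrative, and `gross_dies` is only an upper-bound estimate that ignores partial dies at the wafer edge.

```python
import math

WAFER_COST_USD = 15_000
DIES_PER_WAFER = 700  # rounded figure used in the example above

def gross_dies(diameter_mm, die_side_mm):
    """Upper bound on die count: wafer area / die area (ignores edge loss)."""
    wafer_area = math.pi * (diameter_mm / 2) ** 2
    return int(wafer_area // die_side_mm ** 2)

def cost_per_good_die(wafer_cost, dies_per_wafer, yield_fraction):
    """Wafer cost spread over only the dies that work."""
    return wafer_cost / (dies_per_wafer * yield_fraction)

print(gross_dies(300, 10))  # 706, which the example rounds to 700

for y in (0.9, 0.5, 0.2):
    cost = cost_per_good_die(WAFER_COST_USD, DIES_PER_WAFER, y)
    print(f"yield {y:.0%}: ${cost:.2f} per good die")
```

Because yield sits in the denominator, the cost curve steepens as yield falls: dropping from 90% to 50% adds about $19 per die, while dropping from 50% to 20% adds about $64.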

Now here is the part that motivates test directly. The cost of a defect does not stay flat: it grows by roughly an order of magnitude at every stage the defect escapes past. This rough industry heuristic is sometimes called the rule of ten. Catching a bad die at wafer test might cost cents; if it slips through and gets packaged, you've thrown away the package too; if it lands on a circuit board, you scrap the board; and if it reaches a customer in a car or a phone, the cost of the recall can be thousands of times the price of the chip.
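As a rough illustration of that escalation, the sketch below multiplies a starting cost by ten at each stage. Both the 10-cent starting cost and the exact factor of ten are assumptions chosen for illustration, not measured data, and the function name is hypothetical.

```python
# Stages a defective die can escape to, in the order described above.
STAGES = ["wafer test", "packaged part", "circuit board", "field (customer)"]

def escape_cost_cents(base_cents=10, factor=10):
    """Cost of catching the same defect at each successive stage.

    base_cents and factor are illustrative assumptions.
    """
    return {stage: base_cents * factor ** i for i, stage in enumerate(STAGES)}

costs = escape_cost_cents()
for stage, cents in costs.items():
    print(f"{stage:18s} ${cents / 100:>8.2f}")
```

Under these assumptions a defect that costs $0.10 to catch at wafer test costs $100 once it reaches the field, three stages later.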

Why you can't just poke the pins

So we must test every die. The obvious idea is: connect the chip to a tester (automatic test equipment, or ATE), apply some inputs to the pins, watch the outputs, and check that they match what a good chip would produce. For a chip with a handful of gates, that works beautifully. For a modern chip, it falls apart completely.

The trouble is scale. A modern chip can hold billions of transistors but only a few hundred to a few thousand pins. The pins are your only doors into the chip, and there are billions of rooms inside. Worse, those rooms are connected in deep chains: to make a flip-flop buried twenty logic levels deep take a specific value, you'd have to find a sequence of pin inputs that, after rippling through all the logic in front of it, happens to set it — and then find some other input combination that lets the result of that flip-flop ripple all the way back out to a pin where you can see it.

      pins (a few hundred)
        |   |   |
   +----v---v---v------------------------------+
   |  [logic] -> [FF] -> [logic] -> [FF] -> ... |   billions of
   |     ^                  ^                   |   internal nodes,
   |   how do I FORCE this node to 1?           |   buried deep
   |              and SEE this node's value?     |
   +--------------------------------------------+
        |   |   |
      pins (a few hundred)

  Controllability:  can I set an internal node to the value I want?
  Observability:    can I see an internal node's value at a pin?
  From the pins alone, for most deep nodes the answer is: barely.

The controllability/observability problem: a few pins, billions of hidden nodes. Testing from the outside is like inspecting a skyscraper through its mail slots.

Setting an internal node to the value you want is called controllability; seeing an internal node's value from outside is called observability. From the pins alone, deep nodes have almost none of either. The number of input patterns you'd need grows exponentially with the number of inputs and internal state bits, far beyond what any tester could apply in a reasonable time. Pure pin-poking simply does not scale, and that wall is the central problem this track is built to climb. We name it properly in the next rung: controllability & observability.
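To put a number on that explosion, here is a back-of-the-envelope Python sketch. It assumes the simplest possible case, a purely combinational block with n inputs tested exhaustively, and a hypothetical tester that applies one billion patterns per second. Real chips also hold internal state, which makes the picture worse, not better.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def exhaustive_test_years(num_inputs, patterns_per_second=1e9):
    """Years needed to apply every one of the 2**n input patterns once.

    patterns_per_second is an assumed tester rate, for illustration only.
    """
    patterns = 2 ** num_inputs
    return patterns / patterns_per_second / SECONDS_PER_YEAR

for n in (32, 64, 100):
    print(f"{n:3d} inputs: {exhaustive_test_years(n):.3g} years")
```

Under these assumptions, 32 inputs take a few seconds, 64 inputs take roughly 585 years, and 100 inputs take on the order of 10^13 years. Each extra input doubles the time.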

Design for Test: building the chip so it can be tested

If you can't test a billion-transistor chip from the outside, the answer is radical but simple: change the chip. Add circuitry whose only job is to make those buried nodes reachable. This philosophy is Design for Test, almost always shortened to DFT. Crucially, DFT is decided *while you design the chip*, not afterward — you spend a small slice of silicon area and a few percent of performance up front, in exchange for being able to test the chip at all.

The flagship DFT trick — which you'll meet in detail in later rungs — is to quietly rewire the chip's memory elements so that during test mode they can all be chained together into one long shift register. Suddenly you can shift any pattern you like *into* every internal flip-flop (perfect controllability) and shift their captured values back *out* to a pin (perfect observability), all through just a couple of extra pins. The deep, hidden nodes become as reachable as the pins themselves.
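The idea can be sketched as a toy Python model. Everything here is illustrative: the `ScanChain` class, its method names, and the four-bit `logic` function are assumptions made for the example, not a description of any real tool or chip. The model shifts a chosen pattern into every flip-flop, lets the logic respond for one clock, then shifts the captured result back out.

```python
class ScanChain:
    """Flip-flops chained into one shift register in test mode (toy model)."""

    def __init__(self, length):
        self.flops = [0] * length  # flops[0] is nearest scan-in

    def shift(self, scan_in_bit):
        """One test-mode clock: a bit enters at scan-in, one leaves at scan-out."""
        scan_out_bit = self.flops[-1]
        self.flops = [scan_in_bit] + self.flops[:-1]
        return scan_out_bit

    def load(self, desired):
        """Shift in a full pattern so that flops == desired afterwards."""
        for bit in reversed(desired):
            self.shift(bit)

    def capture(self, logic):
        """One functional clock: every flop captures the logic's response."""
        self.flops = logic(self.flops)

    def unload(self):
        """Shift everything out; returns the values in flop order."""
        out = [self.shift(0) for _ in range(len(self.flops))]
        return out[::-1]

def logic(s):
    """Hypothetical combinational logic sitting between the flip-flops."""
    return [s[0] ^ s[1], s[1] & s[2], s[2] | s[3], 1 - s[3]]

chain = ScanChain(4)
chain.load([1, 0, 1, 1])   # controllability: set every flop directly
chain.capture(logic)       # let the real logic respond for one clock
response = chain.unload()  # observability: read every flop back out
print(response)            # [1, 0, 1, 0]
```

A tester would compare `response` against the value a defect-free chip is expected to produce. The price is time: loading and unloading each take one clock per flip-flop in the chain.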