Fault Models & ATPG: Generating the Test Patterns

From 'is it broken?' to a number you can grade

Imagine you've just received a crate of a million tiny brass padlocks straight from the factory and your job is to ship only the working ones. You can't disassemble each lock. The only thing you can do is try keys in the keyhole and watch whether the shackle pops. So you face a deeper question first: *what kinds of defects can a padlock even have?* A stuck pin? A snapped spring? A cross-threaded screw? Once you list the possible ways a lock can fail, you can design a small set of key-turns that, between them, would reveal every one of those failures. That list of imagined failures is exactly what a fault model is for a chip.

Here is the central trick of all manufacturing test. Physical defects are infinitely varied — a particle landed here, an oxide grew too thin there, a metal line necked down somewhere else. There is no hope of enumerating every microscopic mishap. So we don't try. Instead we replace the messy physics with a small, *countable*, idealized list of logical faults: definite, named misbehaviours of the netlist that we can simulate, target, and tick off one by one. A fault model is the bridge from 'something physical might be wrong' to 'here is a finite list of named things I will hunt for.'

The workhorse: the stuck-at fault

Of all the fault models ever proposed, one has dominated industry for fifty years because it is gloriously simple yet catches an astonishing share of real defects. It is the single [[ic-stuck-at-fault|stuck-at fault]]. The idea: pick any one node in the netlist — an input pin of a gate, an output of a gate, a wire — and pretend it is permanently frozen. Stuck-at-0 (SA0) means that node always reads logic 0 no matter what; stuck-at-1 (SA1) means it is jammed at logic 1. That's the entire model. Each individual node gives you two faults (SA0 and SA1), so a design with N nodes has roughly 2N stuck-at faults to chase.

Why does something so crude work so well? Because a huge family of physical defects *looks like* a stuck node from the logic's point of view. A short to the ground rail pins a wire to 0. A short to the power rail pins it to 1. A broken (open) gate input often floats to a value that behaves like a stuck-at. The stuck-at model is a kind of universal stand-in: chase it diligently and you sweep up most manufacturing reality as a side effect. Decades of silicon have proven the bargain pays off.

A SINGLE STUCK-AT FAULT lives on a node and freezes it:

         good circuit                 faulty circuit (node n SA0)
      a ──┐                         a ──┐
          │ AND ── n ── ...             │ AND ── n  <stuck-at-0> ── ...
      b ──┘                         b ──┘     (n reads 0 forever,
                                                no matter what a,b are)

   SA0  =  node permanently = 0   (e.g. shorted to GND)
   SA1  =  node permanently = 1   (e.g. shorted to VDD)

   N nodes  ->  ~2N stuck-at faults in the whole fault list.
   A 10-million-gate block can host tens of millions of faults.

The stuck-at fault: one node, frozen at 0 or 1. Two faults per node, millions per chip.
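
To make the "two faults per node" count concrete, here is a minimal sketch that builds a stuck-at fault list for a toy netlist. The node names and the `build_fault_list` helper are illustrative assumptions, not part of any real ATPG tool; production flows typically also shrink the list by collapsing equivalent faults, which this sketch ignores.

```python
from itertools import product

# Hypothetical node names for a toy netlist; a real tool would read
# these from the gate-level design instead of a hard-coded list.
NODES = ["a", "b", "c", "d", "g1", "g2", "y"]

def build_fault_list(nodes):
    """Return every (node, stuck_value) pair: one SA0 and one SA1 per node."""
    return list(product(nodes, (0, 1)))

faults = build_fault_list(NODES)
print(f"{len(NODES)} nodes -> {len(faults)} stuck-at faults")
for node, value in faults[:4]:
    print(f"  {node} stuck-at-{value}")
```

Seven nodes give fourteen faults here; the same doubling is what turns a multi-million-gate block into tens of millions of faults.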

Control and observe: the two halves of detection

To catch any fault, a test pattern must do two jobs, and missing either one makes it useless. First it must excite (or *activate*) the fault: force the suspect node to the value *opposite* to where it's stuck, so that a good chip and a faulty chip would now disagree at that node. To test a node for stuck-at-0 you must drive it to 1; for stuck-at-1 you must drive it to 0. Second, the pattern must propagate that disagreement along a path of gates until it reaches an output you can actually watch — a pin or a scan chain flop. A difference that never reaches an observable point is a difference nobody can see.

Those two jobs have names you'll see everywhere: *controllability* is your ability to *set* an internal node to a chosen value from the inputs, and *observability* is your ability to *see* that node's value at an output. A node that is hard to control or hard to observe is hard to test. That is the entire reason scan was invented (last rung): it cranks both up by turning every flip-flop into a directly settable, directly readable test point.
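
The two halves of detection can be checked separately in code. The sketch below assumes a hypothetical three-input circuit, y = (a AND b) OR c, with an internal node n = a AND b; the function names `simulate` and `check` are made up for illustration. It reports whether a pattern excites a stuck-at fault on n and whether the resulting difference is observed at y.

```python
def simulate(a, b, c, n_stuck=None):
    """Evaluate y = (a AND b) OR c; optionally freeze internal node n."""
    n = a & b
    if n_stuck is not None:
        n = n_stuck              # inject the stuck-at fault on n
    y = n | c
    return n, y

def check(a, b, c, n_stuck):
    """Return (excited, observed) for one pattern and one fault on n."""
    good_n, good_y = simulate(a, b, c)
    _, bad_y = simulate(a, b, c, n_stuck)
    excited = good_n != n_stuck  # good value opposes the stuck value
    observed = good_y != bad_y   # the difference reached the output
    return excited, observed

# Three patterns against "n stuck-at-0".
for pattern in [(0, 1, 0), (1, 1, 1), (1, 1, 0)]:
    print(pattern, check(*pattern, n_stuck=0))
```

The first pattern never excites the fault, the second excites it but c = 1 masks it at the OR gate, and only the third does both jobs.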

ATPG: the engine that invents the patterns

Now the headline act. ATPG — automatic test pattern generation — is the EDA tool that takes the gate-level netlist, the fault list, and the scan structure, and *automatically computes* the actual input values that detect each fault. You don't write tests by hand; you press a button and the tool reasons its way to them. For one target fault it must solve a constraint puzzle: choose primary-input and scan-flop values so that (a) the fault site is driven to the activating value, and (b) a sensitized path carries the effect to an output. Then it does this for the next fault, and the next, millions of times.

Classic ATPG algorithms have wonderful names — the D-algorithm (Roth, 1966) introduced the *D* symbol meaning '1 in the good circuit, 0 in the faulty one' (and D̄ for the reverse), so a single character tracks the good/faulty disagreement as it threads through gates. PODEM and FAN made the search far faster by being smart about which decisions to try first. Modern industrial ATPG often hands the hard cases to a SAT solver, encoding 'is this fault detectable?' as a Boolean satisfiability problem. The flavour varies; the goal never does — find inputs that excite-and-propagate, or prove that none exist.
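
One way to see why the D symbol is convenient: treat every signal as a (good, faulty) pair of bits and apply ordinary gate logic to each half. The sketch below is only an illustration of that bookkeeping, with made-up helper names; it is not the D-algorithm itself, and it leaves out the unknown value X that a full implementation also tracks.

```python
# Each signal is a (good-circuit bit, faulty-circuit bit) pair.
ZERO, ONE = (0, 0), (1, 1)
D, D_BAR = (1, 0), (0, 1)
NAMES = {ZERO: "0", ONE: "1", D: "D", D_BAR: "D'"}

def and_(x, y):
    """AND applied independently to the good and faulty halves."""
    return (x[0] & y[0], x[1] & y[1])

def or_(x, y):
    """OR applied independently to the good and faulty halves."""
    return (x[0] | y[0], x[1] | y[1])

def not_(x):
    """Inversion of both halves; turns D into D' and back."""
    return (1 - x[0], 1 - x[1])

print("D AND 1 =", NAMES[and_(D, ONE)])   # D passes through
print("D AND 0 =", NAMES[and_(D, ZERO)])  # D is blocked
print("D OR  0 =", NAMES[or_(D, ZERO)])   # D passes through
print("D OR  1 =", NAMES[or_(D, ONE)])    # D is blocked
print("NOT D   =", NAMES[not_(D)])        # D becomes D'
```

A D survives a gate only when the other input holds the value that does not force the output, which is exactly the condition a sensitized path has to satisfy.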

1. Pick a target fault from the list (say, node g3 stuck-at-1).
2. Excite it: justify the *opposite* value (drive g3 to 0) by choosing input/scan values that force it there.
3. Sensitize a path: pick a route of gates from g3 to an output along which the disagreement (a D) can travel uninterrupted.
4. Solve the constraints together; if they conflict, backtrack and try other choices (this is where the search can get expensive).
5. On success, record the resulting input values as a test pattern, then fault-simulate it to see which *other* faults it happens to catch for free.
6. Drop every newly-detected fault from the list and repeat until the list is empty or remaining faults are proven undetectable.
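
The loop above can be sketched end to end on a circuit small enough to search exhaustively. Loud assumption: the search step here is plain brute force over all input combinations, standing in for the excite-and-propagate reasoning of a real engine such as PODEM or a SAT solver. The circuit (two AND gates feeding an OR) and every name in the code are illustrative.

```python
from itertools import product

INPUTS = ["a", "b", "c", "d"]
NODES = INPUTS + ["g1", "g2", "y"]

def simulate(pattern, fault=None):
    """Return output y for a dict of input bits, with an optional (node, stuck) fault."""
    v = dict(pattern)
    def settle(node, value):
        # A faulty node ignores its computed value and reads the stuck value.
        v[node] = fault[1] if fault and fault[0] == node else value
    for name in INPUTS:
        settle(name, v[name])
    settle("g1", v["a"] & v["b"])
    settle("g2", v["c"] & v["d"])
    settle("y", v["g1"] | v["g2"])
    return v["y"]

def detects(pattern, fault):
    """A pattern detects a fault when good and faulty outputs differ."""
    return simulate(pattern) != simulate(pattern, fault)

def find_test(target):
    """Brute-force stand-in for the excite/propagate search."""
    for bits in product((0, 1), repeat=len(INPUTS)):
        pattern = dict(zip(INPUTS, bits))
        if detects(pattern, target):
            return pattern
    return None  # exhaustive search failed: the fault is undetectable

def generate_patterns():
    remaining = [(node, stuck) for node in NODES for stuck in (0, 1)]
    patterns, untestable = [], []
    while remaining:
        target = remaining[0]
        test = find_test(target)
        if test is None:
            untestable.append(target)
            remaining.remove(target)
            continue
        patterns.append(test)
        # Fault simulation with fault dropping: retire everything this test catches.
        remaining = [f for f in remaining if not detects(test, f)]
    return patterns, untestable

patterns, untestable = generate_patterns()
print(len(patterns), "patterns; untestable faults:", untestable)
```

Because each new pattern is fault-simulated against the whole remaining list, the loop finishes with fewer patterns than faults, which is the payoff of fault dropping.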

A tiny worked example: one pattern, one fault

Let's make it concrete with a circuit you can hold in your head. Two AND gates feed an OR gate: g1 = a AND b, g2 = c AND d, and the output y = g1 OR g2. Suppose we want to test the node g1 for stuck-at-0 — the wire out of the first AND gate is welded to 0. Walk the two halves of detection by hand.

Excite: to expose a stuck-at-0, we must make g1 want to be 1. Since g1 = a AND b, the only way is a = 1 and b = 1. In a good chip g1 = 1; in the faulty chip g1 = 0. We've created the disagreement: that is our D (1/0).

Propagate: the disagreement sits at g1, an input to the OR gate y = g1 OR g2. An OR gate only forwards a change on one input if its *other* input is 0 (because OR-with-1 is always 1, which would mask the difference). So we must set g2 = 0, which we get by making c = 0 (or d = 0). Now y = g1 OR 0 = g1, so y is 1 in a good chip and 0 in the faulty one. The fault is now visible at the output.

Circuit:   g1 = a AND b
           g2 = c AND d
           y  = g1 OR g2

Target fault:  g1  stuck-at-0

Step 1 EXCITE  (drive g1 to the opposite, 1):  a = 1, b = 1
Step 2 PROPAGATE through OR (need other input = 0): c = 0 (d = X)

Resulting TEST PATTERN:   a b c d = 1 1 0 X

             | good chip | faulty chip (g1 SA0)
   g1        |     1     |     0
   g2        |     0     |     0
   y  (out)  |     1     |     0        <-- they DISAGREE
              ^^^^^^^^^^^^^^^^^^^^^^
        observe y: good=1, faulty=0  => fault DETECTED

Apply 1,1,0,X and read y. y=1 -> this fault is absent.
                          y=0 -> the part is bad, scrap it.

One pattern, a b c d = 1 1 0 X, both excites g1 SA0 and propagates it to the output y.
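
The table above can be reproduced with a few lines of simulation. This is a minimal sketch of the worked example only; the `simulate` helper and its `g1_stuck` parameter are illustrative names. It tries both fills of the don't-care bit d to confirm that the result does not depend on it.

```python
def simulate(a, b, c, d, g1_stuck=None):
    """Evaluate the example circuit; optionally freeze node g1."""
    g1 = a & b
    if g1_stuck is not None:
        g1 = g1_stuck            # inject g1 stuck-at-<value>
    g2 = c & d
    y = g1 | g2
    return g1, g2, y

for d in (0, 1):                 # both possible fills of the X on d
    good = simulate(1, 1, 0, d)
    bad = simulate(1, 1, 0, d, g1_stuck=0)
    detected = good[2] != bad[2]
    print(f"d={d}: good (g1,g2,y)={good}  faulty={bad}  detected={detected}")
```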

Notice three things this miniature taught you. The 'X' on d means *don't care*: ATPG leaves bits unconstrained whenever it can, which leaves room to merge patterns or fill the free bits later so that one pattern covers more faults. The 0 we placed on g2 to let g1's change shine through is the OR gate's *non-controlling value*; AND, NAND, OR, and NOR gates each have one (1 for AND and NAND, 0 for OR and NOR), and knowing them is the heart of path sensitization. And the very same pattern, once fault-simulated, also detects several *other* faults along the way (a stuck-at-0 on a or b, for instance). One clean pattern, many faults retired.
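
To see which faults the pattern 1 1 0 X retires while d stays unassigned, a simulator needs a third logic value for "unknown". The sketch below assumes a simple three-valued AND/OR with `None` standing for X; all helper names are illustrative. A fault counts as detected only when the good and faulty outputs are both known and differ.

```python
X = None  # unknown / don't-care

def and3(p, q):
    """Three-valued AND: a 0 on either input decides the output."""
    if p == 0 or q == 0:
        return 0
    if p == 1 and q == 1:
        return 1
    return X

def or3(p, q):
    """Three-valued OR: a 1 on either input decides the output."""
    if p == 1 or q == 1:
        return 1
    if p == 0 and q == 0:
        return 0
    return X

def simulate(pattern, fault=None):
    """Return y for the example circuit, with an optional (node, stuck) fault."""
    v = dict(pattern)
    def settle(node, value):
        v[node] = fault[1] if fault and fault[0] == node else value
    for name in "abcd":
        settle(name, v[name])
    settle("g1", and3(v["a"], v["b"]))
    settle("g2", and3(v["c"], v["d"]))
    settle("y", or3(v["g1"], v["g2"]))
    return v["y"]

pattern = {"a": 1, "b": 1, "c": 0, "d": X}
good_y = simulate(pattern)
detected = []
for node in ["a", "b", "c", "d", "g1", "g2", "y"]:
    for stuck in (0, 1):
        bad_y = simulate(pattern, (node, stuck))
        if X not in (good_y, bad_y) and good_y != bad_y:
            detected.append(f"{node} SA{stuck}")
print(detected)
```

In this toy model the single pattern retires four faults: stuck-at-0 on a, b, g1, and y.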

Coverage, untestables, and faults beyond stuck-at

When ATPG finishes, it reports the prize: fault coverage, the percentage of modeled faults the generated patterns detect. Production chips routinely demand stuck-at coverage of 99% or higher, and for safety-critical automotive or medical parts the bar climbs toward 99.9%+. Why so fierce? Because the gap between 99% and 99.9% is the gap between leaving one modeled fault in a hundred unchecked and one in a thousand. A defective part whose only flaw sits on an unchecked fault passes the test and ships, so the escape rate shrinks roughly in step with that unchecked fraction (how many parts actually escape also depends on process yield). At billions of units, every tenth of a percent is a flood of field failures, recalls, or worse. Coverage is the single number that decides whether a test is good enough to bet a product on.
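
One commonly cited way to turn a coverage number into an estimate of shipped quality is the Williams-Brown model, DL = 1 - Y^(1 - T), where Y is process yield, T is fault coverage, and DL is the fraction of shipped parts that are defective. The sketch below simply evaluates that formula; the 90% yield is an assumed figure chosen for illustration, not data from any real process.

```python
def defect_level(process_yield, fault_coverage):
    """Williams-Brown estimate: DL = 1 - Y ** (1 - T)."""
    return 1.0 - process_yield ** (1.0 - fault_coverage)

ASSUMED_YIELD = 0.90  # illustrative only, not a measured value

for coverage in (0.99, 0.999):
    ppm = defect_level(ASSUMED_YIELD, coverage) * 1e6
    print(f"coverage {coverage:.1%}: about {ppm:.0f} defective parts per million shipped")
```

Under this model, moving from 99% to 99.9% coverage cuts the estimated escapes by roughly a factor of ten at any fixed yield.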

Finally, the stuck-at workhorse is famous but not alone, because modern silicon fails in ways a frozen node can't capture. Transition (delay) faults model a node that *does* switch but too *slowly*: it eventually reaches the right value, yet misses the clock edge, so the chip works at low speed but fails at rated frequency. Catching these needs *at-speed* tests: two rapid vectors that launch a transition and capture the result one fast clock later. Bridging faults model two neighbouring wires shorted together, so that they either fight each other or behave like a wired-AND or wired-OR. And cell-aware or transistor-level models dig inside the standard cell itself to target defects the gate-level netlist can't even see. Each model is another lens; production test plans stack several to push real defect escape rates down to single-digit parts-per-million.
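
The two-vector idea behind a transition test can be sketched on the earlier AND/OR example. This is a deliberately simplified model, stated as an assumption: a slow-to-rise node is treated as still holding its old value when the at-speed capture happens, and real timing is ignored. The function names are illustrative.

```python
def evaluate(a, b, c, d, g1_override=None):
    """Return (g1, y) for g1 = a AND b, g2 = c AND d, y = g1 OR g2."""
    g1 = a & b if g1_override is None else g1_override
    g2 = c & d
    return g1, g1 | g2

def detects_slow_to_rise(v1, v2):
    """True if the vector pair (v1, v2) catches a slow-to-rise fault on g1."""
    g1_before, _ = evaluate(*v1)
    g1_after, good_y = evaluate(*v2)
    launched = (g1_before, g1_after) == (0, 1)   # rising transition at g1
    if not launched:
        return False
    # At-speed capture: the slow node is assumed to still hold its old value.
    _, late_y = evaluate(*v2, g1_override=g1_before)
    return good_y != late_y

print(detects_slow_to_rise((0, 1, 0, 0), (1, 1, 0, 0)))  # launches and propagates
print(detects_slow_to_rise((1, 1, 0, 0), (1, 1, 0, 0)))  # no transition launched
print(detects_slow_to_rise((0, 1, 0, 0), (1, 1, 1, 1)))  # masked by g2 = 1
```

Only the first pair both launches the 0-to-1 transition on g1 and keeps g2 at 0 so the late value reaches y.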