BIST & Memory Test: Letting the Chip Test Itself

The tester that never sleeps: why bring the test inside?

In rung 4 you learned the deterministic way to test a chip: an ATPG tool computes a precise set of test patterns, a tiny program of zeros and ones crafted to catch stuck-at faults, and a room-sized automatic test equipment (ATE) machine shifts them into the scan chains and compares every captured response. It works beautifully, but it has two expensive habits. The patterns must be *stored* in the tester's memory, and they must be *shifted in and out* one bit at a time per scan chain at the tester's modest clock rate. For a chip with tens of millions of scan cells and a thousand patterns, that is gigabits of vector data and seconds of slow shifting per part. Tester time is billed by the second, on a machine that can cost millions of dollars.
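To put rough numbers on those two habits, here is a back-of-envelope sketch. The chain count and shift clock are illustrative assumptions, not figures from any particular tester or design.

```python
# Back-of-envelope ATE cost of scan test. Every number here is an assumption.
scan_cells = 20_000_000      # "tens of millions" of scan cells
patterns = 1_000             # ATPG pattern count
scan_chains = 200            # assumed chains the tester drives in parallel
shift_clock_hz = 50e6        # assumed tester shift clock

# Each pattern needs one stimulus bit and one expected-response bit per cell.
stimulus_bits = scan_cells * patterns
total_bits = 2 * stimulus_bits

# With balanced chains, one pattern costs (cells per chain) shift cycles;
# unloading pattern i overlaps with loading pattern i+1, hence patterns + 1.
cells_per_chain = scan_cells // scan_chains
shift_cycles = cells_per_chain * (patterns + 1)
shift_seconds = shift_cycles / shift_clock_hz

print(f"vector data : {total_bits / 1e9:.0f} Gbit")
print(f"shift time  : {shift_seconds:.1f} s per part")
```

Under these assumptions the script reports 40 Gbit of vector data and about 2.0 s of shifting per part; changing the chain count or shift clock moves the result proportionally.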

Built-in self-test, or [[ic-built-in-self-test|BIST]], asks a liberating question: what if the chip carried its own tester on board? Instead of streaming patterns from outside, a small block of logic *generates* the stimulus internally and a second small block *compacts* the responses into a single short signature. The expensive ATE shrinks to a babysitter — it powers the chip up, pulses a 'go' signal, waits, and reads back a handful of bits: pass or fail. The stimulus is born and consumed on the die, at the die's own speed.

Logic BIST: a dice machine and a fingerprint reader

The classic architecture for testing random logic is called STUMPS (Self-Test Using MISR and Parallel SRSG, where the SRSG, or shift-register sequence generator, is the pattern source that feeds many scan chains in parallel), and it has just two clever pieces bolted onto your existing scan infrastructure. The first is the dice machine. ATPG hand-crafts each bit; logic BIST does the opposite, flooding the scan chains with pseudo-random stimulus from a linear-feedback shift register (LFSR). An LFSR is a chain of flip-flops with a few XOR taps feeding back to its input; clock it and it marches through a long, repeatable, statistically-random-looking sequence of states. A maximal-length n-bit LFSR cycles through all 2ⁿ−1 non-zero values before repeating; the all-zero state is excluded because XOR feedback can never leave it. It is a deterministic dice roll: cheap to build, and it never needs to store a single vector.
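The LFSR described above can be modelled in a few lines. This is a minimal software sketch, assuming a Fibonacci-style register whose bit positions are numbered 1..width and whose tapped bits are XORed into position 1. The function names `lfsr_states` and `period` are illustrative, not taken from any tool.

```python
def lfsr_states(seed, taps, width):
    """Yield successive states of a Fibonacci LFSR.

    Bit positions are numbered 1..width. `taps` lists the positions that are
    XORed together to form the bit shifted in at position 1.
    """
    mask = (1 << width) - 1
    state = seed & mask
    if state == 0:
        raise ValueError("an all-zero seed locks up an XOR-feedback LFSR")
    while True:
        yield state
        feedback = 0
        for t in taps:
            feedback ^= (state >> (t - 1)) & 1
        state = ((state << 1) | feedback) & mask


def period(seed, taps, width):
    """Count clocks until the seed state recurs.

    Assumes `width` is one of the taps, so the state map is invertible and
    the sequence is guaranteed to return to the seed.
    """
    gen = lfsr_states(seed, taps, width)
    first = next(gen)
    count = 1
    for state in gen:
        if state == first:
            return count
        count += 1


# x^4 + x^3 + 1 is primitive, so the 4-bit register visits all 15 non-zero states.
print(period(0b0001, (4, 3), 4))   # 15
# x^4 + x^2 + 1 is not primitive: the sequence repeats early.
print(period(0b0001, (4, 2), 4))   # 6
```

The second call shows why tap choice matters: a poorly chosen feedback polynomial gives a much shorter sequence and leaves most of the state space unexplored.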

The second piece is the fingerprint reader. After each pattern propagates through the logic, the captured responses scan out — but instead of comparing every bit against a golden value (which would mean storing those golden bits, defeating the whole point), BIST funnels the entire stream into a multiple-input signature register (MISR). A MISR is an LFSR with extra XOR inputs; it churns the millions of response bits down into one short *signature*, typically 16 or 32 bits. This squeezing is called response compaction. At the end of the run you compare just that one signature against the single expected value. If even one response bit was wrong anywhere in the millions, the avalanche of XOR feedback almost always corrupts the final signature — so a 32-bit signature catches a faulty chip with all but a vanishing 1-in-4-billion probability of a lucky escape.
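A small software model makes the compaction concrete. The sketch below assumes a 16-bit MISR built on the same shift-and-XOR structure as an LFSR, with taps (16, 14, 13, 11) and one response word XORed in per clock. The random "responses" are stand-in data, not real captured values.

```python
import random


def misr_signature(words, width=16, taps=(16, 14, 13, 11), seed=0):
    """Compact a stream of `width`-bit response words into one signature."""
    mask = (1 << width) - 1
    state = seed & mask
    for word in words:
        feedback = 0
        for t in taps:                      # same XOR taps as an LFSR
            feedback ^= (state >> (t - 1)) & 1
        state = (((state << 1) | feedback) ^ word) & mask
    return state


rng = random.Random(1)                      # stand-in for captured responses
good = [rng.getrandbits(16) for _ in range(10_000)]
golden = misr_signature(good)

one_flip = list(good)
one_flip[1234] ^= 1 << 7                    # a single wrong response bit
print(misr_signature(one_flip) != golden)   # True: a lone bit error cannot cancel

aliased = list(good)
aliased[500] ^= 0b01                        # error enters at bit position 1 ...
aliased[501] ^= 0b10                        # ... and is cancelled one shift later
print(misr_signature(aliased) == golden)    # True: a crafted error pair escapes
```

The second case is aliasing: a particular combination of errors that happens to cancel inside the register. For error patterns that behave as if random, roughly one in 2^width lands back on the golden signature, which is where the 1-in-4-billion figure for 32 bits comes from.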

Logic BIST (STUMPS) data flow:

  +--------+   pseudo-random fill    +-----------------+
  |  LFSR  |------------------------>| scan chain 1    |
  | (PRPG) |------------------------>| scan chain 2    |--+
  +--------+        ...              |   ...           |  |
      ^                              | scan chain k    |  |
      | seed                         +-----------------+  |
      |                                  | (combinational |
      |                                  |  logic under   |
      |                                  |  test)         |
      |                                  v                |
  +--------+   <-- compacted bits   +-----------------+   |
  |  MISR  |<-----------------------| captured resp.  |<--+
  +--------+                        +-----------------+
      |
      v
  32-bit SIGNATURE  ==  golden value ?   -> PASS / FAIL

LFSR example (4-bit, taps at bits 4 and 3, x^4 + x^3 + 1):
  next_in = q4 XOR q3
  q <= {q3, q2, q1, next_in}   // q = {q4,q3,q2,q1}: shift toward q4, feed in at q1
  // the bit shifted out (q4) must be one of the taps, or states are lost
  cycles through 15 non-zero states before repeating

MISR: same shift+XOR taps, PLUS each response bit XOR'd in:
  q[1] <= next_in ^ response[1]
  q[i] <= q[i-1]  ^ response[i]     (i = 2..n)
  -> millions of bits collapse into one short fingerprint

An LFSR sprays pseudo-random stimulus into the scan chains; a MISR compacts every captured response into one signature.

Random vs deterministic: the coverage gap and at-speed payoff

Pseudo-random stimulus is cheap, but it is not free of consequences. Most faults fall quickly — the first few hundred random patterns sensitise the easy nodes — but a stubborn minority are random-pattern-resistant. Think of a 10-input AND gate buried deep in the logic: to test a stuck-at-0 on its output you need all ten inputs high at once, a one-in-1024 chance per random shot, and the logic feeding those inputs may make even that probability far lower. Deterministic ATPG would simply *solve* for the one vector that does it; random BIST may need a million patterns and still miss it. This is the central trade: ATPG buys you very high fault coverage with few, expensive-to-store patterns; logic BIST buys you cheap, storage-free patterns that plateau at lower coverage unless you help them.
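The arithmetic behind "random-pattern-resistant" can be sketched directly. Assume each pattern independently detects a given fault with probability p; then N patterns miss it with probability (1 − p)^N, and reaching a confidence c requires N ≥ ln(1 − c) / ln(1 − p). Independence is a simplifying assumption (LFSR patterns are only pseudo-random), and `patterns_needed` is an illustrative helper name.

```python
import math


def patterns_needed(p_detect, confidence):
    """Random patterns needed so that P(at least one detects) >= confidence."""
    # log1p(-p) computes ln(1 - p) accurately even for very small p.
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-p_detect))


def miss_probability(p_detect, n_patterns):
    """Chance that n independent random patterns all fail to detect the fault."""
    return (1.0 - p_detect) ** n_patterns


# 10-input AND, output stuck-at-0: all ten inputs must be 1, so p = 2**-10.
print(patterns_needed(2**-10, 0.99))                 # a few thousand patterns
# If upstream logic pushes the effective probability down to 2**-20:
print(patterns_needed(2**-20, 0.99))                 # several million patterns
print(round(miss_probability(2**-20, 1_000_000), 2)) # still a sizeable escape chance
```

The jump from thousands to millions of patterns for one fault is why such faults are handled with test points or deterministic top-up patterns instead of simply running the LFSR longer.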

Engineers close that gap with two tricks. Test-point insertion adds a few control and observe points into those random-resistant cones so the dice can finally roll a winning combination — a small area cost that lifts coverage from, say, 92% to 99%. And hybrid / top-up patterns let ATPG generate a short deterministic set aimed precisely at the few faults BIST missed, sometimes encoded as LFSR seeds (reseeding) so they still ride the BIST machinery. The modern practice is rarely pure-anything: random BIST mops up the bulk, deterministic ATPG cleans the corners.

Memory BIST: why a million identical cells need a different test

Now point the same idea at an SRAM and the random dice machine falls apart. A memory is not a tangle of unique logic cones; it is an enormous, ruthlessly regular grid — millions of identical 6T bitcells sharing bitlines and wordlines, read through delicate sense amplifiers. Its failures are physical and *neighbourly*: a cell that can be written 0 but not 1, two cells shorted so writing one flips the other (a coupling fault), an address-decoder that lands on the wrong row, or a cell that slowly leaks its value away. Random patterns would waste effort and still miss these structured defects. Memory demands a stimulus that is the opposite of random — a precise, *algorithmic* sweep across every address in a deliberate order.

That algorithmic sweep is a march test. A march is a sequence of *march elements*, each of which visits every address in a defined order (ascending ⇑ or descending ⇓) performing a fixed pattern of reads and writes. The famous March C- runs six elements and catches the standard fault menagerie — stuck-at, transition, coupling, and address-decoder faults — in just 10n operations, where n is the number of cells. Because the work is linear in n, even a multi-megabit RAM finishes in milliseconds. A dedicated memory BIST (MBIST) controller — a small finite-state machine of address counters and data generators — drives this entirely on-chip, at full speed, with no vectors to store and no tester pins fighting their way to a deeply-buried memory port.

March C-  (10n operations, n = number of cells)

  element 0:  ⇕ (w0)        write 0 to every cell, any order
  element 1:  ⇑ (r0, w1)    ascending:  read 0, then write 1
  element 2:  ⇑ (r1, w0)    ascending:  read 1, then write 0
  element 3:  ⇓ (r0, w1)    descending: read 0, then write 1
  element 4:  ⇓ (r1, w0)    descending: read 1, then write 0
  element 5:  ⇕ (r0)        read 0 from every cell

  ⇑ ascending addresses   ⇓ descending   ⇕ either direction
  r0 = read expecting 0    w1 = write 1

What each catches:
  stuck-at        : any cell that won't hold both 0 and 1
  transition (TF) : a cell that can't make a 0->1 or 1->0 flip
  coupling (CF)   : writing cell A disturbs cell B  (the up/down
                    passes expose order-dependent coupling)
  address decoder : a write that lands on the wrong row/col

MBIST controller (sketch):
  FSM: for each march element
         for addr in (up | down):
           apply read/write ops; compare read data vs expected
           if mismatch -> log {addr, bit} to fail/repair register

March C-: a deterministic, address-ordered read/write sweep that catches memory's structured faults in 10n operations.
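The march sequence is easy to try out against a toy memory model. The sketch below assumes a bit-wide RAM with two injectable fault types, a stuck-at cell and an inversion coupling fault in which any transition on an aggressor cell flips a victim. `FaultyMemory` and `run_march` are illustrative names, and the model says nothing about real SRAM timing or analog behaviour.

```python
UP, DOWN = +1, -1

# March C-: six elements, 10 operations per address in total.
MARCH_C_MINUS = [
    (UP,   [("w", 0)]),              # either direction is allowed; UP chosen
    (UP,   [("r", 0), ("w", 1)]),
    (UP,   [("r", 1), ("w", 0)]),
    (DOWN, [("r", 0), ("w", 1)]),
    (DOWN, [("r", 1), ("w", 0)]),
    (UP,   [("r", 0)]),              # either direction is allowed; UP chosen
]


class FaultyMemory:
    """Bit-wide RAM model with optional injected faults (illustrative only)."""

    def __init__(self, size, stuck_at=None, coupling=None):
        self.cells = [0] * size
        self.stuck_at = stuck_at or {}    # {addr: value the cell always reads}
        self.coupling = coupling or {}    # {aggressor addr: victim addr}

    def write(self, addr, value):
        old = self.cells[addr]
        self.cells[addr] = value
        # Inversion coupling: a transition on the aggressor flips the victim.
        if old != value and addr in self.coupling:
            self.cells[self.coupling[addr]] ^= 1

    def read(self, addr):
        return self.stuck_at.get(addr, self.cells[addr])


def run_march(mem, size, elements=MARCH_C_MINUS):
    """Run a march test; return (list of (addr, expected) fails, op count)."""
    fails, ops = [], 0
    for direction, operations in elements:
        addresses = range(size) if direction == UP else range(size - 1, -1, -1)
        for addr in addresses:
            for kind, value in operations:
                ops += 1
                if kind == "w":
                    mem.write(addr, value)
                elif mem.read(addr) != value:
                    fails.append((addr, value))
    return fails, ops


N = 16
print(run_march(FaultyMemory(N), N))                       # ([], 160): clean, 10n ops
print(run_march(FaultyMemory(N, stuck_at={5: 1}), N)[0])   # fails reported at address 5
print(run_march(FaultyMemory(N, coupling={3: 7}), N)[0])   # fails reported at victim 7
```

Note that the coupling fault is reported at the victim address, not the aggressor: the march test tells you which cell read back wrong, which is exactly the information a repair scheme needs.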

Built-in self-repair: finding the bad cell is only half the job

Here is where memory test pulls a trick logic cannot. On a modern SoC, memories may occupy more than half the die area, and they are built at the bleeding edge of density — so they are the *first* thing to catch a stray particle or a marginal cell. If a single faulty bit out of ten million condemned the whole chip, yield would crater. So memory designers do something audacious: they build the array bigger than it needs to be, with a few spare rows and columns held in reserve. When MBIST finds a failing cell, the chip can *route around it*.

The mechanism is [[ic-redundancy-repair|memory redundancy and repair]], and it can run two ways. In *hard repair*, a companion block called BISR (built-in self-repair) takes the failing addresses MBIST logged, computes which spare rows or columns to swap in, and blows a set of one-time [[ic-efuse|eFuses]] so the remapping is burned in permanently — the part leaves the factory already fixed. In *soft repair*, the repair signature is recomputed and reloaded at every power-on, so a chip can heal newly-developed weak cells in the field. The result is profound: a chip that diagnoses its own defects and reconfigures itself to hide them, with the only outside help being the power supply.

Test. The MBIST controller runs its march algorithm across the array at speed and records every failing address and bit.
Analyse. The BISR engine reads the fail log and solves a small allocation problem: can the failures be covered by the available spare rows and columns?
Repair. If yes, it programs the remap — blowing eFuses for a permanent fix, or loading registers for a power-on soft repair — and the array now answers from the spares instead of the bad cells.
Re-verify. MBIST runs once more on the repaired array; only a clean pass lets the part be shipped. Failures beyond the spares' reach mark the die as a genuine reject.
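The "Analyse" step above is a small covering problem, and a brute-force version fits in a few lines. This is a sketch under stated assumptions: fails are given as (row, column) coordinates, a spare row or column fixes every failing bit it covers, and the spare count is small enough for exhaustive search. Real redundancy-analysis hardware uses more economical heuristics; `allocate_spares` is an illustrative name.

```python
from itertools import combinations


def allocate_spares(fails, spare_rows, spare_cols):
    """Return (rows, cols) to replace, or None if the array is unrepairable.

    `fails` is an iterable of (row, col) failing-bit coordinates.
    """
    fails = set(fails)
    fail_rows = sorted({r for r, _ in fails})
    # Try using 0, 1, 2, ... spare rows; whatever they leave uncovered
    # must fit within the available spare columns.
    for k in range(min(spare_rows, len(fail_rows)) + 1):
        for rows in combinations(fail_rows, k):
            leftover_cols = {c for r, c in fails if r not in rows}
            if len(leftover_cols) <= spare_cols:
                return set(rows), leftover_cols
    return None


# Three fails share row 2, one more sits in column 5: one row + one column suffice.
print(allocate_spares({(2, 5), (2, 9), (2, 11), (7, 5)}, spare_rows=1, spare_cols=1))
# Three fails on a diagonal need three spares in total: unrepairable with two.
print(allocate_spares({(0, 0), (1, 1), (2, 2)}, spare_rows=1, spare_cols=1))
```

The first call returns row 2 and column 5 as the repair; the second returns `None`, the case in which the die is a genuine reject.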

Putting it together: the chip that wakes up and checks itself

Step back and the two halves snap into one picture. A modern SoC is a sea of logic interleaved with dozens or hundreds of memory macros. Logic BIST handles the irregular logic with pseudo-random LFSR stimulus and MISR compaction, topped up by deterministic ATPG where the dice can't reach. Memory BIST handles each regular array with deterministic march algorithms, backed by BISR repair so a few bad cells never sink the part. Both share the same philosophy you met at the top: move the test source and the response judge onto the die, hand the tester a one-bit verdict, and run at the chip's own speed.

The payoff is exactly the expensive habits we set out to break. Pattern storage shrinks toward nothing — an LFSR seed and a march algorithm replace gigabits of vectors. Slow shifting through the ATE gives way to full-speed on-chip clocking, so tester time per part drops and at-speed delay faults get caught. And because the self-test logic is permanent, the chip keeps the ability for life: a power-on self-test in your laptop, a periodic in-drive check in a hard disk, a safety-critical re-test every key-cycle in a car. The cost is honest — BIST and repair logic eat a few percent of area and must be carefully verified, X-controlled and timing-closed. But for the dense, deep, safety-conscious chips of today, letting the silicon grade its own homework is no longer a luxury. It is the default.