Chiplets & Heterogeneous Integration

Why split a chip up?

Every rung below this one has been a fight against one wall or another. Power density stopped holding steady when Dennard scaling ended around 2006, so we could no longer just crank the clock. Wires got slower relative to transistors as interconnect scaling stalled. And the cost of a new process node kept climbing even as Moore's Law limped on. For decades the reflex answer to "I need more chip" was the same: draw a bigger single die, one monolithic block of silicon holding the CPU cores, the cache, the I/O, the analog, everything. This guide is about why that reflex finally broke, and the alternative much of the industry has pivoted to.

The alternative is disaggregation: take the one big integrated circuit and split its functions across several smaller dies, called chiplets, then wire them tightly together inside one package so they still behave, electrically and to software, like a single product. Think of it as the difference between casting one enormous, flawless engine block and bolting together a set of well-made modules. The modules are easier to build, easier to test, and, crucially, you can swap one without re-casting the whole thing. The rest of this guide is the *why* behind that trade, one limit at a time.

Yield & the cost of big dies

Here is the limit that makes disaggregation an economic *necessity*, not just a nice idea. When you print a wafer, random defects land on it: a stray particle, a flaw in the photolithography, a bad spot in a metal layer. To a first approximation those defects are sprinkled uniformly across the wafer's area. So the bigger each die is, the higher the chance that at least one defect lands *inside* it and kills it. Yield falls as die area grows, and it falls faster than linearly: in the simplest (Poisson) model the fraction of good dies shrinks exponentially with area, and area itself grows as the square of the die's edge.

Flip that around and you get the chiplet insight: cut one big die into four smaller ones and a defect now ruins only the *one* small die it landed on, not the whole product. The three good neighbours survive. You also waste less silicon at the wafer's round edge, where big rectangular dies get clipped. The same defect density, sliced finer, turns a costly low-yield monster into several high-yield pieces, and yield is one of the biggest factors in what a chip costs to make.
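As a rough sketch of that arithmetic, the snippet below uses the simplest textbook yield model, in which defects follow a Poisson distribution and a die is good only if it catches none. The defect density and die areas are illustrative assumptions, not figures from any real process, and the comparison ignores packaging cost, die-to-die interface area, and wafer-edge loss.

```python
import math

def poisson_yield(area_mm2: float, defects_per_mm2: float) -> float:
    """Fraction of dies that catch zero defects under a simple Poisson model."""
    return math.exp(-area_mm2 * defects_per_mm2)

# Illustrative assumptions only, not data from any real process.
DEFECT_DENSITY = 0.001   # defects per mm^2 (0.1 per cm^2)
BIG_DIE_AREA = 800.0     # mm^2, one monolithic die
NUM_CHIPLETS = 4
chiplet_area = BIG_DIE_AREA / NUM_CHIPLETS

big_yield = poisson_yield(BIG_DIE_AREA, DEFECT_DENSITY)
chiplet_yield = poisson_yield(chiplet_area, DEFECT_DENSITY)

# Average silicon that must be fabricated to end up with one working product,
# assuming bad chiplets are discarded individually before assembly.
silicon_per_good_monolith = BIG_DIE_AREA / big_yield
silicon_per_good_chiplet_set = NUM_CHIPLETS * chiplet_area / chiplet_yield

print(f"monolithic yield:  {big_yield:.1%}")
print(f"per-chiplet yield: {chiplet_yield:.1%}")
print(f"silicon per good monolith:    {silicon_per_good_monolith:.0f} mm^2")
print(f"silicon per good chiplet set: {silicon_per_good_chiplet_set:.0f} mm^2")
```

With these assumed numbers the monolithic die yields roughly 45% and each chiplet roughly 82%, so the chiplet version burns far less silicon per working product. The gap narrows or widens with the defect density you assume.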

  ONE BIG MONOLITHIC DIE                SAME FUNCTION, SPLIT INTO CHIPLETS
  (one defect kills the lot)            (a defect kills only its own tile)

  +-----------------------------+       +---------+   +---------+
  |        x  (defect)          |       |  CPU x  |   |  CPU    |   x = defect
  |   CPU   CPU   CPU   CPU     |       | (scrap) |   |  (good) |
  |                             |       +---------+   +---------+
  |   CACHE        I/O          |  -->  +---------+   +---------+
  |                             |       |  CACHE  |   |   I/O   |
  |   ANALOG       MEM-CTRL     |       | (good)  |   |  (good) |
  +-----------------------------+       +---------+   +---------+
  big area -> low yield, all-or-nothing  small dies -> high yield, lose 1 tile
      [ ============ one package: chiplets sit side-by-side ============ ]

One large die is all-or-nothing: a single defect scraps everything. Split the same function into chiplets and a defect costs you only the tile it hit — the good tiles still ship, assembled together in one package.

Known-good die

Splitting into small dies only pays off if you can throw away the bad ones *before* you spend money assembling them. That is the idea of a known-good die: each chiplet is fully tested on its own — powered up, exercised, screened across voltage and temperature — and certified working *before* it is ever placed into a package. You assemble only from a bin of proven-good parts.

Why this matters is brutal arithmetic. Suppose you bond four untested dies into one package and just one is faulty: the whole expensive assembly, including the three good dies and the costly packaging step, is scrap. The more dies you stack or place, the more savage this gets, because the individual yields multiply: the chance that the package works is the product of the chances that each die works, so it shrinks with every die you add. With known-good die testing you break that chain, because every part entering assembly has already passed. This is also why advanced packaging is inseparable from the chiplet idea: the package is now a place where tested dies are joined, so the test step has to come first.
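The compounding is easy to see in a few lines. This is a minimal sketch that assumes die failures are independent and uses a hypothetical 90% yield for each untested die; `package_yield` is an illustrative helper, not part of any tool.

```python
import math

def package_yield(die_yields, assembly_yield=1.0):
    """Probability that every die works and the assembly step succeeds.

    Assumes failures are independent, so the probabilities multiply.
    """
    return math.prod(die_yields) * assembly_yield

# Hypothetical yield of each die when it is bonded without testing.
UNTESTED_DIE_YIELD = 0.90

for n_dies in (1, 2, 4, 8, 16):
    y = package_yield([UNTESTED_DIE_YIELD] * n_dies)
    print(f"{n_dies:2d} untested dies -> package yield {y:.1%}")

# With known-good die, every part entering assembly already works,
# so only the assembly step itself (assumed 98% here) can fail.
print(f"4 known-good dies -> {package_yield([1.0] * 4, assembly_yield=0.98):.1%}")
```

Under these assumptions four untested dies already drag the package yield down to about 66%, and sixteen push it below 20%, which is why screening has to happen before bonding.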

1. Fabricate each chiplet on its own wafer, on whatever process best suits it.
2. Test every die individually, at speed and across voltage and temperature, and tag the ones that pass.
3. Assemble the package using only known-good die, so a faulty part never reaches the costly bonding step.
4. Test the finished multi-die package as a whole, to catch anything the assembly itself introduced.
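The flow above can be turned into a rough cost comparison. The sketch below is a toy model under stated assumptions: all costs and yields are made-up illustrative values, failures are independent, and the die-level test is assumed to catch every faulty die, which real screens only approximate.

```python
def cost_without_screening(n_dies, die_cost, die_yield,
                           assembly_cost, assembly_yield):
    """Average cost per working package when dies are bonded untested."""
    cost_per_attempt = n_dies * die_cost + assembly_cost
    good_fraction = (die_yield ** n_dies) * assembly_yield
    return cost_per_attempt / good_fraction

def cost_with_known_good_die(n_dies, die_cost, test_cost, die_yield,
                             assembly_cost, assembly_yield):
    """Average cost per working package when every die is tested first."""
    # Fabricate and test every die, but keep only the ones that pass.
    cost_per_good_die = (die_cost + test_cost) / die_yield
    # Each assembly attempt consumes only known-good dies.
    cost_per_attempt = n_dies * cost_per_good_die + assembly_cost
    # The final package test still rejects failures introduced by assembly.
    return cost_per_attempt / assembly_yield

# Hypothetical numbers, in arbitrary currency units.
untested = cost_without_screening(
    n_dies=4, die_cost=20.0, die_yield=0.90,
    assembly_cost=50.0, assembly_yield=0.98)
screened = cost_with_known_good_die(
    n_dies=4, die_cost=20.0, test_cost=2.0, die_yield=0.90,
    assembly_cost=50.0, assembly_yield=0.98)

print(f"cost per good package, untested dies:   {untested:.2f}")
print(f"cost per good package, known-good dies: {screened:.2f}")
```

With these inputs the screened flow comes out cheaper even though it pays for an extra test on every die, because a bad die is thrown away alone instead of taking three good dies and a packaging step with it.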

Mixing nodes

Disaggregation unlocks something a monolithic die can never do: each chiplet can be built on the process node that suits it best. This is heterogeneous integration — mixing different technologies in one package. On a single die, *everything* is forced onto the same node, even the parts that gain nothing from it.

And many parts gain nothing. Fast logic genuinely benefits from the newest, most expensive node: its transistors are smaller and switch faster, whether they are FinFETs or the newer gate-all-around nanosheets. But SRAM bit cells have nearly stopped shrinking from one node to the next, so a large cache takes almost as much area on the newest node as on the previous one; analog and I/O circuits often work *better* on older, cheaper, well-understood nodes; and a power or RF block has no reason to ride the bleeding edge at all. Forcing them all onto a single leading-edge die means paying the highest price per square millimetre for circuits that don't want it. Split them into chiplets and you put the dense logic on the latest CMOS node while leaving cache, analog, and I/O on mature processes: best-fit, block by block.
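A back-of-the-envelope comparison makes the price argument concrete. Everything in this sketch is hypothetical: the block areas, the two node names, and the cost per square millimetre are placeholders, and it assumes each block occupies the same area on either node while ignoring yield, packaging, and die-to-die interface overhead.

```python
from dataclasses import dataclass

@dataclass
class Block:
    name: str
    area_mm2: float
    best_fit_node: str

# Hypothetical cost of finished silicon per mm^2 on each node.
COST_PER_MM2 = {"leading_edge": 0.30, "mature": 0.08}

# Hypothetical floorplan of one product.
BLOCKS = [
    Block("compute logic", 150.0, "leading_edge"),
    Block("SRAM cache", 100.0, "mature"),
    Block("I/O + analog", 120.0, "mature"),
]

def monolithic_cost(blocks, node):
    """Every block is forced onto the same node."""
    return sum(b.area_mm2 for b in blocks) * COST_PER_MM2[node]

def heterogeneous_cost(blocks):
    """Each block is built on the node that suits it best."""
    return sum(b.area_mm2 * COST_PER_MM2[b.best_fit_node] for b in blocks)

mono = monolithic_cost(BLOCKS, "leading_edge")
mixed = heterogeneous_cost(BLOCKS)
print(f"all on leading edge: {mono:.1f}")
print(f"best-fit per block:  {mixed:.1f}")
```

The saving comes entirely from the blocks that were overpaying for a node they did not need. A fuller model would add the packaging cost back in, so treat the gap as an upper bound under these assumptions.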

  HETEROGENEOUS PACKAGE: each chiplet on its best-fit node

  +-------------+   +-------------+   +-------------+
  | COMPUTE     |   | COMPUTE     |   |   I/O +     |
  | (logic)     |   | (logic)     |   |   ANALOG    |
  | leading-edge|   | leading-edge|   | mature node |
  | GAA node    |   | GAA node    |   | (cheaper)   |
  +-------------+   +-------------+   +-------------+
        |                 |                 |
  +-------------------------------------------------+
  |    SRAM CACHE chiplet (cells barely shrink)     |
  +-------------------------------------------------+
   pay top dollar only where it actually buys speed

Heterogeneous integration: spend the expensive leading-edge node only on the logic that benefits, and leave cache, analog, and I/O on cheaper mature nodes — all stitched together in one package.

UCIe: a standard socket

Once a product is several chiplets, a new question appears: *how do they talk to each other?* For years every vendor invented its own private die-to-die interface, which meant chiplets from different companies could not be mixed. That is the limit UCIe — Universal Chiplet Interconnect Express — sets out to remove. It is an open, industry-standard specification for the die-to-die link: the electrical signalling, the physical bump layout, and the protocol all defined in common.

The analogy that captures it: before UCIe, connecting two chiplets was like wiring two devices with a custom, soldered cable you designed yourself. UCIe turns that into a standard socket, a USB-style agreement under which any compliant chiplet can plug into any compliant neighbour. A short, dense link inside the package replaces the long board traces of the old world, so two chiplets can exchange data at near on-die bandwidth and at a fraction of the energy per bit that a board-level link would cost. Physically, those links may run across an ordinary organic package substrate or a silicon interposer, or the dies may be joined face-to-face by hybrid bonding, where two dies are bonded copper-pad to copper-pad with no solder bumps at all.
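The bandwidth and energy claim reduces to two multiplications. The sketch below is illustrative only: the lane count, transfer rate, and energy-per-bit figures are assumptions chosen for round numbers, not values quoted from the UCIe specification or any product.

```python
def link_bandwidth_gbps(lanes: int, gigatransfers_per_s: float) -> float:
    """Raw one-direction bandwidth, assuming one bit per lane per transfer."""
    return lanes * gigatransfers_per_s

def link_power_watts(bandwidth_gbps: float, picojoules_per_bit: float) -> float:
    """Power spent moving the data: (Gb/s) * (pJ/bit) gives milliwatts."""
    return bandwidth_gbps * picojoules_per_bit * 1e-3

# Hypothetical in-package die-to-die link.
bandwidth = link_bandwidth_gbps(lanes=64, gigatransfers_per_s=16.0)

# Assumed energy per bit for a short in-package link versus a long board trace.
in_package_watts = link_power_watts(bandwidth, picojoules_per_bit=0.5)
board_level_watts = link_power_watts(bandwidth, picojoules_per_bit=5.0)

print(f"bandwidth: {bandwidth:.0f} Gb/s")
print(f"in-package link power:  {in_package_watts:.2f} W")
print(f"board-level link power: {board_level_watts:.2f} W")
```

Because power scales linearly with energy per bit, whatever ratio you assume between the two link types carries straight through to the power bill at a given bandwidth.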

The chiplet economy

Put the pieces together and chiplets stop being just a packaging trick and become a *business model*. When dies are small, separately tested, mixed across nodes, and joined over a standard socket, a chiplet turns into a reusable building block — a Lego brick rather than a one-off casting. A design team can build a compute chiplet once and drop it into a dozen different products, pairing it with different I/O or memory chiplets each time, instead of taping out a fresh monolithic die for every variant.
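One way to picture the reuse is to treat each product as a recipe of chiplet counts. The product names, chiplet names, and counts below are all hypothetical; the sketch only counts how many distinct dies have to be designed compared with one monolithic tape-out per product.

```python
from collections import Counter

# Hypothetical product line: each product is a recipe of chiplet counts.
PRODUCTS = {
    "laptop":      {"compute": 1, "io_client": 1},
    "desktop":     {"compute": 2, "io_client": 1},
    "edge_ai":     {"compute": 1, "accelerator": 1, "io_client": 1},
    "workstation": {"compute": 4, "io_server": 1},
    "server":      {"compute": 8, "io_server": 1},
    "ai_box":      {"compute": 2, "accelerator": 2, "io_server": 1},
}

def unique_chiplet_designs(products):
    """Distinct dies that must be designed and taped out."""
    return {name for recipe in products.values() for name in recipe}

def reuse_count(products):
    """Number of products each chiplet design appears in."""
    return Counter(name for recipe in products.values() for name in recipe)

designs = unique_chiplet_designs(PRODUCTS)
monolithic_tapeouts = len(PRODUCTS)  # one fresh die per product variant

print(f"chiplet designs needed:    {len(designs)}")
print(f"monolithic tape-outs:      {monolithic_tapeouts}")
print(f"products using each design: {dict(reuse_count(PRODUCTS))}")
```

In this toy line-up four chiplet designs cover six products, and the compute chiplet appears in every one of them, which is the "build once, drop in many times" effect in miniature.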

This is also the natural home for domain-specific architecture: a CPU chiplet, a GPU/NPU accelerator chiplet, and an I/O chiplet can each be designed by whoever does that job best, then mixed and matched into a custom product — the package, not the die, becomes the unit of design. It echoes the same modular logic you met one rung down, where a place-and-route flow assembles a chip out of pre-built standard cells; chiplets simply raise that idea up a level, from cells inside a die to whole dies inside a package.