3D Stacking and Chiplets: Building Systems from Pieces

The wall that side-by-side hit

In rung 3 you learned to put dies *next to* each other on an interposer: the 2.5D world, where an HBM stack sits beside a GPU and they talk through thousands of fine wires in the silicon underneath. That trick bought us enormous bandwidth. But notice the geometry: even when two dies are millimetres apart, a signal must travel *out* of one die, *across* the interposer, and *into* the other. Millimetres are an eternity at gigahertz speeds. Every extra millimetre of wire adds capacitance, which means more energy per bit and more delay.

So engineers asked a deceptively simple question: what if, instead of placing dies side by side, we placed them *face to face* or stacked them like pancakes? Then the path between transistors on two different dies could be a few microns, not a few millimetres — a thousand times shorter. That is the leap from 2.5D to true vertical integration, the 3D IC.
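To put rough numbers on that geometry, here is a minimal sketch that treats wire capacitance as proportional to length and estimates the energy drawn from the supply to charge the wire once as C·V². The capacitance per millimetre, the voltage swing, and the two path lengths are illustrative assumptions, not figures from this lesson, and the model ignores driver and pad capacitance.

```python
# Illustrative assumptions only: real values depend on the process and wire.
CAP_PER_MM_F = 0.2e-12   # assumed wire capacitance: 0.2 pF per millimetre
SWING_V = 0.8            # assumed signal swing in volts


def wire_charge_energy_j(length_mm: float) -> float:
    """Energy drawn from the supply to charge the wire once: E = C * V^2."""
    capacitance_f = CAP_PER_MM_F * length_mm
    return capacitance_f * SWING_V ** 2


interposer_path_mm = 3.0   # assumed 2.5D path: a few millimetres sideways
stacked_path_mm = 0.003    # assumed 3D path: a few microns straight up

energy_25d_j = wire_charge_energy_j(interposer_path_mm)
energy_3d_j = wire_charge_energy_j(stacked_path_mm)
energy_ratio = energy_25d_j / energy_3d_j

print(f"2.5D path: {energy_25d_j:.3e} J per charge")
print(f"3D path:   {energy_3d_j:.3e} J per charge")
print(f"ratio:     {energy_ratio:.0f}x")
```

Because the model is linear in length, the energy ratio simply equals the length ratio. The point is the scaling, not the absolute joules.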

Going vertical: TSVs, microbumps, and hybrid bonding

To stack die B on top of die A and still get signals between them, you need a way to punch a conductor *through* the body of the silicon. That conductor is the through-silicon via (TSV) — a copper-filled hole, often only a few microns wide, drilled vertically through a thinned die. Think of it as an elevator shaft running through a building so the floors can be wired together without going outside and climbing a ladder.

But a TSV only moves a signal *through* one die. To join two stacked dies you also need contacts at the surface where they meet. The first generation used microbumps: tiny solder balls, maybe 20–40 µm apart, that reflow and weld the top die's pads to the bottom die's pads. Microbumps are the same idea as the flip-chip bumps from rung 2, just shrunk dramatically. They work, but solder has limits — you can only make balls so small before neighbouring ones bridge, and solder adds resistance and a tiny gap.

The frontier technique removes the solder entirely. In copper-to-copper hybrid bonding, the two dies are polished mirror-flat, the oxide surfaces are bonded directly (like two clean glass slides sticking together), and embedded copper pads on each face are aligned and fused by a heat step until they become one continuous piece of copper. There is no ball and no gap, just metal meeting metal. This pushes the connection pitch from ~40 µm (microbumps) down toward 1 µm or less. Because the number of connections per unit area grows with the inverse square of the pitch, the density of die-to-die connections rises by *hundreds of times or more*, and the resistance and capacitance of each link fall sharply.
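The effect of pitch on density is easy to check with a few lines of arithmetic. This sketch assumes pads sit on a simple square grid, which is a simplification (real layouts may use other patterns and reserve area for power and alignment).

```python
def connections_per_mm2(pitch_um: float) -> float:
    """Pads per square millimetre on a square grid with the given pitch."""
    pads_per_mm = 1000.0 / pitch_um   # 1 mm = 1000 um
    return pads_per_mm ** 2


microbump_density = connections_per_mm2(40.0)   # ~40 um microbump pitch
hybrid_density = connections_per_mm2(1.0)       # ~1 um hybrid-bond pitch
density_gain = hybrid_density / microbump_density

print(microbump_density)  # 625.0
print(hybrid_density)     # 1000000.0
print(density_gain)       # 1600.0
```

Shrinking the pitch by 40x raises the density by 40 squared, because the pads tile an area rather than a line.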

  2.5D (side-by-side on interposer)        3D stack (hybrid bonded)
  -----------------------------------       --------------------------
   [ GPU ]  ~mm of wire   [ HBM ]            +========+  die B (top)
      |________________________|             | copper |  <-- ~1 um pitch
          interposer (passive)               +========+  hybrid bond
   ====================================       | copper |
             ~mm path, more C, more energy   +========+  die A (bottom)
                                              | TSV  | |  signals exit below
                                              +------+-+
   3D vertical path:  a few microns  =>  ~1000x shorter, lower energy/bit

2.5D routes millimetres sideways; a 3D hybrid-bonded stack routes microns straight up.

Why break a big chip into pieces?

Now zoom all the way out. Stacking solves *distance*. But there is a second, even more brutal force reshaping chip design: economics, driven by yield. Yield is the fraction of dies on a wafer that come out working. Defects fall on a wafer roughly at random, say one killer defect for every ten square centimetres on average. A small die will usually dodge them. A huge die is far more likely to catch at least one, and one fatal defect kills the *entire* die.

The mathematics is unforgiving. The expected number of defects on a die scales with its *area*, so yield falls roughly exponentially as a die grows. In the simple Poisson model, doubling the area squares the yield: a die that yields 55% becomes one that yields about 30%. Worse, the biggest designs bump into the reticle limit, the maximum area (~858 mm²) a lithography scanner can print in one shot. You cannot print a monolithic die bigger than that in a single exposure.

Toy yield model (Poisson):  Y = exp(-A * D)
   A = die area (cm^2),  D = defect density (defects/cm^2),  D = 0.1

   One monolithic die, A = 6.0 cm^2:   Y = exp(-0.60) = 0.55   (45% scrapped!)
   Split into 4 chiplets, A = 1.5 cm^2 each:
        each chiplet  Y = exp(-0.15) = 0.86
        => far more good dies per wafer; toss a 1.5 cm^2 reject, not a 6 cm^2 one

Smaller dies catch fewer defects — and when one fails you throw away a small, cheap piece, not a giant one.
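The toy model above can be reproduced in a few lines. The sketch uses the same Poisson formula and the same numbers (D = 0.1 defects/cm², 6.0 cm² versus 1.5 cm²). The function and variable names are illustrative.

```python
import math


def poisson_yield(area_cm2: float, defects_per_cm2: float) -> float:
    """Fraction of dies with zero defects under a Poisson defect model."""
    return math.exp(-area_cm2 * defects_per_cm2)


DEFECT_DENSITY = 0.1  # defects per cm^2, as in the toy model

monolithic_yield = poisson_yield(6.0, DEFECT_DENSITY)
chiplet_yield = poisson_yield(1.5, DEFECT_DENSITY)

# Chance that four untested chiplets picked at random are all good.
four_untested_good = chiplet_yield ** 4

print(f"monolithic 6.0 cm^2 die: {monolithic_yield:.2f}")
print(f"one 1.5 cm^2 chiplet:    {chiplet_yield:.2f}")
print(f"four untested chiplets:  {four_untested_good:.2f}")
```

Note the last line: four untested chiplets are all good exactly as often as the monolithic die is. Splitting does not remove defects. The gain comes from testing each chiplet separately, so a defect costs you one small die instead of the whole design.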

This is the economic engine behind the chiplet revolution. Instead of one giant monolithic SoC, you *disaggregate* it: chop the design into several smaller dies, manufacture each one separately, test them, and keep only the good ones — the known-good dies. Then you re-integrate the survivors into one package. You're trading a single expensive lottery ticket for a handful of cheap ones, and only paying for winners.
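One way to see the "only paying for winners" effect is to count how much wafer area you must fabricate, on average, to end up with one working product. The sketch below reuses the toy Poisson numbers (D = 0.1 defects/cm², one 6.0 cm² die versus four 1.5 cm² chiplets). It is a deliberate simplification: it ignores test cost, assembly cost, assembly losses, and the extra area spent on die-to-die interfaces.

```python
import math

DEFECT_DENSITY = 0.1  # defects per cm^2, as in the toy yield model


def silicon_per_good_die_cm2(area_cm2: float) -> float:
    """Average wafer area fabricated to obtain one working die."""
    die_yield = math.exp(-area_cm2 * DEFECT_DENSITY)
    return area_cm2 / die_yield


# Monolithic: one 6.0 cm^2 die must come out defect-free.
monolithic_cm2 = silicon_per_good_die_cm2(6.0)

# Chiplets: four known-good 1.5 cm^2 dies, each tested before assembly.
chiplet_cm2 = 4 * silicon_per_good_die_cm2(1.5)

saving = 1.0 - chiplet_cm2 / monolithic_cm2

print(f"monolithic: {monolithic_cm2:.2f} cm^2 per good product")
print(f"chiplets:   {chiplet_cm2:.2f} cm^2 per good product")
print(f"silicon saved: {saving:.0%}")
```

Under these assumptions the chiplet version consumes roughly a third less silicon per working product, which is the margin that has to pay for the extra testing and packaging.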

Mix and match: heterogeneous integration

Splitting for yield is only half the prize. Once a chip is a collection of pieces, those pieces no longer have to be made the same way. This is heterogeneous integration: each chiplet can come from the process node that suits *it* best, and the pieces are then assembled into one system.

Why does that matter? Different circuits scale differently. Logic — the transistors doing arithmetic — keeps shrinking beautifully on each new node and *loves* the latest (and most expensive) EUV process. But analog circuits, high-voltage I/O drivers, and SRAM have largely stopped shrinking; paying for cutting-edge silicon to build them is pure waste. So you fabricate the dense compute logic on the newest node, the I/O and analog on a cheaper mature node, perhaps the memory on a memory-specialized process — and bolt them together. You get the best of every world without forcing one process to do everything.
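A back-of-the-envelope sketch makes the saving visible. Every number here is a hypothetical placeholder in relative units: the cost ratio between nodes and the block areas are invented for illustration, and the model assumes the I/O and analog block occupies the same area on either node because it has largely stopped shrinking.

```python
# Hypothetical relative silicon cost per mm^2 (not real foundry pricing).
COST_PER_MM2 = {"leading": 3.0, "mature": 1.0}

# Hypothetical block areas in mm^2 for a made-up SoC.
BLOCK_AREA_MM2 = {"cpu_logic": 100.0, "io_and_analog": 100.0}


def silicon_cost(node_for_block: dict[str, str]) -> float:
    """Total relative silicon cost for a block-to-node assignment."""
    return sum(
        BLOCK_AREA_MM2[block] * COST_PER_MM2[node]
        for block, node in node_for_block.items()
    )


# Monolithic: everything is forced onto the leading node.
monolithic_cost = silicon_cost(
    {"cpu_logic": "leading", "io_and_analog": "leading"}
)

# Heterogeneous: only the logic pays the leading-node price.
split_cost = silicon_cost(
    {"cpu_logic": "leading", "io_and_analog": "mature"}
)

print(monolithic_cost)  # 600.0
print(split_cost)       # 400.0
```

The logic costs the same in both cases. The difference is entirely the I/O and analog area that no longer rides on expensive silicon.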

A worked example: CPU compute dies + one IO die

Make it concrete with the architecture that put chiplets on the map — a server CPU built from two kinds of die. The compute dies (CCDs) hold nothing but CPU cores and cache, fabricated on the most advanced node so the cores run fast and dense. A separate I/O die (IOD) holds everything that doesn't shrink well: the memory controllers, the PCIe and USB physical interfaces, the inter-socket links. It's built on a cheaper, older node because there's no benefit to doing otherwise.

1. Fabricate separately. Make many small compute dies on the leading node and IO dies on a mature node. Each die is small, so each wafer yields lots of good ones.
2. Test before assembly. Probe every die and keep only the known-good dies. You only want to spend assembly cost on pieces you already know work.
3. Bin and mix. A fast compute die goes into a premium part; a slightly slower one into a cheaper SKU. The *same* IO die serves both, so a whole product line is harvested from one parts bin.
4. Re-integrate. Place one or more compute dies plus the IO die on a package substrate (or interposer), connect them die-to-die, and seal the package.
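The four steps above can be sketched as a tiny sorting pipeline. This is an illustration of the flow, not a description of any vendor's production process: the `Die` record, the probe results, the die IDs, and the 4.5 GHz bin threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Die:
    die_id: str
    passed_probe: bool     # result of testing the die before assembly
    max_clock_ghz: float   # measured speed, used for binning


# Step 1 (fabricate): hypothetical probe data for six compute dies.
compute_dies = [
    Die("ccd-0", True, 4.6),
    Die("ccd-1", False, 0.0),
    Die("ccd-2", True, 4.1),
    Die("ccd-3", True, 4.7),
    Die("ccd-4", True, 3.9),
    Die("ccd-5", False, 0.0),
]

PREMIUM_MIN_GHZ = 4.5  # hypothetical bin threshold


def known_good(dies: list[Die]) -> list[Die]:
    """Step 2 (test): keep only dies that passed probe."""
    return [d for d in dies if d.passed_probe]


def bin_by_speed(dies: list[Die]) -> tuple[list[Die], list[Die]]:
    """Step 3 (bin): split good dies into premium and value bins."""
    premium = [d for d in dies if d.max_clock_ghz >= PREMIUM_MIN_GHZ]
    value = [d for d in dies if d.max_clock_ghz < PREMIUM_MIN_GHZ]
    return premium, value


def assemble(ccds: list[Die], io_die_id: str) -> dict:
    """Step 4 (re-integrate): pair compute dies with one IO die."""
    return {"io_die": io_die_id, "ccds": [d.die_id for d in ccds]}


good_dies = known_good(compute_dies)
premium_bin, value_bin = bin_by_speed(good_dies)

# Both parts use the same kind of IO die; only the compute dies differ.
premium_part = assemble(premium_bin[:2], "iod-0")
value_part = assemble(value_bin[:2], "iod-1")

print(premium_part)
print(value_part)
```

The two failed dies never reach assembly, and the four survivors become two different products that share one IO die design.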

                package substrate / interposer
   +---------+   +---------+         +-------------------+
   |  CCD 0  |   |  CCD 1  |  ...    |     IO die (IOD)  |
   | cores + |   | cores + |  <==>   | mem ctrl, PCIe,   |
   |  cache  |   |  cache  |  d2d    | USB, socket links |
   | (newest |   | (newest |  link   |  (mature node)    |
   |  node)  |   |  node)  |         +-------------------+
   +----+----+   +----+----+              ^   ^   ^
        |  d2d links (UCIe)              DDR PCIe USB  to the outside world
        +-------------------------------+
   Build a 4-core, 8-core, ... or 64-core part by changing how many CCDs you drop in.

Compute dies on the newest node + a shared IO die on a mature node, joined by standardized die-to-die links. Scale core count by adding CCDs.

Look at what this buys you. You scale from a 4-core chip to a 64-core monster *by placing more compute dies* — no new design, no new mask set, no reticle-limit wall. The expensive leading-node silicon is spent only on cores. And because each die is small, your wafers are full of known-good dies instead of half-scrapped giants. The same idea generalizes far beyond CPUs: GPUs and AI accelerators now tile many compute chiplets next to (and atop) memory and I/O the very same way.

Making the pieces snap together: UCIe and co-optimization

There's a catch hiding in all this elegance. If your compute die from one vendor and your IO die from another can't agree on *how* to talk across the gap — voltage, pitch, protocol, error handling — then chiplets are just custom one-off pairings, not a marketplace. For decades, each company's die-to-die link was a private dialect. The industry answer is UCIe (Universal Chiplet Interconnect Express), an open standard that defines the physical bumps/pitches, the electrical signaling, and the protocol so that a chiplet from one supplier can plug into another's package — a 'PCIe moment' for the inside of the package.

There's one last shift in mindset to absorb. When a system is split across many dies and stacks, you can no longer design the chip first and the package later. The split itself — which functions go on which die, where the thermal hot-spots land, how power gets delivered up a 3D stack, how die-to-die links are budgeted — has to be decided together, up front. This holistic approach is system-technology co-optimization (STCO): optimizing the silicon, the package, and the partition as one problem instead of three.
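One of those up-front budgets, die-to-die bandwidth, can be sanity-checked with simple arithmetic before any layout exists. The sketch below assumes one bit per lane per transfer and ignores protocol and encoding overhead. The lane count, transfer rate, and required bandwidth are placeholders chosen for the arithmetic, not values taken from the UCIe specification or from any product.

```python
import math


def link_bandwidth_gbytes_per_s(lanes: int, gtransfers_per_s: float) -> float:
    """Raw one-direction bandwidth, assuming one bit per lane per transfer."""
    return lanes * gtransfers_per_s / 8.0   # 8 bits per byte


def links_needed(required_gbytes_per_s: float, per_link_gbytes_per_s: float) -> int:
    """Smallest whole number of links that meets the requirement."""
    return math.ceil(required_gbytes_per_s / per_link_gbytes_per_s)


# Placeholder link: 64 data lanes at 16 GT/s each.
per_link = link_bandwidth_gbytes_per_s(lanes=64, gtransfers_per_s=16.0)

# Placeholder requirement: a compute die that needs 300 GB/s to the IO die.
link_count = links_needed(300.0, per_link)

print(per_link)    # 128.0
print(link_count)  # 3
```

A check like this feeds straight back into the partition: if the link count does not fit along the die edge, the split itself has to change.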