The memory wall, and why the board is the bottleneck
Imagine a champion chef who can chop, sear and plate a dish in ten seconds flat — but the only door to the pantry is a narrow corridor with one waiter shuffling ingredients in by hand. The kitchen's speed is no longer set by the chef; it is set by the corridor. That corridor is the problem at the heart of every high-performance chip today. A modern GPU or AI accelerator can perform tens of trillions of operations per second, but each operation needs operands fetched from memory, and the path to that memory has become the slowest link in the whole machine. We call this the memory wall.
Why is the path so slow? On a conventional board, the processor and the DRAM are separate packages soldered centimetres apart, talking over copper traces on a printed circuit board. Those traces are wide and far apart — measured in tens of microns to millimetres — so you can only fit a few hundred wires between two packages, and each runs at a modest data rate before reflections and crosstalk ruin the signal. Multiply width by rate and you get a bandwidth ceiling. For a workload that streams gigabytes of weights through the chip every millisecond, that ceiling is a brick wall.
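The width-times-rate relation can be sketched in a few lines of Python. This is a minimal illustration rather than a model of any real memory controller: the function name `link_bandwidth_gbytes` is made up for this example, and the 64-wire, 3.2 Gbit/s figures are the illustrative board-level numbers used in this section.

```python
def link_bandwidth_gbytes(wires: int, gbit_per_wire: float) -> float:
    """Peak bandwidth in GB/s for `wires` parallel data wires,
    each running at `gbit_per_wire` Gbit/s."""
    return wires * gbit_per_wire / 8  # 8 bits per byte


# Illustrative board-level DRAM channel: 64 data wires at 3.2 Gbit/s each.
channel = link_bandwidth_gbytes(64, 3.2)
print(f"one channel: {channel:.1f} GB/s")

# A few channels side by side (the channel counts are assumptions).
for n in (4, 8):
    print(f"{n} channels: {n * channel:.1f} GB/s")
```

Both factors are capped by the board: the wire count by trace width and spacing, the per-wire rate by signal integrity. Raising either one lifts the ceiling.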
Bandwidth = (number of parallel wires) x (data rate per wire)
Classic board-level DRAM channel
~64 data wires/channel x ~3.2 Gbit/s / 8 bits per byte ≈ 25.6 GB/s per channel
a few channels -> ~100-200 GB/s total
The ceiling comes from PHYSICS of the board:
- traces are wide (tens of um) -> few wires fit between packages
- traces are long (cm) -> high R, L, C -> slow, lossy, power-hungry
- drivers must shout across cm -> pJ per bit is high
Goal of 2.5D: replace the cm-long board hop with a um-short hop
and pack THOUSANDS of wires instead of hundreds.
The interposer: a silicon circuit board under the dies
The centrepiece of 2.5D integration is the interposer — think of it as a tiny, ultra-high-resolution circuit board, but made of silicon (or sometimes an organic or redistribution-layer material) instead of fibreglass. In rung 2 you learned how a flip-chip die bonds face-down to a package using solder bumps, and how the finest of those, microbumps, connect a die to a carrier at pitches of around 40 µm or less. The interposer is that carrier. The logic die and its memory both flip-chip onto the top of the interposer with microbumps; the interposer's job is to wire them together with a density no organic board can match.
Why can silicon route so much more densely? Because an interposer is fabricated with the very same lithography that makes chips. Its wiring layers are patterned at micron and sub-micron line widths, so a single interposer can carry tens of thousands of parallel signals between neighbouring dies. That is the magic number: where a board offers hundreds of wires, the interposer offers tens of thousands, each only a millimetre or two long. The bandwidth ceiling lifts by one to two orders of magnitude, and it does so laterally — the dies sit side by side, not stacked, which is exactly why this is called 2.5D rather than full 3D.
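To see where "hundreds" versus "tens of thousands" comes from, here is a rough back-of-the-envelope sketch. Every number in it is an assumption chosen for illustration (a 10 mm shared die edge, a 100 µm board trace pitch, a 2 µm interposer wire pitch, four signal layers each); the helper `wires_across_edge` is hypothetical and not taken from any design tool.

```python
def wires_across_edge(edge_mm: float, pitch_um: float, layers: int = 1) -> int:
    """Parallel wires that fit along one die edge.
    `pitch_um` is line width plus spacing; `layers` is the number of routing layers."""
    wires_per_layer = int(edge_mm * 1000 / pitch_um)
    return wires_per_layer * layers


EDGE_MM = 10.0  # assumed length of the edge shared by two neighbouring dies

# Assumed organic board: 100 um trace pitch, 4 signal layers.
board = wires_across_edge(EDGE_MM, pitch_um=100.0, layers=4)

# Assumed silicon interposer: 2 um wire pitch, 4 routing layers.
interposer = wires_across_edge(EDGE_MM, pitch_um=2.0, layers=4)

print(f"board:      {board} wires")
print(f"interposer: {interposer} wires")
print(f"ratio:      {interposer // board}x")
```

Under these assumptions the gain comes entirely from the finer pitch. Real designs also spend routing tracks on power, ground and shielding, so usable signal counts will be lower on both sides.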
SIDE VIEW of a 2.5D package (not to scale)
[ Logic die / GPU ] [ HBM stack ]
| | | | | | | | | | | | | | | <- microbumps (~40um pitch)
========================================== <- SILICON INTERPOSER
|| fine routing: 10000s of wires, ~1mm || (top: dense Cu wiring)
|| TSV TSV TSV TSV TSV TSV || (TSVs punch down)
==========================================
O O O O O O O O O O <- C4 solder bumps (~150um)
########################################## <- PACKAGE SUBSTRATE
o o o o o o o o o o o o o o o o o o <- BGA balls to the board
Two regimes of wiring:
TOP (interposer) = dense + short -> die-to-die signals
BOTTOM (substrate) = coarse + long -> power & off-package I/O
Through-silicon vias: punching power straight down
There is a problem hidden in that last diagram. If the dies sit on top of the interposer, and the interposer is a solid slab of silicon, how does power get up to them, and how do the slow off-package signals get back down to the substrate and out to the board? You cannot route everything around the edges. The answer is to drill straight through the silicon. A through-silicon via (TSV) is a vertical copper post, typically 5–10 µm in diameter and roughly 50–100 µm tall, that passes clean through the thinned interposer from top face to bottom face, electrically connecting the microbumps above to the C4 bumps below.
Making a TSV is genuine 3D micro-machining. Engineers etch a deep, narrow hole into the silicon (a high-aspect-ratio deep reactive-ion etch, commonly the Bosch process), line it with an insulating liner so the copper does not short to the conductive silicon, then fill it with electroplated copper. The wafer is later ground thin from the back so the bottoms of the vias are exposed and can be bumped. The result is thousands of vertical wires carrying power, ground and the comparatively slow I/O signals down to the substrate, while the fast die-to-die chatter stays up top in the fine routing, never having to take the trip down.
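A quick estimate shows why a short, fat copper post is a good way to deliver power. The sketch below treats a TSV as a plain copper cylinder and applies R = ρL/A. It assumes the bulk resistivity of copper (about 1.68e-8 Ω·m near room temperature), which electroplated copper in a real via will not quite reach, and it ignores the liner and contact resistance. The function names are illustrative.

```python
import math

RHO_CU = 1.68e-8  # ohm*m, bulk copper near room temperature (assumed; plated Cu is somewhat higher)


def tsv_resistance_ohm(diameter_um: float, height_um: float) -> float:
    """DC resistance of a solid copper cylinder: R = rho * L / A."""
    radius_m = diameter_um * 1e-6 / 2
    area_m2 = math.pi * radius_m ** 2
    return RHO_CU * (height_um * 1e-6) / area_m2


def aspect_ratio(diameter_um: float, height_um: float) -> float:
    """Depth-to-width ratio the etch has to achieve."""
    return height_um / diameter_um


# The two ends of the 5-10 um diameter, ~10:1 aspect-ratio range.
for d_um, h_um in ((5, 50), (10, 100)):
    r = tsv_resistance_ohm(d_um, h_um)
    print(f"{d_um} um x {h_um} um: aspect {aspect_ratio(d_um, h_um):.0f}:1, "
          f"R ~ {r * 1e3:.0f} milliohm")
```

Each via comes out at a few tens of milliohms under these assumptions, and a power net uses many of them in parallel, which divides the resistance further.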
Anatomy of a Through-Silicon Via (TSV)
   microbump  o                <- die / top-side routing
            --|--              top metal
         +=========+
         | Cu fill |           <- electroplated copper post
         | (5-10um |              diameter      5-10 um
         |  wide)  |              height        50-100 um
         |         |              aspect ratio  ~10:1
         |         |           <- barrier liner (insulator): keeps Cu from shorting to Si bulk
         +=========+
            --|--              bottom metal (exposed by backgrind)
     C4 bump  O                <- down to the package substrate
WHAT TSVs CARRY (mostly): power, ground, slow off-package I/O
WHAT STAYS ON TOP: fast die-to-die signals in fine routing
HBM: a memory skyscraper with a wide front door
Now for the star of the worked example. High-bandwidth memory (HBM) is the partner die that 2.5D was practically invented to host. Where ordinary DRAM is a flat chip soldered far away on the board, HBM is a vertical stack of DRAM dies — typically 8, 12 or 16 of them — bonded one on top of another and threaded together by their own TSVs, sitting on a base logic die. It is a memory skyscraper: instead of spreading the storage out across the board, you build it upward and park the whole tower right next to the processor.
The reason a stack helps is the same lesson again — short, fine, numerous wires. Stacking the DRAM dies and running TSVs up through them means the memory presents an enormously wide interface to the world: an HBM stack exposes roughly a thousand-bit-wide data bus, versus the 64 bits of a board-level channel. That ultra-wide bus only works because the interposer underneath can fan out a thousand-plus wires across the millimetre gap to the logic die without ever touching the board. HBM and the interposer are co-designed: the wide door only matters if there is a wide corridor to meet it.
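The effect of that wide door is easy to put into numbers. The sketch below uses this section's illustrative figures (a 1024-bit bus, about 6.4 Gbit/s per pin, four stacks, and a board-level subsystem of roughly 100 to 200 GB/s); the helper name `stack_bandwidth_gbytes` is made up for the example, and the per-pin rate is an HBM-class placeholder rather than a quoted specification.

```python
def stack_bandwidth_gbytes(bus_bits: int, gbit_per_pin: float) -> float:
    """Peak bandwidth of one memory stack in GB/s."""
    return bus_bits * gbit_per_pin / 8  # 8 bits per byte


BUS_BITS = 1024       # data bus width of one HBM stack
GBIT_PER_PIN = 6.4    # illustrative HBM-class pin rate
STACKS = 4

per_stack = stack_bandwidth_gbytes(BUS_BITS, GBIT_PER_PIN)
aggregate = STACKS * per_stack

print(f"per stack: {per_stack:.1f} GB/s")
print(f"aggregate: {aggregate / 1000:.2f} TB/s")

# Compare against the illustrative board-level range of 100-200 GB/s.
for board_gbytes in (100, 200):
    print(f"vs {board_gbytes} GB/s board subsystem: {aggregate / board_gbytes:.0f}x")
```

The per-pin rate here is only about twice the board-level figure, so almost all of the gain comes from the bus width, which is exactly the part the interposer makes possible.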
Worked example: one logic die + 4 HBM stacks on a silicon interposer
  +---------+          +---------+
  |  HBM 0  |          |  HBM 1  |      each HBM stack:
  +---------+          +---------+        ~1024-bit data bus
        \                  /              stacked DRAM + TSVs
       +-----------------+                on a base logic die
       |   LOGIC / GPU   |
       +-----------------+
        /                  \
  +---------+          +---------+
  |  HBM 2  |          |  HBM 3  |
  +---------+          +---------+
  <------- all on ONE interposer ------->
Bandwidth per stack (illustrative HBM-class numbers):
1024 bits x ~6.4 Gbit/s/pin / 8 ≈ ~800 GB/s per stack
4 stacks -> ~3.2 TB/s aggregate
Compare a board-level DRAM subsystem: ~0.1-0.2 TB/s
-> roughly a 15-30x bandwidth jump, at LOWER energy per bit.
2.5D vs 3D, and the price of all that silicon
It pays to be precise about why this is 2.5D and not 3D. In 2.5D the active dies all sit side-by-side on a shared interposer; the only thing stacked is the passive interposer beneath them (and the DRAM tier inside an HBM stack). True 3D integration — the subject of the next rung — goes further: it stacks active logic dies directly on top of one another, bonding them face-to-face or face-to-back, often with hybrid bonding that fuses copper pads with no solder at all. 2.5D spreads heat out and keeps the dies separable; 3D wins on the shortest possible vertical wires but fights brutal heat and bonding challenges. 2.5D is the mature, in-production workhorse; 3D is the steeper frontier just beyond it.
None of this is cheap. A silicon interposer is, itself, a large piece of silicon fabricated with chip-grade lithography, and a big interposer is hard to make without defects — its yield drops as its area grows, and a flagship accelerator may demand an interposer larger than a single lithography exposure can print, forcing exotic stitching. The economic lever is the interposer's reach and density: how far across it the fine wires can run, and how many you can pack. More reach lets you place more HBM stacks and bigger logic dies; more density lifts the bandwidth. Both cost area, and area on silicon costs money. That is the core trade-off of 2.5D.
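One common first-order way to see how yield falls with area is the Poisson defect model, Y = exp(-A x D0). The sketch below applies it to a few interposer sizes. The defect density `D0` is a hypothetical placeholder, not a figure for any real process, and the 26 mm x 33 mm exposure field is assumed here as a typical single-exposure limit.

```python
import math

RETICLE_MM2 = 26 * 33  # assumed typical single-exposure field, in mm^2


def poisson_yield(area_mm2: float, defects_per_cm2: float) -> float:
    """Fraction of defect-free parts under a simple Poisson defect model."""
    return math.exp(-(area_mm2 / 100.0) * defects_per_cm2)  # 100 mm^2 per cm^2


def exposures_needed(area_mm2: float) -> int:
    """Lower bound on stitched exposures needed to cover the area."""
    return math.ceil(area_mm2 / RETICLE_MM2)


D0 = 0.05  # hypothetical defect density, in defects per cm^2

for area in (400, 858, 1700):
    print(f"{area:>5} mm^2: yield ~{poisson_yield(area, D0):.0%}, "
          f"exposures >= {exposures_needed(area)}")
```

The exact percentages depend entirely on the assumed defect density. The shape is the point: yield decays exponentially with area, and past one exposure field stitching is needed as well.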
That cost pressure is exactly why the industry is hunting for cheaper carriers. Some 2.5D packages swap the full silicon slab for a small silicon bridge embedded in an otherwise organic substrate: fine routing only where two dies actually meet, coarse organic everywhere else. Others use a thick redistribution-layer (RDL) interposer built up without a silicon wafer at all. All of them serve the same goal that motivates this whole rung: deliver interposer-class wiring between dies, ideally without paying for interposer-class silicon everywhere. This is the engine of modern heterogeneous integration and chiplet design, with the die-to-die interfaces themselves standardised by efforts such as UCIe.
Putting it together: the assembly flow
Step back and the whole 2.5D recipe is a sequence of bonds at shrinking pitches, each one a translator between a finer world above and a coarser world below. Here is the canonical flow for a logic-plus-HBM module.
- Fabricate the interposer wafer: pattern the fine top-side copper routing and etch, line and fill the TSVs, then backgrind to expose the via bottoms.
- Pre-test every die. The logic die and each HBM stack must each pass as a known-good die — you only want proven parts touching that expensive interposer.
- Bond the dies onto the interposer top with fine microbumps (~40 µm pitch), then flow underfill under each die to relieve thermal stress on those tiny joints.
- Attach the populated interposer to the package substrate with coarse C4 bumps (~150 µm pitch) — the TSVs now carry power up and slow I/O down to the substrate.
- Add the lid and heat spreader, attach the BGA solder balls to the underside of the substrate, and the module is ready to drop onto a board, presenting terabytes per second of internal memory bandwidth behind a normal-looking package.
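The "shrinking pitches" in the flow above translate directly into connection density. As a rough sketch, assuming bumps laid out on a full square grid (real bump maps leave keep-out zones and mix pitches), the function below converts a pitch into bumps per square millimetre for the two pitches quoted in the steps. The function name is illustrative.

```python
def bumps_per_mm2(pitch_um: float) -> float:
    """Connections per mm^2 for a full square grid at the given pitch."""
    per_mm = 1000.0 / pitch_um  # bumps along 1 mm
    return per_mm ** 2


PITCHES_UM = {
    "microbump (die -> interposer)": 40.0,
    "C4 bump (interposer -> substrate)": 150.0,
}

for name, pitch in PITCHES_UM.items():
    print(f"{name}: {pitch:.0f} um pitch -> ~{bumps_per_mm2(pitch):.0f} per mm^2")

ratio = bumps_per_mm2(40.0) / bumps_per_mm2(150.0)
print(f"microbumps are ~{ratio:.0f}x denser than C4 bumps")
```

Because density scales with the inverse square of pitch, each step down the ladder gives up roughly an order of magnitude of connections, which is why the dense die-to-die traffic has to stay on the interposer.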