Thermal, Electromigration and Reliability: Will It Last?

The bench lies: working ≠ lasting

Through this track you learned how to attach a die (flip-chip vs wire bond), how to spread its pins onto a substrate, how to fan a thousand signals out through an interposer, and how to stack dies vertically in a 3D IC. Every one of those choices made the package *work*. None of them, by itself, makes it *last*. A part that boots, runs a benchmark, and passes at-speed test has cleared a bar measured in seconds. The customer expects it to clear a bar measured in years — ten years for a phone, fifteen for a car, longer for a satellite that can never be touched again.

Three slow killers stand between *works today* and *works in 2040*: heat that has nowhere to escape, electromigration that erodes metal one atom at a time, and thermal-cycling fatigue that cracks solder a fraction of a micron with every wake-and-sleep. They are slow precisely because they are statistical and cumulative — which is exactly what makes them dangerous. You cannot see them on a scope. You can only model them, accelerate them, and qualify against them.

Heat: every watt has to leave

Start from conservation: a chip turns electrical energy into heat, and in steady state every watt that goes in must come out; until it does, the temperature keeps climbing. Thermal management is the engineering of that escape path. The currency is thermal resistance, written θ (theta) and measured in kelvin per watt (K/W): the temperature rise you pay per watt you push through a layer. In steady state it behaves like an electrical resistance, and the whole heat path is a series-resistor network you can solve with Ohm's law for heat: ΔT = P · θ.

Heat path of a flip-chip package (junction -> ambient), as a resistor ladder:

  Tj (junction, the hot transistor)
   |
  [ θ_jc ]   die silicon (junction to die backside)   ~0.1-0.3 K/W
   |
  [ θ_TIM ]  thermal interface material (paste/solder)  ~0.05-0.2 K/W
   |
  [ θ_lid ]  copper lid / integrated heat spreader      ~0.05 K/W
   |
  [ θ_hs ]   heatsink + airflow (or cold plate)   ~0.1-0.5 K/W
   |
  Ta (ambient air or coolant)

Worked example -- a 150 W CPU, θ_ja(total) = 0.25 K/W:
   Tj = Ta + P * θ_ja
      = 45 C + 150 W * 0.25 K/W
      = 45 + 37.5 = 82.5 C    <- comfortably below the ~105 C limit

Now stack the power, not the cooling -- 250 W into the SAME 0.25 K/W path:
   Tj = 45 + 250 * 0.25 = 45 + 62.5 = 107.5 C   <- over the limit. Throttle.

Junction-to-ambient is a series resistor ladder. Lower any θ in the chain and the whole junction temperature drops; the same path that cooled 150 W cooks at 250 W.
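
The ladder arithmetic is short enough to script. What follows is a minimal Python sketch, not a thermal model: the function names are invented for illustration, and it assumes a purely one-dimensional steady-state path with the 0.25 K/W total, 45 C ambient and ~105 C limit taken from the worked example.

```python
def junction_temp_c(power_w, ambient_c, theta_ja_k_per_w):
    """Ohm's law for heat: Tj = Ta + P * theta_ja (steady state, 1-D path)."""
    return ambient_c + power_w * theta_ja_k_per_w


def max_power_w(tj_limit_c, ambient_c, theta_ja_k_per_w):
    """Largest power the path can carry before Tj reaches the limit."""
    return (tj_limit_c - ambient_c) / theta_ja_k_per_w


THETA_JA = 0.25     # K/W, total from the worked example
AMBIENT_C = 45.0    # C
TJ_LIMIT_C = 105.0  # C, the approximate limit quoted in the example

for power in (150.0, 250.0):
    tj = junction_temp_c(power, AMBIENT_C, THETA_JA)
    status = "ok" if tj <= TJ_LIMIT_C else "over limit"
    print(f"{power:.0f} W -> Tj = {tj:.1f} C ({status})")

print(f"power budget: {max_power_w(TJ_LIMIT_C, AMBIENT_C, THETA_JA):.0f} W")
```

Inverting the same formula gives the power budget: with these numbers it works out to 240 W, which is why the 250 W case tips over the limit.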

What makes modern packages hard is not total power but heat-flux density — watts per square millimetre. A GPU compute die can dissipate over 1 W/mm², and hot-spots inside it spike far higher. Heat is happy to leave a thin, wide die; it is miserable leaving a *tall* one. That is the curse of the 3D IC: when you stack two or three dies, the bottom die's heat must climb up through the dies above it to reach the lid, and silicon plus the bonding interfaces add θ at every floor. The logic die at the bottom of an HBM-style stack is the worst seat in the house — buried, power-hungry, and farthest from the heatsink.
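
To see why the bottom seat is the worst one, the stack can be sketched as a ladder in which every die adds its own heat on the way up. The Python below is a rough one-dimensional illustration only: it assumes all heat leaves through the top of the stack, and every resistance and power value is a made-up placeholder rather than data for any real part.

```python
def stack_temperatures(powers_w, theta_between, theta_top_to_ambient, ambient_c):
    """Die temperatures (bottom die first) for a 1-D vertical stack.

    powers_w[0] is the bottom die. theta_between[i] is the thermal
    resistance (K/W) between die i and die i + 1. All heat is assumed
    to leave upward, through the top die, to ambient.
    """
    n = len(powers_w)
    if len(theta_between) != n - 1:
        raise ValueError("need one inter-die resistance per adjacent pair")

    temps = [0.0] * n
    # The top die sees the total power crossing the final resistance.
    temps[-1] = ambient_c + sum(powers_w) * theta_top_to_ambient
    # Walk downward: each interface carries the heat of every die below it.
    for i in range(n - 2, -1, -1):
        heat_through_w = sum(powers_w[: i + 1])
        temps[i] = temps[i + 1] + heat_through_w * theta_between[i]
    return temps


# Hypothetical stack: a 30 W logic die under two 5 W memory dies.
temps = stack_temperatures(
    powers_w=[30.0, 5.0, 5.0],
    theta_between=[0.2, 0.2],
    theta_top_to_ambient=0.5,
    ambient_c=45.0,
)
for level, t in enumerate(temps):
    print(f"die {level} (0 = bottom): {t:.1f} C")
```

In this model the bottom die is always the hottest, because its heat crosses every interface above it, and moving the power-hungry die to the top of the same stack lowers the peak temperature.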

And heat does not act alone. Hotter silicon leaks more current; more leakage means more heat; more heat means more leakage. That positive feedback is thermal runaway, and in a poorly cooled stack it can latch the part into a meltdown. The job of thermal design is to keep the loop gain below one — to make sure the cooling path bleeds heat away faster than leakage can pour it in.
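
The feedback loop can be made concrete with a toy iteration. The sketch below assumes, purely for illustration, that leakage power grows exponentially with temperature (10 W at 45 C, multiplying by e every 30 C); those constants, the 100 W dynamic load and the function names are hypothetical. The loop gain is estimated as θ times the slope of leakage power with respect to temperature.

```python
import math


def leakage_w(temp_c, leak_ref_w=10.0, ref_c=45.0, scale_c=30.0):
    """Toy leakage model (assumption): exponential growth with temperature."""
    return leak_ref_w * math.exp((temp_c - ref_c) / scale_c)


def settle(theta_ja, dynamic_w, ambient_c, scale_c=30.0,
           limit_c=200.0, max_iter=1000, tol=1e-6):
    """Iterate T -> Ta + theta * (P_dyn + P_leak(T)).

    Returns (steady_temp_c, loop_gain) if the loop settles, or None if
    the temperature runs past limit_c (treated here as runaway).
    """
    temp = ambient_c
    for _ in range(max_iter):
        power = dynamic_w + leakage_w(temp, scale_c=scale_c)
        new_temp = ambient_c + theta_ja * power
        if new_temp > limit_c:
            return None
        if abs(new_temp - temp) < tol:
            # Loop gain = theta * d(P_leak)/dT at the operating point.
            gain = theta_ja * leakage_w(new_temp, scale_c=scale_c) / scale_c
            return new_temp, gain
        temp = new_temp
    return None


for theta in (0.25, 0.6):
    result = settle(theta_ja=theta, dynamic_w=100.0, ambient_c=45.0)
    if result is None:
        print(f"theta = {theta} K/W: runaway")
    else:
        temp, gain = result
        print(f"theta = {theta} K/W: settles at {temp:.1f} C, loop gain {gain:.2f}")
```

Under these assumptions the well-cooled path settles with a loop gain well below one, while the poorly cooled path has no stable operating point at all.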

Electromigration: a wind that blows atoms

Push enough current through a metal wire and the electrons stop being a polite fluid and start acting like a sandblaster. Each conduction electron carries momentum, and when it scatters off a metal atom it gives that atom a tiny shove in the direction of electron flow. One collision does nothing. A current density of millions of amps per square centimetre, sustained for years, does something profound: it slowly pushes copper atoms downstream. This is electromigration — the ‘electron wind’ literally relocating the wire.

Where atoms pile up you get a hillock that can short to a neighbour; where atoms drain away you get a void that thins the wire until it opens. Both are death, and both are gradual. The famous summary is Black's equation, which tells you the median time-to-failure of a wire and reveals the two knobs that matter most:

Black's equation (median time-to-failure, here written MTTF):

            A
   MTTF = ----- * exp( Ea / (k * T) )
           J^n

   J  = current density   (A/cm^2)   -- you control this in layout
   n  ~ 2                              -- lifetime scales as 1 / J SQUARED
   T  = absolute temperature (K)       -- shared with the thermal section!
   Ea = activation energy (eV)         -- material property (Cu > Al)
   k  = Boltzmann's constant           -- 8.617e-5 eV/K when Ea is in eV

Two lessons fall straight out:
  1) DOUBLE the current density  -> ~4x SHORTER life  (the J^2)
  2) HOTTER metal               -> exponentially shorter life (the exp term)

=> Electromigration is a thermal problem wearing an electrical mask.
   The hot spot you failed to cool is also the wire you are about to lose.

Black's equation ties wire lifetime to current density (squared) and temperature (exponentially). This is why thermal and EM cannot be designed separately.
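
Because the prefactor A is rarely known, Black's equation is most useful as a ratio between two operating points, where A cancels. The following Python sketch computes that ratio; the function name is invented here, and the activation energy of 0.9 eV and the 85 C and 105 C metal temperatures are illustrative assumptions, not values from a foundry deck.

```python
import math

K_BOLTZMANN_EV = 8.617333262e-5  # Boltzmann's constant in eV/K


def mttf_ratio(j_ref, temp_ref_c, j_new, temp_new_c, ea_ev=0.9, n=2.0):
    """MTTF(new) / MTTF(ref) from Black's equation; the prefactor A cancels.

    ea_ev and n are assumed illustrative values, not measured data.
    """
    t_ref_k = temp_ref_c + 273.15
    t_new_k = temp_new_c + 273.15
    current_term = (j_ref / j_new) ** n
    thermal_term = math.exp(ea_ev / K_BOLTZMANN_EV * (1.0 / t_new_k - 1.0 / t_ref_k))
    return current_term * thermal_term


# Lesson 1: double the current density at the same temperature.
print(f"2x J, same T: life x {mttf_ratio(1.0, 85.0, 2.0, 85.0):.2f}")

# Lesson 2: same current density, metal 20 C hotter.
print(f"same J, +20 C: life x {mttf_ratio(1.0, 85.0, 1.0, 105.0):.2f}")

# Both at once: the penalties multiply.
print(f"2x J and +20 C: life x {mttf_ratio(1.0, 85.0, 2.0, 105.0):.3f}")
```

With these assumed values, 20 C of extra metal temperature costs about as much lifetime as doubling the current density, and the two penalties multiply when they occur together.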

In packaging the worst offenders are the smallest conductors carrying the biggest currents: the solder bumps and micro-bumps under a flip-chip. As pitch shrinks toward a few microns for hybrid bonding and dense 3D stacks, each joint's cross-section shrinks with it, so unless the current per joint falls just as fast, current density climbs exactly where Black's J² punishes you hardest. The power-delivery network is the front line: the bumps and TSVs feeding the core carry amps continuously and never get a rest. EM-aware design widens those nets, spreads current across many parallel bumps, and caps the DC density per joint.
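
Capping the DC density per joint comes down to a quick budget calculation. The Python sketch below assumes a circular bump cross-section and perfectly even current sharing, which is optimistic because real joints suffer current crowding. The 100 A supply, 25 µm bump diameter and 1e4 A/cm² limit are placeholder numbers, and the function names are hypothetical.

```python
import math


def bump_area_cm2(diameter_um):
    """Cross-section of a circular joint, converted from microns to cm^2."""
    radius_cm = diameter_um * 1e-4 / 2.0
    return math.pi * radius_cm ** 2


def current_density_a_per_cm2(total_current_a, n_bumps, diameter_um):
    """Average DC current density per joint, assuming even sharing."""
    return total_current_a / n_bumps / bump_area_cm2(diameter_um)


def bumps_needed(total_current_a, diameter_um, j_limit_a_per_cm2):
    """Minimum number of parallel joints that keeps J at or below the limit."""
    per_bump_limit_a = j_limit_a_per_cm2 * bump_area_cm2(diameter_um)
    return math.ceil(total_current_a / per_bump_limit_a)


# Placeholder numbers, not a real design rule.
SUPPLY_A = 100.0
DIAMETER_UM = 25.0
J_LIMIT = 1.0e4  # A/cm^2

n = bumps_needed(SUPPLY_A, DIAMETER_UM, J_LIMIT)
j = current_density_a_per_cm2(SUPPLY_A, n, DIAMETER_UM)
print(f"{n} bumps -> {j:.0f} A/cm^2 per bump (limit {J_LIMIT:.0f})")
```

Because area goes with the square of the diameter, halving the joint diameter at the same current per joint quadruples the density, and with n near 2 Black's equation turns that into roughly a sixteenfold lifetime penalty.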

CTE mismatch: the crack underfill was fighting

Back on rung 2 you met underfill — the epoxy syringed under a flip-chip die — and were told it ‘improves reliability.’ Now you can see the enemy it was fighting. Silicon expands when heated by about 2.6 parts per million per degree; the organic substrate beneath it expands by 15–17 ppm/°C, roughly six times as much. Every time the chip powers up and warms, the substrate stretches farther than the die it is bonded to. The solder bumps in between get sheared. This is CTE mismatch — a difference in coefficient of thermal expansion — and it is mechanical, relentless, and built into the materials themselves.

A single warm-up bends the joint a little; it springs back on cool-down. But a chip lives by cycling: on, off, idle, burst, sleep, wake — thousands of times. Each cycle is one tug on a paperclip. Bend a paperclip once and nothing happens; bend it back and forth a few hundred times and it snaps. Solder does exactly this. The damage accumulates as low-cycle fatigue, and the bumps at the *corners* of the die — farthest from the neutral centre, so they travel the most per degree — crack first. The classic failure is a corner-ball open after a few thousand power cycles.
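
The reason the corners go first can be estimated to first order: the relative displacement a joint must absorb is the CTE difference times the temperature swing times the joint's distance from the neutral point (DNP). The Python sketch below uses the 2.6 ppm/°C figure for silicon and a mid-range 16 ppm/°C for the substrate; the 20 mm die and 60 °C swing are assumed for illustration, and the estimate ignores warpage, underfill and joint compliance.

```python
import math


def distance_from_neutral_point_mm(x_mm, y_mm):
    """Distance of a joint from the die centre (the neutral point)."""
    return math.hypot(x_mm, y_mm)


def shear_displacement_um(dnp_mm, delta_t_c, cte_die_ppm=2.6, cte_substrate_ppm=16.0):
    """First-order relative displacement across a joint, in microns.

    displacement = (CTE_substrate - CTE_die) * delta_T * DNP
    Ignores warpage, underfill and joint compliance (assumption).
    """
    delta_cte_per_c = (cte_substrate_ppm - cte_die_ppm) * 1e-6
    return delta_cte_per_c * delta_t_c * dnp_mm * 1e3  # mm -> um


# Hypothetical 20 mm x 20 mm die, 60 C swing between sleep and full load.
HALF_EDGE_MM = 10.0
DELTA_T_C = 60.0

positions = {
    "centre": (0.0, 0.0),
    "edge midpoint": (HALF_EDGE_MM, 0.0),
    "corner": (HALF_EDGE_MM, HALF_EDGE_MM),
}
for name, (x, y) in positions.items():
    dnp = distance_from_neutral_point_mm(x, y)
    print(f"{name:13s}: DNP {dnp:5.2f} mm -> {shear_displacement_um(dnp, DELTA_T_C):5.2f} um")
```

The centre joint sees no displacement at all, and the corner joint sees about 1.4 times (the square root of two) what an edge-midpoint joint does, on every single cycle.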

Underfill defeats this by gluing the die and substrate into one body, so the shear that used to concentrate on a few corner balls is now smeared across the whole epoxy layer. It can extend bump fatigue life by 10× or more. The cost is that underfill makes rework nearly impossible — once it cures, a defective die is permanently entombed — which is why known-good-die screening before assembly matters so much in expensive 3D and 2.5D builds. You cannot fix what you have already glued in.

Qualification: proving years in weeks

Here is the impossible-sounding deadline of the whole industry: prove a part survives ten years, but ship in eighteen months. You cannot wait a decade to find out. Reliability qualification solves this by accelerated stress — you turn the temperature, voltage, current, and humidity knobs past anything the field will ever see, lean on the same Arrhenius exp(Ea/kT) physics from Black's equation, and compress years of ageing into weeks. The governing idea is the acceleration factor: how many field-hours one test-hour buys you.

Temperature-cycling (TC). Slam parts between −55 °C and +125 °C, hundreds to thousands of times, to attack exactly the CTE-fatigue cracks in the solder. A pass might be ‘1000 cycles, zero opens’ — that is the bench proxy for years of your phone heating and cooling in a pocket.
HTOL (High-Temperature Operating Life). Run the part *powered and switching* at high temperature and voltage for ~1000 hours. This ages transistors and exercises electromigration in the metal and bumps under real current — it is the EM and wear-out clock turned to fast-forward.
Burn-in. Briefly bake every shipping part under stress to provoke ‘infant mortality’ — the weak units that would have died in the first weeks fail here, on your line, not at the customer. This trims the early bathtub-curve hump.
Environmental & moisture (THB / HAST / uHAST). Combine heat and humidity, with electrical bias in THB and HAST and without it in uHAST, to drive corrosion and other moisture-induced failures. Moisture is also behind the dreaded popcorn crack: trapped moisture flashing to steam during board reflow and splitting the package open.
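
For temperature-driven wear-out, the acceleration factor described above follows directly from the Arrhenius term. The Python sketch below is a simplified estimate that accounts for temperature only and ignores voltage and current acceleration. The 0.7 eV activation energy and the 55 C use and 125 C stress temperatures are assumed for illustration; real qualification plans take these from the relevant failure mechanism and standard.

```python
import math

K_BOLTZMANN_EV = 8.617333262e-5  # Boltzmann's constant in eV/K
HOURS_PER_YEAR = 8760.0


def arrhenius_acceleration_factor(use_temp_c, stress_temp_c, ea_ev):
    """Field-hours bought per test-hour, from temperature alone."""
    use_k = use_temp_c + 273.15
    stress_k = stress_temp_c + 273.15
    return math.exp(ea_ev / K_BOLTZMANN_EV * (1.0 / use_k - 1.0 / stress_k))


def equivalent_field_years(test_hours, acceleration_factor):
    """Convert stressed test time into equivalent years at use conditions."""
    return test_hours * acceleration_factor / HOURS_PER_YEAR


# Assumed values for illustration only.
af = arrhenius_acceleration_factor(use_temp_c=55.0, stress_temp_c=125.0, ea_ev=0.7)
years = equivalent_field_years(test_hours=1000.0, acceleration_factor=af)
print(f"acceleration factor ~ {af:.0f}")
print(f"1000 h of stress ~ {years:.1f} field-years")
```

Under these assumptions a 1000-hour run stands in for most of a decade in the field. The result is very sensitive to the assumed activation energy, which is one reason qualification plans state it explicitly.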

The frontier: thermal and reliability are the new ceiling

Now zoom out and let every rung of this track click into place. We adopted advanced packaging because plain transistor scaling stopped delivering — once Dennard scaling ended, power per square millimetre kept rising even as logic shrank, and the answer was to spread the system across chiplets and stack memory close in 3D. But that very move concentrates heat into smaller, taller volumes and crams more current through finer joints. We escaped one wall and walked straight into another.

That new wall is thermal-and-reliability co-limiting. You could stack four logic dies high — the bonding exists, hybrid bonding gives you the pitch — but you cannot cool the third one or guarantee its bumps survive the current. The honest constraint on a 2026-era design is no longer ‘how many transistors fit,’ it's ‘how many watts can we get out, and will the joints last ten years at that current density.’ Thermal sets how tall you can stack; electromigration sets how much current each tiny joint can carry; fatigue sets how many cycles before the corners crack. These three now decide the architecture *before* the first transistor is placed.

So the field is racing on the cooling and reliability front as hard as on the transistor front: microfluidic channels etched *into* the silicon to pump coolant millimetres from the hot spot; backside power delivery to unclog the PDN and shorten the current's path; new low-CTE substrate materials to relax the fatigue; and physics-based EM and thermal sign-off built into the design-for-manufacturability flow so a corner-ball failure is caught in simulation, not in a customer's car. The mastery view of this whole track is simple to say and hard to do: making a chip work is now the easy half — making it survive is the frontier.