Beyond Moore: Specialization & What's Next

When general-purpose stops scaling

For most of computing history you got a free lunch. Every couple of years the transistor shrank, Moore's Law doubled how many you could pack onto a die, and — this is the part people forget — Dennard scaling meant each of those smaller transistors also drew *less* power. Faster *and* cooler, for free, just by waiting. A general-purpose CPU could ride that wave: keep the same design, shrink it, clock it higher, and last year's chip felt slow. You did not have to be clever. Physics was clever for you.

Around 2006, half of that free lunch ended. Dennard scaling broke: below a certain size, supply voltage can no longer drop in step with the shrink (leakage climbs too fast), so a smaller transistor no longer draws proportionally less power. Transistors kept shrinking, and Moore's Law limped on, but power per square millimetre, which Dennard scaling had held roughly constant, began to climb with each node. That is the power wall, and it has a brutal consequence. You can still fit billions of transistors on a die, but you can no longer afford to switch all of them at once: you would melt the chip. The fraction you must leave dark at any instant is dark silicon, and it grows with every node.
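The arithmetic behind the power wall can be sketched with the textbook dynamic-power relation P = C·V²·f. What follows is a toy model, not measured data: the 0.7 linear shrink per node, the function names, and the simplification that post-Dennard voltage and clock stay completely flat are all assumptions made for illustration, and leakage power is ignored.

```python
# Toy scaling model built on dynamic power P = C * V^2 * f.
# Every number here is an illustrative assumption, not measured data.

S = 0.7  # assumed linear shrink per node, so area scales by S*S (about 0.5)


def power_density_after(nodes: int, voltage_scales: bool) -> float:
    """Relative power density (power / area) after `nodes` shrinks; start = 1.0.

    Ideal Dennard rules per node: C *= S, V *= S, f *= 1/S, area *= S*S.
    Post-Dennard (simplified assumption): C and area still shrink,
    but V and f stay flat.
    """
    c = v = f = area = 1.0
    for _ in range(nodes):
        c *= S
        area *= S * S
        if voltage_scales:
            v *= S
            f /= S
    power_per_transistor = c * v * v * f
    return power_per_transistor / area


def lit_fraction(nodes: int) -> float:
    """Share of the die that can switch at once within the original power budget."""
    return min(1.0, 1.0 / power_density_after(nodes, voltage_scales=False))


for n in range(5):
    dennard = power_density_after(n, voltage_scales=True)
    post = power_density_after(n, voltage_scales=False)
    print(f"node +{n}: Dennard density {dennard:.2f}, "
          f"post-Dennard density {post:.2f}, lit fraction {lit_fraction(n):.2f}")
```

Under these assumptions the ideal-Dennard density stays at 1.0, while the post-Dennard density grows by roughly 1.43x per node, so the share of the die that fits a fixed power budget shrinks by about 0.7x per node. That shrinking share is the dark-silicon trend in miniature.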

So the old strategy — one general-purpose engine, shrink it, clock it faster — hit a ceiling. Clock speeds flattened around the mid-2000s and have barely moved since. The industry's response was first 'add more cores,' but dark silicon limits how many you can light up too. The deeper answer is the theme of this capstone: if you can only afford to power *some* of your transistors, make the ones you do power exactly right for the job. Stop being general. Start being specialized.

Domain-specific architectures

Here is the key insight of the era. A general-purpose CPU is built to run *anything* — a web browser, a spreadsheet, a game — so it spends most of its transistors and energy on flexibility: fetching and decoding instructions, predicting branches, shuffling data around, just in case. For any *one* specific job, that overhead is almost pure waste. A domain-specific architecture (DSA) makes the opposite bet: give up the ability to run everything, and in exchange spend every transistor and every picojoule on doing *one kind* of work astonishingly well.

The clearest example is the matrix-multiply at the heart of modern AI. A CPU does it instruction by instruction, paying decode and scheduling overhead on every step. A GPU throws thousands of simple arithmetic units at it in parallel. A TPU or NPU goes further still: it hard-wires the exact dataflow of a matrix multiply into silicon, so data marches through a grid of multipliers with almost no instruction overhead at all. The metric that matters here is not raw speed but performance per watt — useful work done per unit of energy — because in a dark-silicon-limited world, energy *is* the budget.
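To make "data marches through a grid of multipliers" concrete, here is a minimal cycle-by-cycle simulation of a systolic array in plain Python. It is a sketch of one common arrangement (output-stationary, where each cell keeps one element of the result), chosen for readability; it is not a description of how any particular TPU or NPU is built, and the function name `systolic_matmul` is hypothetical.

```python
def systolic_matmul(A, B):
    """Multiply A (n x k) by B (k x m) on a simulated n x m grid of
    multiply-accumulate cells. Returns (result, cycles_used).

    Rows of A enter from the left edge and columns of B from the top edge,
    each skewed by one cycle per row/column. Every cycle each cell
    multiplies the two values passing through it, adds the product to its
    own accumulator, then hands the A value right and the B value down.
    """
    n, k, m = len(A), len(B), len(B[0])
    acc = [[0] * m for _ in range(n)]    # one accumulator per cell
    a_reg = [[0] * m for _ in range(n)]  # A values currently held in cells
    b_reg = [[0] * m for _ in range(n)]  # B values currently held in cells

    cycles = n + m + k - 2               # time for the last operands to arrive
    for t in range(cycles):
        new_a = [[0] * m for _ in range(n)]
        new_b = [[0] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                if j == 0:               # left edge: feed row i, delayed by i
                    s = t - i
                    new_a[i][j] = A[i][s] if 0 <= s < k else 0
                else:                    # interior: take from left neighbour
                    new_a[i][j] = a_reg[i][j - 1]
                if i == 0:               # top edge: feed column j, delayed by j
                    s = t - j
                    new_b[i][j] = B[s][j] if 0 <= s < k else 0
                else:                    # interior: take from upper neighbour
                    new_b[i][j] = b_reg[i - 1][j]
        a_reg, b_reg = new_a, new_b
        for i in range(n):
            for j in range(m):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc, cycles


result, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result, cycles)  # [[19, 22], [43, 50]] 4
```

Notice what is absent from the inner loop: no instruction fetch, no decode, no branch prediction. Each cell only ever talks to its immediate neighbours, which is why this style of design spends so little energy on anything other than the arithmetic itself.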

Everything you learned on the digital and physical-design rungs still applies inside a DSA — it is still standard cells and place-and-route on a process node. What changes is the *architecture*: instead of one big flexible engine, you design many lean, purpose-built ones. Which raises the obvious next question — once you have a dozen specialized blocks, how do you design and connect them without each one fighting the others for power, area, and the wires between them?

System-technology co-optimization

On the rungs below, the layers were designed in sequence and mostly in isolation: the system architects chose what to build, handed it to logic designers, who handed it to physical designers, who handed it to the fab. Each layer treated the one below as a fixed menu. That worked while shrink was doing the heavy lifting. It stops working when the gains have to come from how the layers *fit together*.

System-technology co-optimization (STCO) is the discipline of designing the system, the chip, *and* the manufacturing technology and package together, as one negotiation, so that decisions at one layer reshape the others. A famous example is backside power delivery: by moving the power wires to the *back* of the wafer, you free up the precious front-side metal layers for signals — but that only pays off if the architects and physical designers *know* it is coming and route to exploit it. The technology choice and the design choice only make sense made together.

OLD: design as a relay race (each layer optimized alone)

  System ──▶ Logic ──▶ Physical ──▶ Package ──▶ Fab
  ("here")   ("ok")    ("fine")     ("sure")    (build)
   one-way hand-offs; each layer takes the one below as fixed

STCO: design as one round table (co-optimize all at once)

        ┌─────────── System ───────────┐
        │                              │
     Package ◀── co-design ──▶  Logic / arch
        │                              │
        └──────── Process tech ────────┘
   every arrow is two-way: a package or process choice
   reshapes the architecture, and vice-versa

STCO replaces the one-way relay race with a round table. Backside power, chiplet partitioning, and memory placement are decisions no single layer can make well alone.
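The difference between the relay race and the round table can be shown with a toy search. Everything in this sketch is hypothetical: the option names, the `SCORE` table, and its figure-of-merit numbers are invented purely to illustrate the shape of the problem and say nothing about real backside-power gains.

```python
from itertools import product

# Hypothetical figure of merit (higher is better) for each
# (process option, floorplan option) pair. Invented numbers.
SCORE = {
    ("frontside_power", "baseline_floorplan"): 1.00,
    ("frontside_power", "signal_dense_floorplan"): 0.90,  # signals fight power for metal
    ("backside_power", "baseline_floorplan"): 0.97,       # extra cost, benefit unused
    ("backside_power", "signal_dense_floorplan"): 1.20,   # freed metal is exploited
}
PROCESSES = ["frontside_power", "backside_power"]
FLOORPLANS = ["baseline_floorplan", "signal_dense_floorplan"]


def relay_race(assumed_process="frontside_power"):
    """Each layer optimizes alone, treating the layer below as a fixed menu."""
    # Architects pick a floorplan against the process they assume they will get.
    floorplan = max(FLOORPLANS, key=lambda f: SCORE[(assumed_process, f)])
    # The process team then picks the best process for that frozen floorplan.
    process = max(PROCESSES, key=lambda p: SCORE[(p, floorplan)])
    return process, floorplan


def round_table():
    """Co-optimization: evaluate process and floorplan choices as pairs."""
    return max(product(PROCESSES, FLOORPLANS), key=lambda pair: SCORE[pair])


print("relay race :", relay_race(), SCORE[relay_race()])
print("round table:", round_table(), SCORE[round_table()])
```

With this table the relay race settles on frontside power with the baseline floorplan, because neither team sees a gain from moving alone. The joint search finds the backside-power, signal-dense pairing that only pays off when both choices change together.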

STCO is the natural successor to design-technology co-optimization (DTCO), which co-tuned just the cell library and the process. STCO widens the table to the whole system and the package. And once the *package* has a seat at that table, a new question opens up: what if the package itself — not just the transistor — becomes the place where the value is created?

"More than Moore"

For decades, 'progress' meant one thing: shrink the transistor, the 'More Moore' path. 'More than Moore' is the complementary idea that you can deliver more value *without* a smaller transistor — by integrating things cleverly. If you cannot make the bricks smaller, build a better building. The center of gravity moves from the transistor to the package, and this is where every packaging idea from the rungs below pays off.

The pivotal move is to stop building one huge monolithic die and instead build several small ones, called chiplets, and join them in one package. The wins are concrete. Yield: a defect ruins only one small die, not a giant one, so you assemble products from known-good die and throw away far less silicon. Mixing nodes: dense logic can use a bleeding-edge process node while I/O or analog stays on a cheaper, mature one, with each block on the node that suits it. The chiplets talk to each other over a die-to-die link such as UCIe (Universal Chiplet Interconnect Express), an open interconnect standard, so chiplets from different vendors can be designed to interoperate. This is heterogeneous integration, and it is the heart of modern advanced packaging.

MONOLITHIC: one giant die        CHIPLETS: many small dies, one package
  ┌───────────────────────┐       ┌──────┐ ┌──────┐ ┌──────┐
  │ CPU   GPU   I/O   SRAM│       │ CPU  │ │ GPU  │ │ I/O  │
  │   all on ONE node;    │       │3nm   │ │3nm   │ │mature│
  │   one defect = whole  │       └──┬───┘ └──┬───┘ └──┬───┘
  │   die scrapped        │       ═══╪════UCIe═╪════════╪═══   <- die-to-die
  └───────────────────────┘       ───┴─── interposer ──┴───   links
   big = lower yield               known-good die; mix nodes; better yield

The shift in one picture: from a single shrinking die that must be perfect on one node, to many small known-good dies — each on the node that suits it — linked by an interposer and UCIe.
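The yield argument can be put into numbers with the Poisson yield model, Y = exp(-A·D0), a standard first-order approximation in which A is die area and D0 is defect density. The sketch below is illustrative only: the 800 mm² design, the defect density of 0.1 per cm², and the function names are assumptions, and it ignores packaging and assembly yield, test cost, and the extra area that die-to-die interfaces consume.

```python
import math


def die_yield(area_mm2: float, defects_per_cm2: float) -> float:
    """Poisson yield model: probability that a die of this area has zero defects."""
    return math.exp(-(area_mm2 / 100.0) * defects_per_cm2)


def silicon_per_good_product(total_area_mm2: float, n_dies: int,
                             defects_per_cm2: float) -> float:
    """Average silicon area (mm^2) fabricated per working product.

    Assumes the design splits evenly into `n_dies` dies, each die is tested
    before assembly (known-good die), and assembly itself never fails.
    """
    die_area = total_area_mm2 / n_dies
    good_fraction = die_yield(die_area, defects_per_cm2)
    return n_dies * die_area / good_fraction


TOTAL_AREA = 800.0  # mm^2, hypothetical design
D0 = 0.1            # defects per cm^2, hypothetical process

for n in (1, 2, 4, 8):
    per_die = die_yield(TOTAL_AREA / n, D0)
    cost = silicon_per_good_product(TOTAL_AREA, n, D0)
    print(f"{n} die(s): per-die yield {per_die:.2f}, "
          f"silicon per good product {cost:.0f} mm^2")
```

Under these assumptions the single 800 mm² die yields a little under half the time, while each of four 200 mm² chiplets yields around 80 percent of the time, so far less silicon is scrapped per working product. In practice the saving is partly offset by the costs the sketch leaves out.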

A glimpse beyond CMOS

Everything so far still rests on the silicon MOSFET switched in CMOS. FinFET (from around the 22nm node) and gate-all-around nanosheets (around 3nm and below) are clever new *shapes* for that same switch, wrapping the gate ever more tightly around the channel to keep control as it shrinks; CFET, which stacks the n-type and p-type transistors of a CMOS pair on top of each other, is the next shape on the research horizon. But all of them are still silicon CMOS. The honest long-range question is what comes *after* the silicon transistor itself runs out of shapes.

That research frontier is beyond CMOS: devices that compute without a conventional silicon channel. Three families are worth knowing by name. Carbon nanotubes and 2D materials (atom-thin sheets like molybdenum disulfide) are the first two: both offer channels just one or a few atoms thick, a scale at which silicon's electrostatics fall apart. The third, spintronics, encodes information not in charge moving through a channel but in the electron's *spin*, potentially switching with far less energy, a direct assault on the power wall. None of these is in your phone today; they live in labs and pilot lines.

The shape of the future

Step back and the arc is clear. The future of chips is not one general engine getting forever smaller and faster. It is many specialized blocks — each a domain-specific architecture tuned for performance per watt — co-designed with their package and process through co-optimization, and integrated by packaging as chiplets rather than fused into one shrinking die. Transistors keep evolving in shape, and someday perhaps in physics, but they are no longer the *only* lever. The lever is increasingly *how you put the pieces together*.

And every piece of this traces straight back down the ladder you climbed. The same transistor and CMOS logic from the bottom rung. The same RTL, standard cells, and place-and-route from the digital and physical rungs. The analog rung's bandwidth and signal-integrity limits, now answered by HBM and short interposer links. Photolithography pushed to its frontier with EUV. The frontier is not a different field — it is the *same* field, refusing to stop at the limit of any one rung and reaching for the next.

You started at a single switch and ended at the edge of what silicon can do. The chips of the next decade will be assembled, co-designed, and specialized far more than they are simply shrunk — and you now have the map to follow them, limit by limit, all the way back to the transistor where you began.