From a sketch in your head to a wafer: the flow
In the previous rungs you designed an architecture: a pipeline that forwards results, a cache hierarchy, maybe an accelerator. You wrote it, ultimately, as Verilog — a few tens of thousands of lines describing *behaviour*: when the clock ticks, this register loads that sum. What you did not write is a single transistor, a single wire, a single rectangle of metal. Yet a modern SoC has tens of billions of transistors and dozens of metal layers stacked like a city's roads. The gap between your behavioural description and that physical reality is bridged entirely by software, and the umbrella name for that software is EDA — electronic design automation.
The flow is a relay of stages, each one lowering your design to a more physical form. Think of it as compiling, except the target is not machine code — it is a photomask set that a fab will print with EUV light.
- Logic synthesis — translate your RTL into a netlist of generic boolean gates, optimised for area, speed and power.
- Technology mapping — replace those generic gates with the real cells in the foundry's standard-cell library (this exact NAND4, that drive-strength-2 flip-flop).
- Partitioning & floorplanning — chop a huge design into manageable blocks and decide where each block lives on the die.
- Placement — assign every single cell an exact (x, y) location inside its block.
- Clock tree & routing — build the clock distribution, then connect every pin with actual metal wires across the layer stack.
- Sign-off — verify timing, power, and that the layout obeys every fab rule, before taping out.
Logic synthesis: turning behaviour into gates
Start at the top. Logic synthesis is the compiler of the hardware world. It reads your RTL, figures out what boolean function each chunk computes, and emits a combinational network of gates — AND, OR, NOT, XOR, multiplexers — plus flip-flops for state. Crucially it does this while *minimising* a cost you choose: fewest gates (area), shortest critical path (speed), or least switching (power). The same RTL `assign y = a & b | a & c;` can become two gates or one, fast or slow, depending on what you ask for.
RTL (what you wrote): One synthesised form (what you get):
assign y = a & b | a & c; a --|&|---\
b --| | \
|OR|-- y
a --|&|--/
c --| |
Boolean-minimised: y = a & (b | c) -> one AND + one OR, fewer transistorsHow does a tool *know* that `a & b | a & c` equals `a & (b | c)`? It needs to reason about boolean functions formally — compare two of them, prove they are identical, or find a cheaper one. The classic data structure for this is the binary decision diagram (BDD): a compressed, canonical graph that represents a boolean function as a cascade of "if this input is 1 go here, else go there" decisions. Reduce it to its canonical form and two functions are equal if and only if their BDDs are the identical graph — equivalence checking becomes a pointer comparison.
The SAT solver: the workhorse that proves things
Underneath an enormous amount of EDA — equivalence checking, formal property verification, even some routing and test-pattern generation — sits one deceptively simple question. Given a boolean formula, is there *any* assignment of true/false to its variables that makes the whole thing true? That is Boolean satisfiability, or SAT, and a program that answers it is a SAT solver. SAT was the very first problem ever proven NP-complete (Cook, 1971), so in the worst case it is hopeless. The plot twist of the last 25 years is that modern solvers crush *real-world* instances with millions of variables in seconds.
How do you use a yes/no satisfiability oracle to *prove* two circuits are equal? You build a miter: glue the two circuits to the same inputs, XOR their outputs together, and ask SAT "can this XOR ever be 1?" If the solver says *unsatisfiable*, no input can make them differ — they are provably equivalent. If it returns a satisfying assignment, that assignment is a concrete counterexample showing exactly which inputs break them apart. This trick — turning a verification question into a satisfiability question — is the beating heart of formal equivalence checking.
golden circuit --+ (your RTL) |---[XOR]--- miter_out ask SAT: "miter_out = 1 ?" revised circuit --+ (after synthesis) UNSAT -> outputs can never differ -> circuits are EQUIVALENT (proof) SAT -> here is an input vector that makes them differ -> BUG, with witness
Onto silicon: partition, then place
Now you have a netlist — millions of mapped gates and the wires between them. The next job is brutally physical: give every gate an (x, y) coordinate on a flat rectangle of silicon. But you cannot optimise millions of objects at once, so the tool first does circuit partitioning: cut the netlist into clusters that talk a lot internally but little across the cut. Minimising the wires crossing a partition boundary is exactly min-cut, and the classic heuristic — Kernighan–Lin, later Fiduccia–Mattheyses — repeatedly swaps nodes between two halves to shrink the cut. Fewer cross-cut wires means less long routing later, which means less delay and less power.
With blocks shaped during floorplanning, the tool runs the placement algorithm to nail down exact cell positions. Modern placers are mostly analytic: they model every wire as a spring pulling its connected cells together (minimise total spring energy, i.e. wirelength) and solve a giant quadratic optimisation — then they *legalise*, nudging the overlapping, spread-out result onto the legal cell grid. A connected cousin is the older, more intuitive approach below: treat placement as cooling metal.
Simulated annealing: borrowing physics to escape bad answers
How do you optimise something with billions of arrangements and no formula for the best one? One of the most beautiful answers in all of EDA comes straight from metallurgy. When a blacksmith anneals metal, they heat it until atoms jiggle freely, then cool it *slowly*; the atoms settle into a low-energy, near-perfect crystal. Cool it too fast (a quench) and the atoms freeze into a messy, high-energy state full of defects. Simulated annealing is that physics turned into an algorithm.
- Define an energy (cost) for the current solution — for placement, total wirelength plus congestion plus timing penalty.
- Make a small random move — swap two cells, shift one a little.
- If the move lowers energy, always accept it. If it raises energy, accept it anyway with probability e^(-ΔE/T).
- Slowly lower the temperature T according to a cooling schedule, so uphill moves get rarer over time.
- Stop when T is near zero and the solution stops improving — it has 'frozen' into a good local optimum.
That single line — *accept some moves that make things worse* — is the whole trick. A greedy optimiser that only ever steps downhill gets trapped in the first valley it stumbles into. By tolerating the occasional uphill step while it is still 'hot', annealing can climb out of a mediocre valley and find a deeper one. As it cools, it commits. It is a search that is deliberately, controllably bad early so it can be excellent late.
accept_prob(dE, T) = (dE <= 0) ? 1.0 : exp(-dE / T)
T high (hot): exp(-dE/T) ~ 1 -> almost any move accepted (explore widely)
T low (cold): exp(-dE/T) ~ 0 -> only improving moves (refine, commit)
cost landscape: you are here
v
\ good /----\ * /\ <- greedy stops in this shallow dip
\ (deep) / \__/ \___ <- annealing's uphill jumps reach the deep one
\__valley_/Routing, and the timing graph that judges it all
Cells are placed; now connect them. The routing algorithm must lay physical metal for every net across a dozen layers, obeying spacing rules, dodging blockages, and never letting two unrelated nets short. At its core sits a shortest-path search on a 3-D grid — Lee's maze router, or A* — wrapped in global/detailed phases and the classic rip-up and reroute: when nets fight over the same track, tear out the loser and try another path. Tools route over a billion connections this way; the miracle is that it converges at all.
But a routed chip that is *connected* is not a chip that *works*. It must also be fast enough: every signal launched by a flip-flop must arrive at the next flip-flop before the clock edge that captures it. Checking this for billions of paths, without simulating a single waveform, is the job of static timing analysis — and it rests on one elegant idea, the timing graph.
Model the whole circuit as a directed graph: every gate pin is a node, every gate delay and wire delay is a weighted edge. The longest path from a flip-flop output to the next flip-flop input — the critical path — sets the fastest clock the chip can run. Finding the longest path in a general graph is NP-hard, but a timing graph is a DAG (no combinational loops), and in a DAG longest-path is solvable in linear time by a single topological-order sweep. That is why STA can grade an entire chip in minutes instead of millennia.
FF1/Q --[g1: 0.12ns]--> a --[wire: 0.05]--> b --[g2: 0.20]--> c --[wire: 0.08]--> D/FF2
Arrival time at each node = max over incoming edges (longest path so far):
AT(a)=0.12 AT(b)=0.17 AT(c)=0.37 AT(D)=0.45 ns
Required time at D = clock_period - setup = 1.00 - 0.10 = 0.90 ns
SLACK = Required - Arrival = 0.90 - 0.45 = +0.45 ns -> PASS (timing met)
(negative slack = critical path too slow = the chip fails at this frequency)Step back now and see the whole arc of this track. You learned to *think* like an architect — pipelines, hazards, caches, parallelism, accelerators. This final rung shows the machinery that turns that intent into atoms. Every architectural choice you make lands, eventually, as a knob the EDA tools must satisfy: a deeper pipeline shortens the critical path the timing graph measures; a wider cache becomes a giant macro the floorplanner must place; an accelerator's regular dataflow makes the router's job easy or impossibly congested. Architecture and EDA are not two worlds — they are two ends of one continuous lowering. And notice the deep unity in the toolbox: synthesis and verification by BDDs and SAT solvers, partitioning by min-cut heuristics, placement and floorplanning by simulated annealing and analytic solvers, routing by shortest-path search with rip-up, timing by longest-path on a DAG. None finds the perfect answer; all find one good enough to print and fast enough to iterate — which is the only reason a billion-transistor design is buildable by a team of humans at all.