Clock Tree Synthesis

One clock, millions of flip-flops

Picture a stadium doing the wave. If everyone stood up at the exact same instant, you wouldn't see a wave at all — you'd see one clean, unified motion. That instant-everywhere snap is what a clock is supposed to be inside a chip: a single tick that reaches every flip-flop at the same moment, so they all launch and capture data together. A modern block can hold millions of flip-flops, and on every clock edge each one of them needs the beat.

A better picture than the wave is a lawn sprinkler. You want the water to hit every plant at the same time — the corner of the bed shouldn't get soaked a full second after the plants by the tap. If it does, the garden grows unevenly. The clock has the same job: it must spray its edge onto a sea of flip-flops so evenly that, as far as the logic is concerned, they all hear the tick simultaneously.

Why the clock can't just be a wire

The obvious idea is the wrong one: just run one fat wire from the clock source and tap every flip-flop off it. It fails for two reasons, and both are pure physics. First, fan-out. One clock pin trying to drive millions of flip-flop inputs is like one person trying to push a thousand swings at once: the source simply can't deliver enough current to charge that many gate capacitances quickly. The edge that should be a crisp cliff turns into a lazy, sagging ramp, and slow edges make flip-flops unreliable, because a flop fed a sluggish edge switches later and is more exposed to noise while the edge crawls through its threshold.

Second, and more insidious, is RC delay. Every wire has resistance (R) and capacitance (C), and pushing a signal down a wire is less like flicking a light switch and more like flushing water through a long, thin hose — the far end responds noticeably later than the near end. So even if one wire *could* drive everything, the flip-flop sitting next to the clock source would hear the tick far earlier than the flip-flop in the opposite corner of the block. The single wire doesn't deliver one synchronized beat; it smears the beat across the chip. That smear has a name, and it is the enemy of this entire guide.

# A wire's delay grows roughly with the SQUARE of its length:
#   t_delay  ~  R_per_um * C_per_um * length^2
#
# Double a clock wire's length and you ~quadruple its delay.
# That is why a single source wire smears the clock edge across
# the die instead of delivering it everywhere at once.

The Elmore-delay intuition: long unbuffered wires get expensive fast. Length matters more than you'd guess because both the resistance and the capacitance you're charging grow with length.
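
To put numbers on that quadratic growth, here is a minimal Python sketch of the distributed-RC (Elmore) estimate. The per-micron resistance and capacitance are illustrative assumptions, not figures from any real process, and the function name `wire_delay_ps` is hypothetical.

```python
# Elmore-style delay estimate for a bare, uniformly distributed RC wire.
# R_PER_UM and C_PER_UM are made-up illustrative values, not process data.
R_PER_UM = 1.0        # ohms per micron (assumed)
C_PER_UM = 0.2e-15    # farads per micron (assumed)


def wire_delay_ps(length_um: float) -> float:
    """Return the approximate Elmore delay of an unbuffered wire, in picoseconds."""
    r_total = R_PER_UM * length_um   # resistance grows with length
    c_total = C_PER_UM * length_um   # so does the capacitance being charged
    # For a distributed line the Elmore delay is about half of R_total * C_total.
    return 0.5 * r_total * c_total * 1e12


for length in (500, 1000, 2000):
    print(f"{length:>5} um bare wire: {wire_delay_ps(length):6.1f} ps")
```

With these assumed values, each doubling of length multiplies the delay by four, which is the whole argument against a single long clock wire.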

Skew & insertion delay

Two numbers describe how a real clock arrives, and you need both. The first is insertion delay (also called clock latency): how long the tick takes to travel from the clock source all the way to a given flip-flop's clock pin. If the source generates an edge at time zero and a particular flop sees it 320 picoseconds later, that flop's insertion delay is 320 ps. Some latency is unavoidable — the signal has to physically get there — and on its own it isn't harmful, because the launching and capturing flops both ride the same delayed clock.

The number that *is* harmful is skew: the spread in arrival times across all the endpoints. Formally:

# Skew = how UNEVEN the clock arrival is across all endpoints
skew = t_arrival_max - t_arrival_min

# Insertion delay (latency) = source -> one flop's clock pin
#   e.g. flop A hears the edge at 300 ps  (latency_A = 300 ps)
#        flop B hears the edge at 345 ps  (latency_B = 345 ps)
#   skew between A and B = 345 - 300 = 45 ps

Insertion delay is how late the beat arrives; skew is how much that lateness varies from flop to flop. CTS fights skew first; insertion delay it simply keeps within a budget.
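
The same arithmetic can be sketched in a few lines of Python. The arrival times for flops A and B come from the example above; `flop_C` and the function names are hypothetical additions for illustration.

```python
# Clock arrival time (insertion delay) at each flop's clock pin, in picoseconds.
# flop_A and flop_B match the example above; flop_C is an assumed extra endpoint.
arrival_ps = {"flop_A": 300, "flop_B": 345, "flop_C": 320}


def global_skew(arrivals: dict) -> float:
    """Spread between the latest and the earliest clock arrival."""
    return max(arrivals.values()) - min(arrivals.values())


def path_skew(arrivals: dict, launch: str, capture: str) -> float:
    """Signed skew for one path: positive means the capture clock is later."""
    return arrivals[capture] - arrivals[launch]


print(global_skew(arrival_ps))                     # 45
print(path_skew(arrival_ps, "flop_A", "flop_C"))   # 20
print(path_skew(arrival_ps, "flop_B", "flop_C"))   # -25
```

The signed per-path number is the one that shows up in timing checks: its sign tells you whether the capture flop hears the edge after or before the launch flop.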

Here is the practitioner's point, and it ties straight back to the timing budget. Recall the setup limit from the front-end track: the clock period must cover `t_clk-to-q + t_logic + t_setup`, adjusted by clock skew. When the launch flop and the capture flop hear the edge at slightly different times, that mismatch lands directly in the slack equation. Skew that makes the capture clock arrive *later* hands a setup path some extra time — but it steals from the next stage and bites your hold check. Uncontrolled skew is uncertainty, and uncertainty is picoseconds subtracted from every budget. That's why CTS chases low, balanced skew above almost everything else.
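
As a rough sketch of how skew lands in the slack equations, the Python below uses the standard textbook setup and hold checks. The hold time `t_hold`, the function names, and every number in the example are assumptions chosen for illustration.

```python
def setup_slack_ps(period, t_clk_to_q, t_logic_max, t_setup,
                   launch_arrival, capture_arrival):
    """Setup slack: time available minus time needed (all values in ps)."""
    skew = capture_arrival - launch_arrival   # positive = capture clock is later
    return (period + skew) - (t_clk_to_q + t_logic_max + t_setup)


def hold_slack_ps(t_clk_to_q, t_logic_min, t_hold,
                  launch_arrival, capture_arrival):
    """Hold slack: earliest data change minus the end of the hold window (ps)."""
    skew = capture_arrival - launch_arrival
    return (t_clk_to_q + t_logic_min) - (t_hold + skew)


# Assumed example path: 1000 ps period, launch clock arriving at 300 ps.
for capture_arrival in (300, 345):
    s = setup_slack_ps(1000, 80, 850, 50, 300, capture_arrival)
    h = hold_slack_ps(80, 30, 20, 300, capture_arrival)
    print(f"capture clock at {capture_arrival} ps: setup slack {s} ps, hold slack {h} ps")
```

With these numbers, a capture clock that is 45 ps late raises setup slack from 20 ps to 65 ps and cuts hold slack from 90 ps to 45 ps: the same skew term appears in both checks with opposite sign.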

Buffers & the H-tree

If one wire can't deliver the beat, you build a distribution network, and the trick is to make every path from the source to every flip-flop take the same amount of time. The textbook ideal is the H-tree. Picture the letter H. The clock enters at the center of the crossbar; the crossbar carries it to the two vertical strokes; each stroke runs up and down to its two tips. At each of the four tips you draw a smaller H, and each of those splits again into still-smaller Hs. Because the geometry is symmetric, the wire length from the center to every leaf of the fractal is identical, so with matched loads the delay to every leaf is nominally identical and skew collapses toward zero. It's the sprinkler designed so the corner plant and the tap-side plant get water at the same instant.

Symmetry handles distance, but it doesn't handle drive strength — and that's where clock buffers come in. At each branch point the tool inserts a buffer: a little amplifier that takes the weakening edge, restamps it into a fresh, sharp edge, and drives the next set of branches. Think of a relay race instead of one exhausted runner going the whole way — each buffer is a fresh runner taking the baton. The buffers also re-balance delay: if one branch is naturally a bit faster, the tool can lengthen its wire or add a buffer to slow it on purpose, dragging every leaf back into agreement. A finished clock tree is a few levels of these buffers fanning out from the root to the leaves.
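
The symmetry claim is easy to check numerically. The sketch below builds an idealized H-tree by recursion and confirms that every leaf sits the same wire length from the root. It models geometry only (no buffers, no RC), and the function name and dimensions are assumptions for illustration.

```python
def h_tree_leaves(cx, cy, half_width, half_height, levels, path_len=0.0):
    """Return (x, y, wire_length_from_root) for every leaf of an idealized H-tree."""
    if levels == 0:
        return [(cx, cy, path_len)]
    leaves = []
    for dx in (-half_width, half_width):          # out along the crossbar
        for dy in (-half_height, half_height):    # then up or down a vertical stroke
            leaves += h_tree_leaves(
                cx + dx, cy + dy,
                half_width / 2, half_height / 2,  # each nested H is half the size
                levels - 1,
                path_len + half_width + half_height,
            )
    return leaves


# Three levels of H on an assumed 800 um x 800 um top-level H.
leaves = h_tree_leaves(0.0, 0.0, 400.0, 400.0, levels=3)
lengths = {round(length, 6) for _, _, length in leaves}
print(len(leaves), lengths)   # 64 {1400.0}
```

All 64 leaves report the same 1400 um of wire, so any skew in a real H-tree comes from what the geometry ignores: unequal loads, buffer variation, and routing detours.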

# A clock-tree spec the CTS engine works to honor (vendor-neutral):
create_clock -name CLK -period 1.0 [get_ports clk]   ;# 1.0 ns = 1 GHz beat

set_clock_uncertainty 0.05 [get_clocks CLK]          ;# 50 ps budget reserved for skew+jitter
set_clock_transition  0.08 [get_clocks CLK]          ;# assume 80 ps clock edges while the clock is still ideal

# CTS targets, conceptually:
#   max_skew      <= 30 ps
#   max_latency   <= 350 ps   (insertion delay budget)
#   max_tree_depth: a handful of buffer levels root -> leaf

You declare the beat with create_clock; you reserve a slice of the period for skew and jitter with clock uncertainty; CTS then grows the buffer tree to hit the skew, latency, and transition targets you set.
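
One way to picture what "hitting the targets" means is a small checker run on measured results. The Python sketch below is not any tool's report format: the target names mirror the conceptual list above, the 80 ps transition limit is an assumed edge-rate target, and the measurements are hypothetical.

```python
# Conceptual CTS targets in picoseconds (the 80 ps transition limit is assumed).
TARGETS_PS = {"max_skew": 30.0, "max_latency": 350.0, "max_transition": 80.0}


def check_clock_tree(latencies_ps, transitions_ps, targets=TARGETS_PS):
    """Return a list of violation messages; an empty list means every target is met."""
    measured = {
        "max_skew": max(latencies_ps) - min(latencies_ps),
        "max_latency": max(latencies_ps),
        "max_transition": max(transitions_ps),
    }
    return [
        f"{name}: {value:.0f} ps exceeds {targets[name]:.0f} ps"
        for name, value in measured.items()
        if value > targets[name]
    ]


# Hypothetical post-CTS measurements at three clock pins.
violations = check_clock_tree(latencies_ps=[300, 345, 320], transitions_ps=[60, 72, 85])
for message in violations:
    print(message)
```

Here the tree meets its latency budget but misses on skew (45 ps against 30 ps) and on one slow edge, which is the kind of result that sends the tool back to resize or add buffers.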

Why CTS runs after placement

Here is a question worth pausing on: why is clock tree synthesis step three, after placement and before routing? Why not build the clock first? Because you cannot balance delays you can't measure, and you can't measure delays until you know where the flip-flops actually are. Skew is a property of geometry — distances and wire lengths — so the tool needs real cell positions on the floor before it can decide where to put buffers and how long to draw each branch. Trying to build the clock tree before placement is like trying to plan sprinkler pipe runs before you've planted the garden.

1. Floorplan first: fix the die, the macros, the I/O, and the power grid. This is the fixed landscape the rest of the flow lives on.
2. Place every standard cell, including all the flip-flops, into legal positions. Now the tool knows exactly where each clock endpoint sits.
3. Run CTS: cluster nearby flops, grow the balanced buffer tree, and insert clock buffers to hit the skew and latency targets, using the real distances placement just nailed down.
4. Then route the signal wires around the now-fixed clock network. Re-check timing afterward, because real routed wires carry the RC that paper estimates only guessed at.
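
The "cluster nearby flops" part of the CTS step only makes sense once positions exist, which a toy example makes concrete. The sketch below groups placed flops by recursive bisection until each group is small enough for one leaf buffer. This is an illustrative stand-in, not the algorithm of any particular CTS engine; the fan-out limit and the random placement are assumptions.

```python
import random


def cluster_flops(flops, max_fanout, axis=0):
    """Recursively split flop positions at the median until each cluster fits one buffer."""
    if len(flops) <= max_fanout:
        return [flops]
    ordered = sorted(flops, key=lambda p: p[axis])   # sort along x (axis 0) or y (axis 1)
    mid = len(ordered) // 2
    next_axis = 1 - axis                             # alternate cut direction each level
    return (cluster_flops(ordered[:mid], max_fanout, next_axis)
            + cluster_flops(ordered[mid:], max_fanout, next_axis))


# Hypothetical placement: 200 flops scattered over a 1000 um x 1000 um block.
random.seed(1)
flops = [(random.uniform(0, 1000), random.uniform(0, 1000)) for _ in range(200)]

clusters = cluster_flops(flops, max_fanout=16)
print(len(clusters), "clusters, largest holds", max(len(c) for c in clusters), "flops")
```

Every cut depends on the x and y coordinates, so without placement there is nothing to sort: the clustering, and the tree built on top of it, has no input.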

There's a deeper reason the ordering matters: building the clock tree changes the timing landscape. Before CTS, timing tools assume an *ideal* clock — zero skew, arriving everywhere at once — so they can focus on getting the logic placed well. The moment the real tree exists, every flop's true clock-arrival time is locked in, and the tool flips to propagated clock mode: it now uses the actual insertion delays and skew when it computes slack. Place first so the tree is honest; build the tree so the timing becomes honest; then route and sign off against that honest timing.

A glimpse of useful skew

We've spent this whole guide treating skew as the enemy, and most of the time it is. But here's the advanced trick that flips the script: useful skew (also called intentional skew or clock skewing). The idea is that instead of forcing every flop to hear the tick at the exact same instant, you *deliberately* deliver the clock a little late to certain capture flops, handing their incoming data a few extra picoseconds to arrive.

Think of it as moving a deadline, not doing the work faster. Suppose a path is failing its setup check by 20 ps: it just can't get data to the capture flop in time. Rather than redesign the logic, the tool can delay *that flop's* clock edge by, say, 25 ps. Now the data has until the later edge to arrive, and the path passes with 5 ps to spare. The catch (and there's always a catch) is that the same flop is the *launch* flop for the next stage, so delaying its clock buys time for this path by stealing it from the next one. It also tightens the hold margin on the path coming *into* the delayed flop, because fast data now has to stay stable until the later capture edge. Useful skew is time-borrowing across the clock, and the tool balances the whole chain so the borrowing nets out.
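
The bookkeeping behind that trade can be sketched as follows. Delaying one flop's clock shifts four slacks at once: setup and hold on the path into the flop, and setup and hold on the path out of it. The -20 ps setup slack and 25 ps delay come from the example above; the other starting slacks and the function name are assumptions.

```python
def apply_useful_skew(setup_in, hold_in, setup_out, hold_out, delay_ps):
    """Delay one flop's clock by delay_ps and return the four updated slacks (ps)."""
    return {
        "setup_in": setup_in + delay_ps,    # incoming path gets a later deadline
        "hold_in": hold_in - delay_ps,      # but its data must stay stable longer
        "setup_out": setup_out - delay_ps,  # outgoing path launches later, so it loses time
        "hold_out": hold_out + delay_ps,    # and its earliest data change moves later too
    }


# Incoming path fails setup by 20 ps; the other three slacks are assumed values.
after = apply_useful_skew(setup_in=-20, hold_in=60, setup_out=40, hold_out=35, delay_ps=25)
for check, slack in after.items():
    status = "ok" if slack >= 0 else "VIOLATED"
    print(f"{check:>9}: {slack:4d} ps  {status}")
```

The borrow works here only because the outgoing path had 40 ps of setup slack and the incoming path had 60 ps of hold slack to give up; with less margin on either, the same 25 ps delay would just move the violation.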

Zoom out and CTS is one sentence: deliver the beat everywhere at the same time, then bend that rule on purpose only where it pays. Get the tree balanced and your millions of flip-flops march in lockstep; get it wrong and you've quietly taxed every timing budget on the chip. With the clock distributed and the cells placed, the design is finally ready for the wires — that's routing, next.