Seeing the Spectrum: the DFT and the FFT

From a wiggle to a recipe

Hold a glass prism up to white sunlight and a rainbow fans out across the wall. The light was never "white" in any pure sense — it was a sum of every colour at once, and the prism just sorted them by frequency. A block of audio samples is exactly the same kind of secret. When your microphone captures a chord, the wire carries one squiggly voltage, one number per instant. Buried inside that single stream are the separate notes — a 261 Hz C, a 329 Hz E, a 392 Hz G — all added together. The job of a Fourier transform is to be the prism: take the squiggle in, hand the recipe back out.

In rung 3 you met the z-transform, which mapped a discrete-time signal onto the whole complex z-plane and let you read its frequency behaviour by walking around the unit circle. The discrete Fourier transform, or DFT, is what you get when you stop walking the whole circle and instead stand on N evenly spaced points of it. Those N points are your frequency bins. The DFT answers one crisp question for each bin: *how much of this exact frequency is in my signal, and with what phase?*

What the DFT actually computes

Here is the whole idea in one move. To measure how much of frequency bin *k* lives in your signal, you compare the signal against a reference wave that completes exactly *k* full cycles across your block of N samples. "Compare" means: multiply the two together sample by sample and add up the result. If your signal and the reference rise and fall in step, the products pile up into a big number. If they are unrelated, the products scatter into positives and negatives that cancel to nearly zero. That single number is bin *k*. Do it for every *k* from 0 to N−1 and you have the whole spectrum.

        N-1
  X[k] = Σ  x[n] · e^(-j·2π·k·n/N)        k = 0 … N-1
        n=0

  x[n]  : the n-th input sample (real audio)
  X[k]  : a COMPLEX number for bin k
  e^(-j·2π·k·n/N) : the reference wave —
          a cos and a sin spinning at k cycles/block

  magnitude |X[k]| = sqrt(Re^2 + Im^2)  -> "how much"
  phase  ∠X[k]    = atan2(Im, Re)       -> "where in its cycle"

The DFT sum. The complex exponential is just a cosine and a sine bundled together so one multiplication captures both amplitude and phase.

The output `X[k]` is complex on purpose. A real signal carries two facts at each frequency — how loud (magnitude) and how shifted in time (phase) — and a single real number cannot hold both. The magnitude `|X[k]|` is what you usually plot; the phase tells you the timing, which matters enormously for filters and for reconstructing the signal, but is invisible on a plain magnitude spectrum.

Reading the magnitude spectrum: which bin is which Hz

A spectrum is useless until you can name the frequency under each peak. The rule is short and you should memorise it. If you sampled at rate Fs (samples per second) and took N samples, then bin *k* sits at frequency k · Fs / N hertz. The spacing between neighbouring bins — your frequency resolution — is therefore Fs / N. Want finer resolution? Take more samples (bigger N) or wait longer; physics gives you no free lunch.

  Fs = 8000 Hz sample rate,  N = 1024 samples

  bin spacing  = Fs / N = 8000 / 1024 = 7.8125 Hz per bin

  bin   0  ->     0 Hz   (DC, the average level)
  bin   1  ->   7.81 Hz
  bin  32  ->  250.0 Hz
  bin  64  ->  500.0 Hz
  bin 512  -> 4000.0 Hz  = Fs/2  (Nyquist, the top bin)
  bin 513..1023 -> mirror image of bins 1..511

The bin-to-hertz map for a 1024-point FFT at 8 kHz. The top half of the bins just mirror the bottom for real signals.

Two quirks always trip up beginners. First, bin 0 is DC — the plain average of the block — not a frequency at all. Second, for a real-valued signal the spectrum is symmetric: bins above N/2 are a mirror image of the bins below it, carrying no new information. That mirror line at bin N/2 sits exactly at Fs/2, the Nyquist frequency you met when learning sampling. In practice you plot only bins 0 to N/2 and ignore the reflection.

The FFT: how N² became N log N

The fast Fourier transform is not a different transform — it computes exactly the same `X[k]` the DFT does, to the last bit. It is a faster *route* to the same answer, and the trick is divide-and-conquer. Split your N samples into the even-indexed ones and the odd-indexed ones. It turns out each half's DFT can be reused, with a twist, to build the full result. Recurse: split each half again, and again, until you reach blocks of size one. Each level of splitting touches all N samples once, and there are only log₂N levels. N work per level times log₂N levels gives the famous N log N.

  Operation count: direct DFT  vs  radix-2 FFT

     N        N^2          N·log2(N)     speed-up
   -----   -----------   ------------   ---------
     64        4,096           384          ~11x
    256       65,536         2,048          ~32x
   1024    1,048,576        10,240         ~102x
   4096   16,777,216        49,152         ~341x
  65536    4.29e9          1,048,576      ~4096x

  The bigger the block, the more lopsided the win.

Why the FFT changed everything: the gap between N² and N log N explodes as N grows.

This is not a rounding-error improvement — it is the difference between *possible* and *impossible*. A 1024-point spectrum drops from a million operations to ten thousand, a hundredfold cut. That is precisely why a DSP processor or a phone's audio chip can paint a live spectrum analyser at video frame rates, why your Wi-Fi radio demodulates OFDM carriers in real time, and why an MRI scanner can reconstruct an image from raw frequency data. When people credit Cooley and Tukey's 1965 FFT paper with launching the digital signal processing age, this single algorithmic leap is what they mean.

Worked example: two tones, two peaks

Let's make the abstraction concrete. We sample at Fs = 8000 Hz and grab N = 1024 samples — that is 0.128 seconds of sound. The signal is a sum of two pure sine tones: a loud one at 250 Hz with amplitude 1.0, and a quieter one at 1000 Hz with amplitude 0.5. On the oscilloscope this looks like a fat wave with smaller ripples riding on top — pretty, but it does not announce its ingredients. Feed the block to an FFT and the answer is unmistakable.

Find the bin for 250 Hz: k = f · N / Fs = 250 × 1024 / 8000 = bin 32, landing exactly on a bin.
Find the bin for 1000 Hz: k = 1000 × 1024 / 8000 = bin 128, again landing exactly on a bin.
Compute the magnitudes `|X[k]|`. Every bin except 32 and 128 is essentially zero; those two stand up as sharp spikes.
Read the relative heights: the 250 Hz spike is twice as tall as the 1000 Hz spike, matching the 1.0-vs-0.5 amplitude ratio of the input tones.

  |X[k]| magnitude spectrum  (bins 0..160 shown)

  mag
   |        *  (bin 32 = 250 Hz, height ~512)
   |        |
   |        |
   |        |              *  (bin 128 = 1000 Hz, height ~256)
   |        |              |
   |________|______________|________________ bin (Hz)
   0   16   32   48 ... 112 128 144  160
      (125) (250)        (1000)

  Two clean tones in -> two clean peaks out.
  Peak 1 is twice peak 2: amplitude 1.0 vs 0.5.

The FFT of a two-tone signal. Because both frequencies land exactly on bins, the peaks are razor-sharp with no leakage.

Notice why this example is so tidy: both tones were chosen to land *exactly* on a bin centre, so all their energy collects in one bin and the peaks are perfect spikes. Real-world signals are not so polite — a guitar string at 247 Hz would fall between bins 31 and 32 and smear across both. That gap between this clean demo and messy reality is exactly the leakage problem the next rung tackles head-on.

Where this shows up in the wild

Once you can see a spectrum, you start seeing them everywhere. The dancing bars on a music player's visualiser are FFT magnitudes, one bar per group of bins, redrawn dozens of times a second. A guitar tuner runs an FFT and reports the frequency of the tallest peak. Noise-cancelling headphones and voice assistants split incoming audio into a spectrum, decide which bins are speech and which are hum, suppress the hum, and transform back to time.

In communications the FFT is even more central. Every Wi-Fi, 4G/5G, and digital-TV link uses OFDM, which packs data onto hundreds or thousands of closely spaced subcarriers — and the receiver separates those subcarriers with a single FFT per symbol. A modern phone runs millions of FFTs a second just to stay connected. Beyond audio and radio, the FFT reconstructs MRI and CT images, finds vibration signatures in jet-engine bearings, and compresses images in JPEG's cousin transforms. It is, by some counts, the most-executed nontrivial algorithm in the history of computing.