DSP in the Real World: Multirate, Fixed-Point, and DSP Processors

From equations to silicon: the gap nobody warns you about

In a textbook a filter is a clean line of algebra: multiply each sample by a coefficient, add them up, done. The numbers are real, the clock is infinite, and the sum never overflows. Real hardware grants none of these favours. Your samples arrive at, say, 48,000 per second whether you are ready or not; your numbers are 16-bit integers, not the real line; and if you ask for 192 multiply-adds per output, you had better finish them before the next sample lands. This rung is about that gap — and the three classic tricks engineers use to cross it: changing the sample rate cleverly (multirate DSP), living inside fixed-width integers (fixed-point arithmetic), and running it all on a machine built for the job (the DSP processor).

Hold one picture in your head as we go. A signal is a river of numbers flowing past a fixed point on the bank, one number per tick of the sample clock. Everything you do (filtering with an FIR filter, taking an FFT, correlating against a template) is something you must do to each drop before the next one arrives. The art of real-world DSP is arranging the work so you are never behind the river.

Multirate DSP: don't compute samples you'll throw away

Suppose your microphone hands you audio at 48 kHz but your speech model wants 8 kHz. The naive route is simply to keep one sample in six. Decimation (downsampling) also keeps one in six, but in the smart order: first a low-pass filter to kill everything above 4 kHz (dropping samples is itself a form of sampling, and anything above the new Nyquist limit would fold back as aliasing), *then* throw five of every six samples away. Going the other way, interpolation (upsampling) inserts zeros between samples to raise the rate, then low-pass filters to smooth the spectral images those zeros create. Chain the two, interpolating by L and then decimating by M, and you can resample by any rational factor L/M.

Decimate by 6   (48 kHz  ->  8 kHz)

  in @48k --> [ LPF  fc=4kHz ] --> [ keep 1 of 6 ] --> out @8k
              (anti-alias)          (downsample)

Interpolate by 3  (16 kHz  ->  48 kHz)

  in @16k --> [ insert 2 zeros ] --> [ LPF fc=8kHz ] --> out @48k
              (upsample, x3)          (anti-image)

Key insight: in decimation the filter runs at the HIGH rate but
you only need to compute outputs you keep -> a polyphase filter
computes only those, doing 6x less arithmetic.

Decimation filters before dropping samples; interpolation filters after inserting zeros. Polyphase structures skip the arithmetic you would have discarded.
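The interpolation path in the figure can be sketched in a few lines of C. This is a minimal illustration rather than a production resampler: the function name `interpolate` and its argument layout are assumptions made for the example, samples before the start of the input are treated as zero, and the caller is assumed to supply a low-pass filter `h[]` whose DC gain equals L so that the zero-stuffing does not lower the signal level.

```c
#include <stddef.h>

/* Upsample x[] by L and low-pass filter with h[] (nh taps).
   Conceptually: insert L-1 zeros after every input sample, then run the FIR.
   The loop below skips the taps that would land on an inserted zero, so it
   gives the same result without ever multiplying by zero.
   y[] must have room for nx * L samples. */
void interpolate(const float *x, size_t nx,
                 const float *h, size_t nh,
                 size_t L, float *y)
{
    for (size_t n = 0; n < nx * L; n++) {
        float acc = 0.0f;
        /* Only taps with (n - k) a multiple of L line up with a real sample. */
        for (size_t k = n % L; k < nh && k <= n; k += L)
            acc += h[k] * x[(n - k) / L];
        y[n] = acc;
    }
}
```

With L = 2 and the three-tap filter {0.5, 1.0, 0.5}, the input {2, 4, 6} comes out as {1, 2, 3, 4, 5, 6}: the original samples survive and each gap is filled with the average of its neighbours.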

The deep win is polyphase decomposition. In a decimate-by-6 filter, five of every six outputs of the anti-alias filter are computed only to be discarded — pure waste. A polyphase implementation factors the filter so it produces *only* the outputs you keep, cutting the multiply count by the same factor of six. This is why your phone can resample audio between a dozen sample rates without the battery noticing.
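To make the saving concrete, here is a hedged sketch that places the wasteful approach next to a polyphase one. Both function names are invented for this example, samples before the start of the input are assumed to be zero, and floating point is used only to keep the code short. The polyphase version splits `h[]` into M sub-filters (taps p, p+M, p+2M, ...) and evaluates them only at the output instants that survive.

```c
#include <stddef.h>

/* Reference: run the FIR at the full input rate, then keep every M-th result.
   y[] must have room for ceil(nx / M) samples. */
void decimate_naive(const float *x, size_t nx, const float *h, size_t nh,
                    size_t M, float *y)
{
    for (size_t n = 0; n < nx; n++) {
        float acc = 0.0f;
        for (size_t k = 0; k < nh && k <= n; k++)
            acc += h[k] * x[n - k];
        if (n % M == 0)
            y[n / M] = acc;        /* the other M-1 results are thrown away */
    }
}

/* Polyphase form: sub-filter p holds taps h[p], h[p+M], h[p+2M], ...
   Only the kept outputs are computed, so the multiply count drops by M. */
void decimate_polyphase(const float *x, size_t nx, const float *h, size_t nh,
                        size_t M, float *y)
{
    size_t ny = (nx + M - 1) / M;            /* number of kept outputs */
    for (size_t m = 0; m < ny; m++) {
        size_t n = m * M;                    /* input index of this output */
        float acc = 0.0f;
        for (size_t p = 0; p < M; p++) {             /* one branch per phase */
            for (size_t k = p; k < nh && k <= n; k += M)
                acc += h[k] * x[n - k];
        }
        y[m] = acc;
    }
}
```

Both routines produce the same samples; the second simply never visits the input positions whose results would be dropped. A real implementation would usually store each sub-filter contiguously and feed it from its own delay line, which this sketch leaves out.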

Fixed-point: doing serious math with integers

On paper a coefficient is 0.7071. In cheap, low-power hardware there is often no floating-point unit, only integer arithmetic. Fixed-point arithmetic solves this by silently agreeing on where the binary point lives. In Q15 format, a signed 16-bit integer represents a fraction between −1 and just under +1: the stored integer is the real value times 2¹⁵. So 0.7071 is stored as round(0.7071 × 32768) = 23170. Multiply two Q15 numbers and you get a Q30 result in 32 bits; shift right by 15 to return to Q15. The hardware only ever sees integers; the *programmer* keeps track of the point.

Q15 fixed-point  (one sign bit . fifteen fraction bits)

  real 0.7071  ->  round(0.7071 * 2^15) = 23170   (stored int16)
  real -1.0    ->  -32768                   (most negative)
  largest      ->  +32767  = 0.99997        (can't reach +1.0!)

  multiply:  int32_t prod = (int32_t)a * (int32_t)b;   // Q15 * Q15 = Q30
             int16_t out  = (int16_t)(prod >> 15);     // back to Q15 (truncates;
                                                       // -1 * -1 must be saturated)

  resolution (one LSB) = 2^-15 = 0.0000305
  -> rounding it away injects QUANTIZATION noise

Q15: the integer is the real value scaled by 2^15. Every product temporarily needs double width before being shifted back.
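The conversions in the figure can be wrapped into small helpers. The sketch below is one possible set, with names (`float_to_q15`, `q15_to_float`, `q15_mul`) chosen for this example. It adds two refinements the figure omits: rounding to nearest instead of truncating, and clamping the single product that does not fit, (−1) × (−1). It also assumes the compiler shifts negative values arithmetically, which mainstream compilers do but the C standard leaves implementation-defined.

```c
#include <stdint.h>

/* Real value -> Q15, rounded to nearest and clamped to [-32768, 32767]. */
int16_t float_to_q15(double v)
{
    double scaled = v * 32768.0 + (v >= 0.0 ? 0.5 : -0.5);
    if (scaled >= 32768.0)  return INT16_MAX;
    if (scaled <= -32769.0) return INT16_MIN;
    return (int16_t)scaled;            /* cast truncates toward zero */
}

/* Q15 -> real value. */
double q15_to_float(int16_t q)
{
    return q / 32768.0;
}

/* Q15 * Q15 -> Q30 in 32 bits, round, shift back to Q15.
   Assumes >> on a negative int32_t is an arithmetic shift. */
int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t prod = (int32_t)a * (int32_t)b;   /* Q30, always fits in 32 bits */
    prod += 1 << 14;                          /* add half a Q15 LSB */
    int32_t q = prod >> 15;                   /* back to Q15 */
    if (q > INT16_MAX) q = INT16_MAX;         /* only hit by (-1) * (-1) */
    return (int16_t)q;
}
```

As a check, `float_to_q15(0.7071)` gives 23170, and `q15_mul(23170, 23170)` gives 16383, one LSB short of the 16384 that would represent exactly 0.5.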

This frugality has two costs, and respecting them separates working firmware from a glitchy mess. The first is quantization noise: rounding every coefficient and every result to the nearest LSB injects a tiny error, like dust on a lens. With 16 bits you get roughly 96 dB of dynamic range (≈6 dB per bit); with 24 bits, about 144 dB, which is why studio audio and seismometers reach for the wider word. The second cost is overflow: add two large Q15 numbers and the sum can exceed +1, wrapping a near-maximum positive into a large negative. The result is a sickening *click* in audio or an instability in a control loop.
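The "6 dB per bit" rule of thumb comes from comparing full scale with one LSB. For an N-bit word the ratio is 2^N, so the worked arithmetic is:

```latex
\[
\mathrm{DR}_{\mathrm{dB}} = 20\log_{10}\!\left(2^{N}\right) = 20\,N\log_{10} 2 \approx 6.02\,N
\]
\[
N = 16:\ \mathrm{DR} \approx 96.3\ \mathrm{dB},
\qquad
N = 24:\ \mathrm{DR} \approx 144.5\ \mathrm{dB}
\]
```

A closely related figure, the signal-to-quantization-noise ratio of a full-scale sine wave under the usual uniform-error model, is commonly quoted as 6.02N + 1.76 dB. Either way each extra bit buys about 6 dB.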

Pick a format with headroom. If sums of products can approach ±4, leave two integer bits (use Q13, which spans −4 to just under +4, not Q15); if they can reach +4 exactly or go beyond, step down to Q12. Budget the bits before you write a line of code.
Accumulate wide. Sum your products in a 40-bit accumulator with guard bits, not a 16-bit register, so intermediate sums can't overflow even if the final answer fits.
Saturate, don't wrap. Use saturating arithmetic so an overflow clamps to the maximum value instead of wrapping to a huge opposite-sign number — a gentle distortion beats a violent click.
Scale before you risk it. Knock a signal down a few bits before a gainy stage, then scale back after — trading a little noise floor for guaranteed overflow safety.
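Two of these habits, accumulating wide and saturating, are easy to show in C. The sketch below is illustrative only: the function names are made up for this example, and a 64-bit integer stands in for the 40-bit accumulator that a DSP would provide in hardware. The wrapping add is included purely for comparison and relies on the usual two's-complement conversion behaviour.

```c
#include <stdint.h>
#include <stddef.h>

/* Clamp a wide value into the 16-bit Q15 range instead of letting it wrap. */
int16_t sat16(int64_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Saturating Q15 add: 0.75 + 0.75 clamps just under +1. */
int16_t q15_add_sat(int16_t a, int16_t b)
{
    return sat16((int64_t)a + (int64_t)b);
}

/* Plain 16-bit wraparound, shown only to contrast with the saturating add. */
int16_t q15_add_wrap(int16_t a, int16_t b)
{
    return (int16_t)(uint16_t)((uint16_t)a + (uint16_t)b);
}

/* Q15 dot product: Q30 products are summed in a wide accumulator, then
   rounded, shifted back to Q15 and saturated once at the very end. */
int16_t q15_dot(const int16_t *x, const int16_t *h, size_t n)
{
    int64_t acc = 0;                            /* stand-in for 40-bit acc */
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)x[i] * (int32_t)h[i];   /* Q30 product */
    acc += 1 << 14;                             /* round to nearest */
    return sat16(acc >> 15);
}
```

Adding 24576 (0.75) to itself returns 32767 from the saturating version and −16384 from the wrapping one. In `q15_dot`, the intermediate sum is free to exceed the Q15 range as long as the final result fits.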

The DSP processor: a machine that multiplies and adds for a living

Look closely at any DSP algorithm (an FIR filter, a convolution, an FFT butterfly) and you find the same atom repeated millions of times: acc = acc + (x × h), a multiply followed by an add. A simple general-purpose CPU spends several instructions on this (two loads, a multiply, an add, pointer updates). A DSP processor does it in *one*, every clock cycle, in a dedicated MAC unit (multiply-accumulate). That single architectural choice is the difference between keeping up with the river and drowning.

FIR filter:  y[n] = sum_{k=0..N-1} h[k] * x[n-k]

On a DSP, the inner loop is ONE instruction per tap:

  loop  MAC  *AR0+, *AR1+, A    ; A += (*AR0) * (*AR1)
             ; AR0 walks coeffs h[], AR1 walks a CIRCULAR buffer x[]
             ; single cycle: fetch 2 operands, multiply, add, bump ptrs

  A 64-tap FIR at 48 kHz  = 64 MACs x 48000 = 3.07 MMAC/s
  A 1 GHz single-MAC DSP  = 1000 MMAC/s  ->  loafing at 0.3% load

The circular buffer is the trick: AR1 wraps from the end of x[]
back to the start automatically, so the delay line needs NO
memory shuffling -- just a moving write pointer.

One MAC instruction per filter tap, with hardware circular addressing so the delay line never has to be copied.

Two more hardware features make the MAC unit unstoppable. First, circular buffers: an FIR's delay line is the last N samples, and naively you would shift every sample down one slot per output — N copies of pure overhead. Instead, the address generator wraps a write pointer around a fixed buffer, so a new sample overwrites the oldest with zero shuffling. Second, Harvard memory: separate buses for instructions and data (often two data buses) let the processor fetch a coefficient *and* a sample *and* the next instruction in the same cycle, so the MAC is never starved. Add zero-overhead hardware loops and you have a machine that can sustain one tap per clock indefinitely.
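In portable C, the circular delay line has to be emulated with explicit index wrapping, which is exactly the bookkeeping a DSP's address generator does for free. The following is a rough sketch under several assumptions: the struct and function names are invented, the length `FIR_TAPS` is an arbitrary small value, samples and coefficients are Q15, and a 64-bit accumulator stands in for the hardware's guard bits.

```c
#include <stdint.h>
#include <stddef.h>

#define FIR_TAPS 4   /* hypothetical filter length, for illustration only */

typedef struct {
    int16_t delay[FIR_TAPS];   /* the last FIR_TAPS input samples (Q15) */
    size_t  head;              /* index of the most recent sample */
} fir_state;

/* Push one sample and return one output: y[n] = sum_k h[k] * x[n-k]. */
int16_t fir_step(fir_state *s, const int16_t h[FIR_TAPS], int16_t x_new)
{
    /* Advance the write pointer with wraparound; the newest sample
       overwrites the oldest, and nothing else in the buffer moves. */
    s->head = (s->head + 1) % FIR_TAPS;
    s->delay[s->head] = x_new;

    int64_t acc = 0;
    size_t idx = s->head;
    for (size_t k = 0; k < FIR_TAPS; k++) {
        acc += (int32_t)h[k] * (int32_t)s->delay[idx];   /* one MAC per tap */
        idx = (idx == 0) ? FIR_TAPS - 1 : idx - 1;       /* step back, wrap */
    }

    acc += 1 << 14;            /* round to nearest */
    acc >>= 15;                /* Q30 -> Q15 */
    if (acc > INT16_MAX) acc = INT16_MAX;
    if (acc < INT16_MIN) acc = INT16_MIN;
    return (int16_t)acc;
}
```

Starting from a zeroed `fir_state` and feeding an impulse of 0.5 (16384) followed by zeros plays back the coefficients scaled by one half, one per call, and then silence once the impulse has been overwritten.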

Correlation: finding a needle the radio already knows

We close with the operation that shows DSP's reach beyond filtering. Correlation slides one signal across another and asks, at every offset, *how much do these two look alike?* It is convolution without the time-flip — the same stream of MACs — and it answers a profoundly useful question: where, and how strongly, is a known pattern hiding inside a noisy stream? A high correlation peak says *the template is here, at this exact lag*.

When the pattern you correlate against is the *exact shape of a transmitted pulse*, correlation becomes a matched filter: the linear filter that provably maximises the output signal-to-noise ratio for a known signal buried in white noise, because it piles the signal's energy into one sharp spike while the noise stays spread out. This is the quiet engine behind an astonishing amount of technology.

Correlation of received r[n] with known template s[n]:

  R[d] = sum_n  r[n] * s[n-d]      (just a sliding MAC, like FIR)

  --- noise floor ---  ...,-2,1,-1,3,-1, |  17  | ,-2,1,3,-1,...
                                          ^^^^^^
                       sharp peak at lag d* = the template is HERE

  Radar:   (d* / fs) * c / 2   -> target distance  (fs = sample rate)
  GPS/CDMA: align local PN code -> lock + which satellite
  5G/Wi-Fi: correlate preamble  -> frame start & timing
  Sonar/ultrasound, DNA seq, even audio fingerprinting (Shazam)

Correlation is a sliding multiply-accumulate; the peak's location is the answer. Convert lag to range, timing, or identity depending on the application.

The reach is genuinely staggering for one operation. A radar correlates the echo against the pulse it sent; the lag of the peak, converted to seconds and multiplied by the speed of light over two, is the target's distance. A GPS receiver correlates the sky against each satellite's known pseudo-random code, simultaneously locking on, identifying the satellite, and measuring range. It is the same spread-spectrum idea that lets dozens of CDMA phones share one band. A 5G or Wi-Fi receiver correlates against a known preamble to find exactly where each frame begins. One sliding MAC, run on a DSP processor in fixed point, and you have synchronisation, ranging, and detection: the connective tissue of the wireless world.
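A sliding correlator and the radar conversion fit in a few lines. This is a sketch under stated assumptions, not a receiver: the function names are hypothetical, only lags where the template fits entirely inside the received buffer are searched, and the range formula assumes the lag is measured in samples at a known sample rate `fs_hz`.

```c
#include <stdint.h>
#include <stddef.h>

/* Slide template s[] (ns samples) across r[] (nr samples) and return the
   lag d that maximises R[d] = sum_j r[d + j] * s[j].
   If peak_value is non-NULL it receives the winning correlation value.
   When nr < ns no lag is tested; the function then returns 0 and reports
   INT64_MIN as the peak. */
size_t correlate_peak(const int16_t *r, size_t nr,
                      const int16_t *s, size_t ns,
                      int64_t *peak_value)
{
    size_t best_lag = 0;
    int64_t best = INT64_MIN;
    for (size_t d = 0; d + ns <= nr; d++) {
        int64_t acc = 0;
        for (size_t j = 0; j < ns; j++)
            acc += (int32_t)r[d + j] * (int32_t)s[j];   /* sliding MAC */
        if (acc > best) {
            best = acc;
            best_lag = d;
        }
    }
    if (peak_value)
        *peak_value = best;
    return best_lag;
}

/* Radar: lag in samples -> round-trip time -> one-way distance in metres. */
double lag_to_range_m(size_t lag, double fs_hz)
{
    const double c = 299792458.0;          /* speed of light in m/s */
    return ((double)lag / fs_hz) * c / 2.0;
}
```

Burying a 7-chip Barker code at three times unit amplitude inside a short buffer of ±1 clutter, the correlator reports the offset where the code starts, with a peak of 21 against sidelobes of a few counts. A lag of 100 samples at 1 MHz is 100 µs of round-trip time, or roughly 15 km of range.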

The system view: where it all clicks together

Step back and watch a software-defined radio swallow an FM broadcast, and you will see every rung of this track working as one organism. The antenna's wideband stream is mixed down and decimated to isolate one station — multirate. The samples live as fixed-point integers to keep the front end cheap and cool. A bank of FIR filters running on MAC units with circular buffers separates audio from noise — the DSP processor. A correlation against a known sync word finds the start of each data frame — and an FFT turns the leftover spectrum into the picture you stare at while tuning.

That is the whole point of this final rung. The transforms, filters, and sampling theorems from earlier were never separate tools — they were parts waiting for a chassis. Multirate decides *how fast* the numbers flow, fixed-point decides *how wide* each number is, and the DSP processor decides *how many* operations you can afford per second. Master those three constraints and you can take any block-diagram from a paper and make it run, in real time, on a chip you can hold.