Shrinking Models

Why a model is too big in the first place

By the time you reach this rung you can train a model, serve it, and watch its latency. But the model you trained is almost always bigger than the one you should deploy. A large language model with tens of billions of parameters holds each weight as a 16- or 32-bit number; multiply that by the billions of weights touched on every token and you get the brutal arithmetic of inference cost — memory to hold the weights, bandwidth to move them, and energy to multiply them, paid again on every single request.

There is a reason the model is fat, though, and it is not waste. Bigger capacity made it easier to *train* — extra parameters give gradient descent more room to find a good solution, and the scaling laws that drove the last decade say loss falls smoothly as you add parameters and data. The key insight behind every technique in this guide: the size you need to *learn* something is far larger than the size you need to *run* it. Shrinking is the art of throwing away the scaffolding once the building stands.

Quantization: spend fewer bits per number

The cheapest, most universal lever is [[quantization-ml|quantization]]: store and compute each weight with fewer bits. A weight trained as a 32-bit float carries far more precision than the model actually uses, so you round it to a smaller format — 8-bit integers (INT8) are routine, and 4-bit is now common for LLMs. Going from 16-bit to 4-bit cuts the memory footprint roughly fourfold, which is often the difference between a model that fits on one GPU and one that does not.
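To make the footprint arithmetic concrete, here is a minimal sketch. The 70-billion-parameter count is a hypothetical figure chosen only for illustration, and the calculation covers the weights alone; it ignores per-group scale factors, activations, and any runtime caches.

```python
def weight_memory_gb(num_params: int, bits_per_weight: int) -> float:
    """Memory needed to hold the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

# Hypothetical model size, used only to show how the numbers scale.
params = 70_000_000_000
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(params, bits):6.1f} GB")
```

The ratio between any two rows depends only on the bit widths, which is why the 16-bit to 4-bit saving is fourfold regardless of model size.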

Don't confuse this with the mixed-precision training you met earlier. There, lower precision speeds up the math while a full-precision copy keeps training stable. *Post-training quantization* happens after the model is frozen: you map ranges of float values to a small grid of integers, store a scale factor per group of weights, and reconstruct an approximate float on the fly. The art is choosing the grid. Naive rounding wrecks accuracy, because a few rare outlier weights have huge magnitudes: they stretch the range the grid has to cover, so the steps become coarse and everything else is squashed into one bucket.

32-bit float weight   ->   4-bit integer + per-group scale
  0.0731  0.0024 ...        [ 9, 0, 14, 3, ... ]   x  scale=0.0081
  ~4 bytes each            ~0.5 byte each
  full precision          ~8x smaller, tiny rounding error

Quantization rounds each weight to a coarse integer grid, storing one shared scale per group to reconstruct an approximate value at run time.
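The round trip can be sketched in a few lines. This is one simple variant (symmetric "absmax" scaling with signed 4-bit codes), assumed here for illustration rather than taken from any particular library; the function names and the sample weights are hypothetical.

```python
def quantize_group(weights, bits=4):
    """Symmetric absmax quantization of one group: returns (codes, scale)."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for signed 4-bit codes
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0    # one shared scale per group
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_group(codes, scale):
    """Reconstruct approximate floats from integer codes at run time."""
    return [c * scale for c in codes]

# A well-behaved group keeps its detail...
clean_codes, clean_scale = quantize_group([0.07, -0.02, 0.11, 0.03])
# ...while a single outlier stretches the grid and flattens the rest.
outlier_codes, outlier_scale = quantize_group([0.07, -0.02, 0.11, 3.0])

print(clean_codes, dequantize_group(clean_codes, clean_scale))
print(outlier_codes, dequantize_group(outlier_codes, outlier_scale))
```

Comparing the two groups shows why grid choice matters: in the second, the three small weights all collapse to the same code. Keeping groups small, or handling outliers separately, limits how far one large value can distort its neighbours.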

How far can you push it? For most models INT8 is nearly lossless; 4-bit costs a little quality but is usually worth it; below that, accuracy tends to fall off a cliff. If even careful rounding hurts too much, *quantization-aware training* fine-tunes the model with the rounding simulated in the loop, so gradient descent learns weights that survive the squeeze. The payoff is real: a quantized model needs less memory, moves fewer bytes, and on integer-friendly hardware runs faster — improving both latency and throughput at once.
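The "rounding simulated in the loop" idea can be illustrated with a deliberately tiny toy: a one-weight model `y = w * x` trained by gradient descent. It assumes a common trick, the straight-through estimator, in which the forward pass uses the rounded weight while the gradient is applied to the unrounded (latent) weight as if rounding were the identity. All names, the grid step of 0.1, and the data are hypothetical.

```python
def fake_quant(w, scale, qmax=7):
    """Round w to the integer grid and back, as the deployed model would see it."""
    code = max(-qmax, min(qmax, round(w / scale)))
    return code * scale

def train_qat(xs, ys, w=0.0, scale=0.1, lr=0.05, steps=200):
    """Fit y = w * x while the forward pass only ever sees the quantized weight."""
    n = len(xs)
    for _ in range(steps):
        w_q = fake_quant(w, scale)                  # forward: rounded weight
        # Backward: mean squared error gradient, passed straight through
        # the rounding step to the latent float weight.
        grad = sum(2 * (w_q * x - y) * x for x, y in zip(xs, ys)) / n
        w -= lr * grad
    return w, fake_quant(w, scale)

xs = [1.0, 2.0, 3.0]
ys = [0.33 * x for x in xs]                         # ideal weight is 0.33
latent, deployed = train_qat(xs, ys)
print(latent, deployed)
```

In this one-weight toy the latent weight settles near the boundary between two grid points, so the deployed value lands on a grid point adjacent to the ideal 0.33. A real network has many weights to adjust, which is what gives quantization-aware training room to absorb the rounding error.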

Pruning: throw away the dead weight

[[pruning|Pruning]] attacks size from a different angle: instead of cheapening every number, it deletes numbers entirely. Many trained weights sit near zero and contribute almost nothing to the output, so you set them to zero and skip them. *Unstructured* pruning zeroes individual weights. It can drop a large fraction with little accuracy loss, but a scattered, sparse matrix is hard for a GPU to speed up, so you save memory (when the weights are stored in a sparse format) without gaining much speed.
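A minimal sketch of unstructured pruning, assuming the simplest selection rule (drop the weights with the smallest absolute value). The function name and the flat-list representation are illustrative choices, not a library API.

```python
def magnitude_prune(weights, sparsity):
    """Zero the smallest-magnitude fraction `sparsity` of a flat weight list."""
    k = int(len(weights) * sparsity)                # number of weights to drop
    if k == 0:
        return list(weights)
    # Indices of the k weights closest to zero.
    by_magnitude = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(by_magnitude[:k])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

pruned = magnitude_prune([0.5, -0.01, 0.02, -0.9, 0.003, 0.3], sparsity=0.5)
print(pruned)
```

Note that the output has the same length as the input: the zeros are still stored and still multiplied unless the runtime uses a sparse representation.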

*Structured* pruning is what usually pays off in production: you remove whole units — entire neurons, attention heads, or layers from the transformer. The remaining model is smaller *and* dense, so ordinary hardware runs it faster with no special tricks. The classic recipe is iterative: prune a slice, fine-tune to let the survivors recover, measure, and repeat. Prune too aggressively in one shot and the model never recovers; do it gradually and you can often shed a third of the network with barely a dent.
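One cut of structured pruning might look like the sketch below, which removes whole hidden neurons from a two-layer network. Ranking neurons by the L2 norm of their incoming weights is an assumption made for brevity; other importance scores are used in practice. The fine-tuning step that follows each cut is omitted, and all names are hypothetical.

```python
def prune_neurons(w1, w2, keep):
    """Drop hidden neurons from a two-layer MLP, keeping the `keep` strongest.

    w1: hidden x inputs  (one row per hidden neuron)
    w2: outputs x hidden (one column per hidden neuron)
    """
    norms = [sum(v * v for v in row) ** 0.5 for row in w1]
    ranked = sorted(range(len(w1)), key=lambda j: norms[j], reverse=True)
    kept = sorted(ranked[:keep])                    # preserve original order
    new_w1 = [w1[j] for j in kept]                  # remove pruned rows
    new_w2 = [[row[j] for j in kept] for row in w2] # remove matching columns
    return new_w1, new_w2

w1 = [[1.0, 2.0], [0.01, 0.02], [3.0, 0.0]]
w2 = [[0.5, 0.1, -0.2]]
small_w1, small_w2 = prune_neurons(w1, w2, keep=2)
print(small_w1, small_w2)
```

Unlike the unstructured case, the matrices actually shrink, so the result is a smaller dense model that needs no sparse kernels.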

Distillation: teach a small model what a big one knows

[[knowledge-distillation|Knowledge distillation]] is the most ambitious lever. Rather than shrink the trained model, you train a brand-new, smaller *student* to imitate a large *teacher*. The trick is what the student copies: not just the teacher's final answer (the hard label), but its full probability distribution over every option — the *soft labels*. When a teacher says "80% cat, 15% dog, 5% fox," that spread tells the student that cats and dogs look more alike than cats and foxes. Those gradations carry far more signal per example than a one-hot answer, so a small student learns from them faster and better.
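The soft-label objective can be sketched as a divergence between the teacher's and the student's output distributions. The version below assumes a KL divergence on temperature-softened probabilities, a common formulation; the temperature parameter is an addition not discussed above, and the function names are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature flattens the spread."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                                 # subtract max for stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) between temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Teacher logits chosen so that softmax gives 80% cat, 15% dog, 5% fox.
teacher = [math.log(0.80), math.log(0.15), math.log(0.05)]
student = [1.0, 0.0, 0.0]                           # an untrained guess
print(distillation_loss(student, teacher))
```

The loss is zero only when the student reproduces the teacher's whole distribution, not merely its top answer. In practice this term is often combined with an ordinary hard-label loss.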

This is the path behind many of the small, fast models you can actually afford to serve — the same idea labelled model distillation when the teacher is a foundation model and the student is a compact deployable one. Be honest about the ceiling, though: distillation transfers the teacher's *behavior on the data you show it*, not some essence of intelligence. The student inherits the teacher's blind spots and biases, and on tasks far from the distillation data it will simply fall short. A distilled student is a faithful echo of one capability, not a miniature general mind.

Putting it together: cheap inference in practice

These levers stack. A common production pipeline distills a capable teacher into a smaller student, prunes the student structurally, then quantizes the result to 4 or 8 bits: three wins on size and cost that multiply together. The umbrella term for all of it is model compression, and the discipline is empirical: each step needs its own round of measurement, because the losses compound and sometimes interact in surprising ways.

Fix a budget and a bar first: the latency, memory, and cost ceiling you must hit, and the accuracy floor (on worst-case slices, not just the average) you must not drop below.
Try quantization first — it's the cheapest, most reversible win, and INT8 alone often gets you under budget with negligible loss.
If you still need more, prune structurally with a fine-tune after each cut, or distill a smaller student from the model you have.
Compile and deploy the final model through a portable runtime so it runs fast on your target hardware, edge or server.
Re-measure end to end and keep watching after launch — a compressed model can degrade differently than its parent as data drifts.
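The first and last steps above amount to a gate that every candidate model must pass. A minimal sketch, assuming you already have measured latency, memory, cost, and per-slice accuracy for a candidate; every field name, unit, and threshold here is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Budget:
    max_latency_ms: float
    max_memory_gb: float
    max_cost_per_1k_requests: float
    min_worst_slice_accuracy: float

def passes(budget, latency_ms, memory_gb, cost_per_1k, slice_accuracies):
    """True when every ceiling is met and the worst slice clears the floor."""
    worst = min(slice_accuracies.values())          # judge by the weakest slice
    return (latency_ms <= budget.max_latency_ms
            and memory_gb <= budget.max_memory_gb
            and cost_per_1k <= budget.max_cost_per_1k_requests
            and worst >= budget.min_worst_slice_accuracy)

budget = Budget(max_latency_ms=50, max_memory_gb=8,
                max_cost_per_1k_requests=0.10, min_worst_slice_accuracy=0.90)
ok = passes(budget, latency_ms=40, memory_gb=6, cost_per_1k=0.08,
            slice_accuracies={"common": 0.95, "rare": 0.91})
print(ok)
```

Running the same check after each compression step, and again on fresh production data, keeps the accuracy floor from eroding unnoticed.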

Two practical companions close the loop. A portable model format like ONNX, paired with an engine such as ONNX Runtime, lets you export the compressed model once and run it across CPUs, GPUs, and accelerators without rewriting it. And small enough models open the door to edge deployment: running on a phone or sensor with no round-trip to the cloud at all. None of this is set-and-forget: keep your compressed model under the same monitoring as any other, because the cliff you avoided in testing can still appear in the wild.