Serving Models at Low Latency

Inference is a different job than training

By now you have trained models: you ran gradient descent over a dataset, watched the loss fall, and saved the weights. Serving is the other half of the life cycle, and the distinction between training and inference runs deep. Training happens once (or periodically), offline, in big jobs where you happily wait hours; inference happens millions of times, online, while a person or another system is waiting on the answer.

At inference time there is no backward pass: no backpropagation, no gradients, no optimizer state. You just run the forward pass and read the output. That sounds cheaper, and per call it is — but the economics flip, because you pay for it constantly. A model trained once for a fortnight may then serve for years, so the lifetime inference cost usually dwarfs the training bill.

Latency and throughput pull in opposite directions

Two numbers govern a serving system, and the tension between latency and throughput is the central trade-off. Latency is how long one request takes — the wait a single user feels. Throughput is how many requests you finish per second across all users. They are not the same thing, and pushing on one often hurts the other.

Report latency as a distribution, not an average. The number that matters is usually a tail percentile such as the p99, the latency that 99% of requests come in under, because the slowest 1% is what your unluckiest users feel, and averages quietly hide it. A service can look fast on average while one request in a hundred takes two seconds. Set a *budget* (say, p99 under 200 ms) and treat every optimization as buying headroom against it.
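To make the gap between the average and the tail concrete, here is a minimal sketch. The latency samples are invented for illustration, and `percentile` is a hypothetical helper that uses the simple nearest-rank definition; a production system would more likely read percentiles from its metrics library.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds: 98 fast requests and 2 slow ones.
latencies_ms = [50.0] * 98 + [2000.0] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)
budget_ms = 200.0  # the example budget from the text

print(f"mean={mean_ms:.0f} ms  p50={p50_ms:.0f} ms  p99={p99_ms:.0f} ms")
print("within budget" if p99_ms <= budget_ms else "over budget")
```

In this made-up sample the mean sits comfortably under the 200 ms budget while the p99 is ten times over it, which is exactly the situation an average hides.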

Why the tug-of-war? A GPU is happiest doing a big matrix multiply all at once. Run requests one at a time and the chip sits mostly idle between them — low throughput, but each answer comes back fast. Group requests together and the chip stays busy — high throughput — but the first request now waits for its groupmates. That grouping is the single biggest lever in serving, so it gets its own section.

Batching: feed the chip, mind the wait

Request batching stacks several inputs into one tensor and runs them through the network together. Because the weights are loaded from memory once and reused across the whole batch, you do far more useful math per byte moved, and on modern accelerators memory bandwidth, not arithmetic, is usually the bottleneck. Bigger batches mean better throughput, right up until you run out of memory or blow your latency budget.
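The amortization argument can be sketched with a toy cost model. Everything here is an assumption made for illustration: the function names, the fixed 8 ms weight-load cost, and the 0.5 ms of compute per item are invented, and real hardware does not behave this linearly.

```python
def batch_step_ms(batch_size, weight_load_ms=8.0, per_item_ms=0.5):
    """Toy model: one fixed weight load per step plus a compute cost per item."""
    return weight_load_ms + per_item_ms * batch_size

def throughput_rps(batch_size, **costs):
    """Requests finished per second if the chip runs batches back to back."""
    return batch_size / batch_step_ms(batch_size, **costs) * 1000.0

for batch_size in (1, 8, 64):
    latency = batch_step_ms(batch_size)
    rate = throughput_rps(batch_size)
    print(f"batch={batch_size:3d}  step latency={latency:5.1f} ms  throughput={rate:7.1f} req/s")
```

Under these assumed costs, throughput climbs steeply with batch size because the weight load is shared, while the time each request spends in the step also grows. That second number is the one the latency budget caps.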

The simplest version, *static batching*, waits until it has collected N requests (or until a short timeout expires) and then runs them together. But for a large language model that generates one token at a time, static batching is wasteful: short replies finish early and their slots sit idle while the longest reply drags on. *Continuous batching* fixes this by checking the batch at every generation step, swapping out any request that has finished and swapping a fresh one into its slot, so the batch stays full. This, plus the KV cache that stores past attention keys and values so each new token is cheap, is why modern LLM servers reach the throughput they do.

static batching:    [req A]......done  (slot idle)
                    [req B]...............done
                    waste = idle slots while B finishes

continuous batching: A finishes -> C jumps into A's slot
                     batch stays full every step

Continuous batching refills empty slots step by step instead of waiting for the whole batch.
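The difference can be simulated by counting generation steps. This is a simplified sketch, not how any particular server schedules work: it assumes every request is already queued, each step emits exactly one token per occupied slot, and every reply is at least one token long. The function names and the reply lengths are hypothetical.

```python
from collections import deque

def static_batching_steps(lengths, slots):
    """Each group of `slots` requests runs until its longest member finishes."""
    total = 0
    for start in range(0, len(lengths), slots):
        total += max(lengths[start:start + slots])
    return total

def continuous_batching_steps(lengths, slots):
    """Refill a slot from the queue as soon as its request finishes."""
    queue = deque(lengths)
    active = []  # tokens still to generate for each occupied slot
    steps = 0
    while queue or active:
        # Fill any free slots before the step runs.
        while queue and len(active) < slots:
            active.append(queue.popleft())
        steps += 1
        # Every active request emits one token; finished requests leave the batch.
        active = [remaining - 1 for remaining in active if remaining - 1 > 0]
    return steps

# Hypothetical reply lengths in tokens, served with two batch slots.
lengths = [2, 10, 3, 9, 2, 8, 4, 10]
print("static:    ", static_batching_steps(lengths, slots=2))
print("continuous:", continuous_batching_steps(lengths, slots=2))
```

With these lengths the static schedule needs 37 steps and the continuous one 28, because short replies no longer hold a slot hostage until their longest groupmate finishes.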

The vector database: serving facts, not just weights

A model's weights are frozen at training time, so they cannot know today's news or your private documents — and asking them to recall specifics invites confident fabrication. The fix is to *retrieve* relevant text at request time and hand it to the model as context. That is retrieval-augmented generation, and its engine is the vector database.

Here is the mechanism. Every document is turned into an embedding — a vector that places similar meanings near each other in space. A vector database stores millions of these and answers one question very fast: *which stored vectors are nearest to this query vector?* Nearest in the embedding's geometry means closest in meaning, so you fetch the passages that are actually relevant, not just ones sharing keywords.
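The core query can be sketched as an exact, brute-force search using cosine similarity as the measure of nearness. The three-dimensional vectors and document names below are invented for illustration; real embeddings come from an embedding model and have hundreds or thousands of dimensions, and other distance measures are also common.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, store, k=2):
    """Exact top-k search: score every stored vector, keep the k most similar."""
    scored = sorted(
        store.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:k]]

# Hypothetical document embeddings keyed by document name.
store = {
    "refund-policy": [0.9, 0.1, 0.0],
    "return-window": [0.8, 0.3, 0.1],
    "gpu-drivers": [0.0, 0.2, 0.9],
}
query = [1.0, 0.1, 0.0]  # stands in for an embedded user question

top = nearest(query, store, k=2)
print(top)
```

This scan touches every stored vector, so its cost grows linearly with the size of the collection.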

Scanning every vector exactly would be too slow at scale, so these databases use approximate nearest-neighbor indexes (graph-based ones like HNSW are common). They return *almost* the best matches in a fraction of the time — a deliberate accuracy-for-speed trade, the same bargain you keep meeting in serving. For low latency, the retrieval step has its own budget: keep the index in memory, and remember that the context window you stuff with retrieved text is not free, since every extra token costs compute on every generated token.
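The cost of a stuffed context window can be made rough but concrete. The sketch below counts only how many cached positions attention reads while generating, assuming a KV cache; it ignores the cost of processing the prompt itself and all non-attention compute. The function name and token counts are hypothetical.

```python
def attended_positions(context_tokens, generated_tokens):
    """Count cached positions read by attention across all generation steps."""
    total = 0
    for step in range(generated_tokens):
        # Each new token attends to the whole prompt plus the tokens generated before it.
        total += context_tokens + step
    return total

short_context = attended_positions(context_tokens=1000, generated_tokens=100)
long_context = attended_positions(context_tokens=4000, generated_tokens=100)
print(short_context, long_context)
```

Under this simplified count, quadrupling the retrieved context nearly quadruples the attention work for the same 100-token reply, which is why it pays to retrieve fewer and more relevant passages.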

Edge deployment: moving the model to the data

So far we have assumed the model lives in a datacenter. Edge deployment flips that: the model runs on the phone, the camera, the car, the sensor — right where the data is born. The motive is rarely raw speed alone. It is the round trip you delete (no network hop to a server), the data you never send (privacy, and it works offline), and the bandwidth and server bill you stop paying.

The catch is the budget. An edge device has a fraction of a datacenter GPU's memory and power, so the model often has to shrink first — through quantization (storing weights in 8 or 4 bits instead of 32) and other compression tricks covered in the next guide. A portable runtime such as ONNX Runtime then lets one exported model run across the messy zoo of edge chips. Less precision can mean a small accuracy dip, so you measure it — never assume it is free.
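As a minimal sketch of the idea, here is symmetric 8-bit quantization of a handful of weights with a single scale for the whole list. This is one simple scheme chosen for illustration, not the method of any particular toolkit; the weights are invented and the helper names are hypothetical.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [value * scale for value in quantized]

# Hypothetical weights; a real layer has millions of them.
weights = [0.82, -1.27, 0.003, 0.41, -0.66]

q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("int8 values:", q)
print(f"scale={scale:.4f}  max round-trip error={max_error:.4f}")
```

Each weight now fits in one byte instead of the four that a 32-bit float takes, and the round-trip error is bounded by half the scale. Whether that error matters for the model's outputs is the part you have to measure on real data.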

Putting it together

Low-latency serving is a chain of honest trade-offs rather than a single trick. You start from a frozen model, set a latency budget against the percentile that matters, then spend it deliberately: batch to feed the chip, cache to avoid repeated work, retrieve to ground the answer in fresh facts, and compress the model or move it to the edge when the round trip itself is the cost. None of these is a silver bullet: each buys one thing by paying with another.

And shipping the service is not the finish line. Real-world inputs drift away from your training distribution over time, so the next guides cover monitoring for data and concept drift as well as shrinking models further. A model that serves fast but is quietly wrong is worse than a slow one, so speed is only worth chasing once the answers stay correct.