Retrieval-Augmented Generation (RAG)

The problem: a brilliant intern with no access to your files

By now you understand that a large language model is a next-token predictor trained on a frozen snapshot of text. That snapshot is the model's entire world. Ask it about your company's refund policy, last week's incident report, or a contract it has never seen, and it has two choices: admit it doesn't know, or — far more often — produce a fluent, plausible, completely fabricated answer. That confident fabrication is called hallucination, and it is not a bug you can prompt away.

Think of the model as a brilliant new intern who has read most of the public internet but has zero access to your filing cabinet. The intern is not lying on purpose; they simply fill gaps with what sounds right. The fix is obvious and human: before asking the question, hand them the relevant pages. Retrieval-Augmented Generation is exactly that — fetch the right documents, paste them into the prompt, then ask. The model reasons over text you supplied instead of guessing from memory.

Embeddings: turning meaning into coordinates

To hand the model the right pages, you first have to find them. Keyword search fails here: a user asks "how do I get my money back?" but your policy says "refund eligibility." No words overlap, yet the meaning is identical. The trick that makes RAG work is the embedding — a model that converts any chunk of text into a list of numbers (a vector) so that texts with similar meaning land close together in space, regardless of the exact words.

You met this idea earlier on the ladder with word2vec: "king − man + woman ≈ queen." Modern embedding models do the same for whole sentences and paragraphs, using the same transformer machinery as the LLM itself. Each text becomes a point in a space of hundreds or thousands of dimensions. "Closeness" is then just geometry: typically the cosine similarity between two vectors, which is the cosine of the angle between them and ignores their length. A score near 1 means the vectors point the same way; a score near 0 means they are unrelated.

query:  "how do I get my money back?"     -> [ 0.21, -0.05, 0.88, ...]
chunk:  "Refund eligibility and process"  -> [ 0.19, -0.07, 0.85, ...]
chunk:  "Office holiday schedule 2026"    -> [-0.40,  0.60, 0.02, ...]

cosine(query, refund_chunk)  = 0.94   <- near, retrieve this
cosine(query, holiday_chunk) = 0.11   <- far, ignore

Texts that use different words but mean the same thing land close together; unrelated text lands far away. Retrieval is just "find the nearest vectors."
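To make the geometry concrete, here is a minimal sketch of cosine similarity in plain Python. It uses only the first three coordinates shown above as toy vectors (an assumption for illustration), so the scores it prints will not match the 0.94 and 0.11 figures, which stand for full-length embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors (hypothetical; real embeddings are far longer).
query = [0.21, -0.05, 0.88]
refund_chunk = [0.19, -0.07, 0.85]
holiday_chunk = [-0.4, 0.6, 0.02]

print(cosine_similarity(query, refund_chunk))   # close to 1: near, retrieve
print(cosine_similarity(query, holiday_chunk))  # close to 0 or below: far, ignore
```

Because the function divides by both lengths, doubling a vector does not change its score; only direction matters.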

The vector database and the retrieval pipeline

If you have ten documents, you can compare the query against every one. But a real corpus can have millions of chunks, and scoring the query against each of them on every request is too slow for an interactive system. A vector database solves this: it stores all your embeddings and, given a query vector, returns the nearest neighbors in milliseconds using approximate-nearest-neighbor indexes, which give up a small amount of accuracy (an occasional true neighbor is missed) in exchange for not scanning every vector. It is the search engine that makes retrieval over large data practical.
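For contrast, this is roughly what the "compare against every one" approach looks like: an exact, brute-force top-k search whose cost grows linearly with the number of stored chunks. It is a sketch of the baseline a vector database replaces, not of how any particular database works; the chunk ids and toy vectors are made up.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity for non-zero, equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k_exact(query_vec, index, k=3):
    """Score every stored vector against the query: O(N) work per request."""
    scored = ((cosine(query_vec, vec), chunk_id) for chunk_id, vec in index.items())
    return heapq.nlargest(k, scored)  # list of (score, chunk_id), best first

# Hypothetical in-memory "index": chunk id -> embedding.
index = {
    "refund-policy": [0.19, -0.07, 0.85],
    "holiday-schedule": [-0.4, 0.6, 0.02],
    "returns-faq": [0.25, 0.01, 0.80],
}

results = top_k_exact([0.21, -0.05, 0.88], index, k=2)
for score, chunk_id in results:
    print(f"{chunk_id}: {score:.3f}")
```

With three chunks this is instant. With millions, every query pays for millions of dot products, which is the cost an approximate index is designed to avoid.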

The whole thing runs in two phases. First, offline, you do the slow work once: load your documents, split them into chunks, embed each chunk, and store the vectors. Then, online, every user question triggers the fast loop: embed the question, find the nearest chunks, paste them into the prompt, and ask the model to answer using only that supplied text.

Chunk — split each document into passages of a few hundred words. Too big and you waste context window and dilute relevance; too small and you sever the meaning across the cut.
Embed & index — run every chunk through the embedding model and store the vectors (plus the original text) in the vector database.
Retrieve — at question time, embed the query and pull back the top-k most similar chunks (often k = 3 to 10).
Augment & generate — stitch the retrieved chunks into the prompt with an instruction like "answer using only the context below," then let the LLM write the final answer, ideally citing which chunk it used.
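The first and last steps above can be sketched without any external service. The snippet below is one possible approach under stated assumptions: it packs whole paragraphs into chunks under a word budget, and it assembles a prompt with numbered chunks so the model has something to cite. The function names, the word budget, and the sample text are all hypothetical.

```python
def chunk_by_paragraph(text, max_words=200):
    """Pack whole paragraphs into chunks of at most max_words words.

    A single paragraph longer than max_words becomes its own oversized
    chunk rather than being cut mid-sentence.
    """
    chunks, current, count = [], [], 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        n = len(para.split())
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

def build_prompt(question, retrieved_chunks):
    """Stitch retrieved chunks into a grounded prompt with citable labels."""
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer using only the context below. "
        "Cite the bracketed number of each chunk you used. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Made-up sample document, three short paragraphs.
doc = (
    "Refunds are available within 30 days.\n\n"
    "Contact support to start a refund.\n\n"
    "Offices close on public holidays."
)
chunks = chunk_by_paragraph(doc, max_words=12)
prompt = build_prompt("How do I get my money back?", chunks[:1])
print(prompt)
```

Splitting on paragraph boundaries is a simple way to avoid severing meaning across a cut; a production system would likely also respect headings and handle oversized paragraphs.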

Why this cuts hallucination — and why it doesn't eliminate it

RAG reduces hallucination because it changes the task. Without it, the model must recall a fact from compressed, lossy memory — a recall task it often fails. With it, the model performs reading comprehension: the answer is sitting right there in the prompt, so "copy and rephrase the relevant sentence" is far easier and far more reliable than "remember everything you ever read." As a bonus, you can show the user the source passage, which makes the answer auditable in a way a bare LLM never is.

But be honest about the limits. RAG is only as good as its retrieval. If the right chunk never gets fetched — bad chunking, a weak embedding model, an ambiguous query — the model answers from the wrong context or falls back on its old guessing. And even with perfect context, a model can still contradict it, misread it, or blend it with a stale memory. RAG dramatically lowers the hallucination rate; it does not drive it to zero. Treat "grounded" as "better sourced," not "guaranteed true."

Building one well — and what to ignore

A first RAG system is genuinely easy to build, and that is its charm — a weekend gets you a working prototype. Making it reliable is the real craft, and almost all of that craft lives in retrieval, not the model. Spend your effort on how you chunk (respect headings and paragraphs), how you embed (pick a strong, current embedding model), and on a re-ranking step that re-scores the top candidates with a sharper model before they reach the prompt. A hybrid of keyword search and vector search usually beats either alone.
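One common way to combine keyword and vector results is reciprocal rank fusion (RRF), which merges ranked lists using only positions, so the two scoring scales never need to be reconciled. The sketch below is an illustration of that technique rather than a method this guide prescribes; the chunk ids are invented, and k=60 is a conventional default, not a tuned value.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of chunk ids; each list contributes 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for stable output.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))

# Hypothetical top-3 results from each retriever, best first.
keyword_hits = ["refund-policy", "billing-faq", "holiday-schedule"]
vector_hits = ["returns-faq", "refund-policy", "billing-faq"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Chunks that both retrievers rank highly rise to the top, while a chunk found by only one retriever still survives as a candidate for the re-ranking step.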

Evaluate it like the engineering artifact it is, not by vibes. Build a small set of real questions with known answers and measure two things separately: did retrieval surface the right chunk (a search-quality question), and did the model answer faithfully from it (a generation-quality question)? Keeping these two scores apart tells you where to fix — a habit the next guides on evaluation lean on heavily. A simple keyword baseline is also worth keeping around to prove your vector pipeline actually earns its complexity.
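A minimal sketch of keeping the two scores apart might look like the following. The evaluation set, the retriever, and the answering function are all stand-ins invented for illustration, and the substring match used to judge correctness is a deliberately crude assumption; a real evaluation would use a stricter check.

```python
def retrieval_hit_rate(eval_set, retrieve, k=5):
    """Fraction of questions whose gold chunk id appears in the top-k results."""
    hits = sum(
        1 for ex in eval_set if ex["gold_chunk"] in retrieve(ex["question"])[:k]
    )
    return hits / len(eval_set)

def answer_accuracy(eval_set, answer_with_context):
    """Fraction answered correctly when the model is handed the gold chunk.

    Correctness here is a crude case-insensitive substring match (assumption).
    """
    correct = sum(
        1 for ex in eval_set
        if ex["expected"].lower()
        in answer_with_context(ex["question"], ex["gold_text"]).lower()
    )
    return correct / len(eval_set)

# Hypothetical evaluation set with known answers.
eval_set = [
    {"question": "How do I get my money back?", "gold_chunk": "refund-policy",
     "gold_text": "Contact support to start a refund.", "expected": "contact support"},
    {"question": "When is the office closed?", "gold_chunk": "holiday-schedule",
     "gold_text": "Offices close on public holidays.", "expected": "public holidays"},
]

# Stand-in retriever: canned results instead of a real vector search.
canned_results = {
    "How do I get my money back?": ["refund-policy", "billing-faq"],
    "When is the office closed?": ["billing-faq", "refund-policy"],
}

def fake_retrieve(question):
    return canned_results[question]

def fake_answer(question, context):
    # Stand-in for an LLM call: simply echoes the supplied context.
    return context

retrieval_score = retrieval_hit_rate(eval_set, fake_retrieve, k=2)
generation_score = answer_accuracy(eval_set, fake_answer)
print(f"retrieval hit rate: {retrieval_score:.2f}")
print(f"answer accuracy given gold context: {generation_score:.2f}")
```

In this toy run the generation score is perfect while retrieval misses half the questions, which points the fix at search rather than at the model.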

Finally, resist two pieces of hype. RAG is not a stepping stone to AGI or some new form of memory; it is plumbing that fetches text. And as context windows grow huge, people ask whether you can skip retrieval and just paste everything in. Sometimes, for a single small document, you can. But for any real corpus, dumping millions of tokens into every prompt is slower, costlier, and often less accurate, because models attend worse to facts buried in a giant haystack. Retrieval stays useful precisely because choosing what to read is the whole point.