Fine-Tuning vs Prompting vs RAG

Three knobs, not a ladder

By now you know a large language model is a frozen pile of parameters that, at inference time, turns your text into more text. The catch is that a base model knows a great deal about the world in general and nothing about *your* world — your documents, your tone, last Tuesday's price list. There are exactly three places you can intervene to fix that, and it helps enormously to see them as separate knobs rather than a quality ladder you climb.

First, prompting: change the text you send in. Second, retrieval (RAG): fetch relevant facts at run time and paste them into that same prompt. Third, fine-tuning: actually move the weights with more training. The first two leave the model untouched and work entirely inside the context window; only the third changes the model itself. That single distinction — does the model's memory change, or just its current view? — is the spine of everything below.

Prompting: the cheapest, fastest knob

Prompting exploits a property you met earlier — in-context learning. A capable model can pick up a new task from instructions and a few examples placed right in the prompt, with no weight changes at all. Good prompt engineering — a clear system prompt, a worked example or two, and for multi-step reasoning a nudge toward chain-of-thought — closes a surprising amount of the gap between a model's default behavior and what you want.

Why start here? Because the iteration loop is measured in seconds, the cost is just the tokens you send, and you can change your mind freely. There is no training run to wait on, no dataset to label, no separate model to host. When a new and stronger base model ships next quarter, a prompt-only system usually just inherits the upgrade. That flexibility is worth real money.

Prompting has real limits, though, and they are not stylistic. It cannot teach the model facts it never saw in pretraining — ask about your internal API and it will hallucinate a confident, wrong answer. Long prompts cost more tokens every single call and eventually crowd the context window. And prompting alone rarely produces a deeply consistent voice or format across thousands of edge cases. When you hit those walls, you reach for the other two knobs.

RAG: give it the facts at run time

When the problem is *knowledge* — the model lacks facts, or the facts change too often to bake in — the right knob is almost always retrieval-augmented generation. The shape is simple: store your documents as embeddings in a vector database, and at query time fetch the handful of chunks most relevant to the user's question, paste them into the prompt, and ask the model to answer *using those passages*. The model still does the reasoning and writing; you have just handed it an open book.

user question
   |-> embed -> search vector DB -> top-k relevant chunks
   |                                          |
   +----------------> PROMPT <----------------+
                        |
                     LLM answer (grounded in the chunks, with citations)

The RAG loop: retrieve relevant chunks, drop them into the prompt, then generate an answer grounded in those passages.

RAG earns its popularity for concrete reasons. Facts stay outside the model, so updating knowledge means editing a document, not retraining anything. Answers can cite their sources, which makes them auditable. And because the model is told to lean on retrieved text, well-built RAG measurably cuts hallucination on questions it can ground. For most "chat with our docs / answer from our knowledge base" products, RAG plus solid prompting is the entire answer — no fine-tuning required.

Fine-tuning: when you must move the weights

Fine-tuning means continuing to train an existing model on your own labeled examples, nudging the weights so the new behavior gets baked in. It is a form of transfer learning: you keep the model's hard-won general competence and specialize it. The honest test for whether you need it is sharp — fine-tune when you must change *how the model behaves*, not *what it knows*. Knowledge problems go to RAG; behavior problems go to fine-tuning.

Good reasons to fine-tune: locking in a precise output format or house style across every call; teaching a narrow skill where you have many examples but it is awkward to describe in words; or shrinking your prompt — if the instructions live in the weights, you stop paying for them on every request. Most of this is instruction tuning in spirit: showing the model many (input, desired output) pairs until the pattern sticks. You rarely need to touch all the weights; lightweight methods adjust only a small slice and get most of the benefit cheaply.

The costs are why you save it for last. You need a clean, labeled dataset — and curating that is the real work, not the training. A training run takes time and compute, you now own and must host a separate model, and you risk overfitting to your examples or eroding the model's general ability. Worst of all, a fine-tuned model's knowledge is frozen at training time; when the facts change, you are retraining, not editing a document. That last point is exactly why fine-tuning is a poor fix for knowledge problems that RAG solves cleanly.

Distillation and the cost/quality trade

There is a close cousin worth knowing: distillation. Here a large, expensive "teacher" model generates high-quality outputs, and you fine-tune a small, cheap "student" to imitate them (model distillation in the LLM world). The pitch is appealing — a small model that runs fast and cheap on your narrow task, often near the teacher's quality on that slice. The trade is that the student keeps the teacher's range only where you trained it; push it outside that and quality falls off.

Step back and the whole decision is a cost/quality trade across two very different budgets. There is upfront cost: prompting is near zero, RAG needs a retrieval pipeline to build and maintain, fine-tuning and distillation need labeled data and a training run. Then there is per-call cost at inference time — your ongoing inference cost. Long prompts and big RAG contexts cost more tokens every call; a fine-tuned or distilled small model can be dramatically cheaper per call once it exists. The right choice depends on your traffic: a one-off tool and a service answering a million calls a day point to different answers.

Start with prompting. Write a clear system prompt, add a couple of examples, measure quality against a small evaluation set. Many tasks stop right here.
If failures are missing or stale facts, add RAG. Invest in retrieval quality — chunking, search, ranking — before blaming the model.
If failures are stubborn behavior — wrong format, wrong style, a skill the prompt can't pin down — and you have examples, fine-tune. Combine it with RAG if you also need fresh facts; the knobs stack.
If a fine-tuned big model works but costs too much per call at scale, distill it into a smaller student for the narrow task.

None of this is a one-time decision, and none of it replaces measurement. Whatever you ship, hold a small evaluation set beside it and watch the numbers — the next rung is exactly about evaluating and guarding what you deploy. The three knobs are not rivals; the strongest systems prompt well, retrieve the right facts, and fine-tune only the behavior that genuinely refused to move any other way.