Fine-Tuning & RLHF: Making Models Helpful

The raw model is a mimic, not an assistant

By the end of pretraining, a large language model has read an ocean of text and learned to do exactly one thing: predict the next token. That single skill is astonishing — it absorbs grammar, facts, code, and reasoning patterns along the way. But the skill it actually optimizes is *imitation*. Hand the raw model the words "How do I bake bread?" and it may well continue with "How do I make pasta? How do I roast a chicken?" — because on the open web, questions are often followed by *more questions*, not answers.

This is the gap that the rest of this guide closes. The pretrained model already *knows* how to bake bread — the knowledge is sitting in its weights. What it lacks is the *habit* of answering when asked, of being honest instead of plausible, of refusing harmful requests. Turning a next-word mimic into a usable assistant is not about pouring in more facts. It is about reshaping its behavior so the knowledge it already has comes out when you want it.

Instruction tuning: teaching the habit of answering

The first and simplest fix is instruction tuning — a form of fine-tuning where we keep training the pretrained model, but now on a curated set of (instruction → ideal response) pairs written or vetted by humans. "Summarize this article: …" paired with a crisp summary. "Translate to French: …" paired with the translation. Tens of thousands to millions of such examples, spanning every task you want the model to handle.

Mechanically, nothing exotic happens here. It is the same next-token prediction objective and the same gradient descent used in pretraining, applied to a curated dataset that is tiny next to the raw web. Because the model already learned language and facts, a relatively small dose of these examples is enough to flip its default behavior from "continue the text" to "follow the instruction." This is a vivid case of transfer learning: almost all the heavy lifting was done in pretraining, and instruction tuning just redirects it.
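To make "the same next-token objective" concrete, here is a minimal sketch of the loss for one instruction example. It assumes one common variant in which only the response tokens are scored and the prompt tokens are masked out; the text above does not say whether that masking is used. The function name, the probabilities, and the mask are all hypothetical values chosen for illustration.

```python
import math

def response_loss(token_probs, is_response):
    """Average negative log-likelihood over the response tokens only.

    token_probs[i]: probability the model assigned to the correct i-th token.
    is_response[i]: True for tokens of the ideal response, False for the prompt.
    """
    losses = [-math.log(p) for p, keep in zip(token_probs, is_response) if keep]
    if not losses:
        raise ValueError("example has no response tokens to learn from")
    return sum(losses) / len(losses)

# Hypothetical per-token probabilities for a prompt of four tokens
# followed by a two-token ideal response.
probs = [0.20, 0.10, 0.30, 0.05, 0.60, 0.90]
mask = [False, False, False, False, True, True]  # score only the response

loss = response_loss(probs, mask)
print(round(loss, 4))
```

Gradient descent lowers this number by pushing up the probability of each response token, which is exactly what pretraining did for web text. Only the data, and in this sketch the mask, has changed.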

Instruction tuning alone already produces something that feels like an assistant: it answers questions, follows formats, switches tasks on command. But it has a ceiling. To teach by example, a human has to *write* the ideal answer — and for many questions there is no single ideal answer, just a spectrum from "better" to "worse." How polite? How detailed? How cautious? You can't easily demonstrate a vague preference. That limitation is exactly what the next step is built to handle.

RLHF: learning from preferences, not perfect answers

Here is the key insight behind RLHF (Reinforcement Learning from Human Feedback): people find it much harder to *write* the perfect answer than to *compare* two answers. So instead of asking humans to author ideal responses, we have the model generate two candidate answers and ask a human only "which is better?" Those judgments are cheaper and faster to collect than written demonstrations, and they capture the fuzzy preferences (tone, helpfulness, safety) that no single demonstration could.

Those comparisons train a second network, the reward model, whose job is to score any response with a single number predicting how much a human would like it. Then we use reinforcement learning to nudge the language model toward responses the reward model scores highly. The model becomes the agent, its replies are the actions, and the reward model's score is the reward. The classic algorithm for this step is PPO (Proximal Policy Optimization). In effect, the reward model is a learned, automated stand-in for a human rater, so the language model can practice millions of times without a person in the loop for every turn.
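How does "A is better than B" become a training signal for a network that outputs a single number? A common choice, assumed here rather than stated in the text above, is a pairwise logistic loss (often called Bradley-Terry): the loss is small when the preferred answer scores higher than the rejected one and large when the order is reversed. The scores below are made-up numbers standing in for a reward model's outputs.

```python
import math

def pairwise_loss(score_preferred, score_rejected):
    """-log(sigmoid(score_preferred - score_rejected)), computed stably."""
    margin = score_preferred - score_rejected
    # log(1 + exp(-margin)), rearranged so exp() never overflows.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical reward-model scores for one human comparison.
agrees = pairwise_loss(score_preferred=2.0, score_rejected=0.0)
disagrees = pairwise_loss(score_preferred=0.0, score_rejected=2.0)
undecided = pairwise_loss(score_preferred=1.0, score_rejected=1.0)

print(agrees < undecided < disagrees)
```

Minimizing this loss over many comparisons pushes the reward model to rank answers the way the human raters did, without anyone ever assigning an absolute score by hand.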

pretrain  ->  instruction tune  ->  collect A/B preferences
                                          |
                                   train REWARD model
                                          |
              RL: model proposes answer  -> reward scores it
                  -> nudge model toward higher-scored replies

The classic RLHF pipeline: a base model is refined in stages, with a learned reward model standing in for human judgment during the reinforcement-learning loop.
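The last stage of the pipeline can be illustrated with a toy that is deliberately much simpler than PPO. In this sketch the "model" is a probability distribution over three canned replies, the reward scores are fixed hypothetical numbers, and each step applies the exact policy gradient of the expected reward. Real pipelines sample replies, use PPO-style updates, and typically also penalize drift away from the starting model; none of that is shown here.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_step(logits, rewards, lr=0.5):
    """One gradient-ascent step on expected reward over a fixed set of replies."""
    probs = softmax(logits)
    baseline = sum(p * r for p, r in zip(probs, rewards))  # current expected reward
    # Exact gradient of expected reward w.r.t. logit i is p_i * (r_i - baseline).
    return [x + lr * p * (r - baseline) for x, p, r in zip(logits, probs, rewards)]

rewards = [0.1, 0.9, 0.4]  # hypothetical reward-model scores for three replies
logits = [0.0, 0.0, 0.0]   # the policy starts with no preference

before = softmax(logits)
for _ in range(200):
    logits = policy_step(logits, rewards)
after = softmax(logits)

print([round(p, 3) for p in before])
print([round(p, 3) for p in after])
```

Replies scored above the current average gain probability and replies scored below it lose probability, which is the "nudge" in the diagram above.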

DPO: the same goal without the reinforcement learning

RLHF works, but it is fiddly: you train a separate reward model, then run a reinforcement-learning loop on top of it that is often unstable and has several moving parts to juggle at once. Direct Preference Optimization (DPO) asks a sharp question: if all we have is "answer A is preferred over answer B," do we really need the reward model and the RL machinery in between?

DPO's answer is no. It uses a clever bit of math to fold the reward model directly into the loss function, so you can train on the preference pairs with ordinary supervised-style gradient descent. The model learns to raise the probability of the preferred answer and lower the probability of the rejected one, each measured against a frozen copy of the model it started from. There is no separate reward network and no RL loop. For many teams DPO is now the default first reach, because it is simpler, more stable, and cheaper to run, while reaching quality close to a well-tuned RLHF pipeline.
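As a rough sketch of what that folded-in loss looks like for a single preference pair: each answer's log-probability under the model being trained is compared with its log-probability under a frozen reference copy (usually the instruction-tuned starting point), and the difference between the two answers plays the role a reward margin would play in RLHF. The function name, the `beta` value, and the log-probabilities below are illustrative assumptions, not values from the text.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of full-answer log-probabilities."""
    chosen_gain = logp_chosen - ref_logp_chosen        # shift vs. the reference
    rejected_gain = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)), rearranged so exp() never overflows.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical log-probabilities; at the start the model equals its reference.
start = dpo_loss(-12.0, -11.0, ref_logp_chosen=-12.0, ref_logp_rejected=-11.0)
# Later the model favors the chosen answer more and the rejected one less.
later = dpo_loss(-9.0, -14.0, ref_logp_chosen=-12.0, ref_logp_rejected=-11.0)

print(later < start)
```

The only inputs are log-probabilities that a forward pass already provides, which is why this can be trained with an ordinary supervised-style loop. `beta` controls how strongly the model is allowed to pull away from the reference.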

Do not read DPO as "RLHF was wrong." Both pursue the *same* goal — bend the model toward what humans prefer — and both rely on exactly the same precious resource: human preference data. They differ in machinery, not in spirit. The honest summary is that the field is still actively figuring out which method wins under which conditions, and real systems often mix several techniques rather than pick one.

Alignment is shaping behavior — not adding intelligence

Step back and see what all three techniques have in common. None of them taught the model new facts about the world; the knowledge was already there after pretraining. What they did was *shape behavior* — make the model answer rather than ramble, be honest rather than merely plausible, decline harmful requests, and match a tone people find helpful. This shaping is what people mean by alignment: getting a model's behavior to track what its makers and users actually intend.

This is why a common belief gets the picture backwards. Fine-tuning and RLHF do not, in any large way, make a model *smarter* — they make it *easier to work with*. A friendly, well-aligned model and a blunt, raw one can share almost the same underlying capability; the difference you feel is mostly behavior. Conversely, alignment does not magically appear with scale. A bigger model is a more capable mimic, but it is not automatically more honest or more obedient — those traits have to be deliberately trained in, every time.

With the model now both knowledgeable and helpful, one piece of the GenAI story remains. We have shaped *what* the model tends to say; the next guide turns to *how* those shaped probabilities get turned into actual prose, through sampling choices like temperature and top-p. The pretrained brain, the aligned manners, and the sampling voice together are what you meet when you open a chat window.