JOVANA

Chain-of-Thought & Reasoning

Why telling a model to "think step by step" can change a wrong answer into a right one — and why the visible reasoning is a tool, not a window into the model's mind.

The trick that changed prompting

You already know that a large language model predicts the next token, and that the right prompt can steer it through in-context learning without changing a single weight. Chain-of-thought prompting is the most important refinement of that idea. Instead of asking for the answer directly, you ask the model to write out its intermediate steps first, and only then conclude. The technique began with few-shot prompts that included worked examples; the phrase that made it famous, "Let's think step by step," showed that a bare instruction with no examples can also lift accuracy on multi-step problems.

Why would padding the answer with words help? A model spends a roughly fixed amount of computation on each token it generates. Asking for a number straight away forces all the reasoning into the few forward passes that produce that number. When the model writes out steps, each new token becomes scratch space it can read back, so a hard problem gets spread across many small, cheap computations instead of one impossible one. The reasoning text works like notes on paper: the model can re-read what it has already written and build on it.
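The difference between the two styles of asking is easiest to see in the prompts themselves. The sketch below is only an illustration: the function names and the "Answer: <value>" convention are assumptions made for this example, not part of any particular library or API.

```python
def direct_prompt(question: str) -> str:
    """Ask for the answer only: the model gets no scratch space."""
    return f"Q: {question}\nA: The answer is"


def step_by_step_prompt(question: str) -> str:
    """Ask for intermediate steps first, then a clearly marked answer."""
    return (
        f"Q: {question}\n"
        "A: Let's think step by step. "
        "Finish with a line of the form 'Answer: <value>'."
    )


# Hypothetical question used only to show the two prompt shapes.
question = "A shop had 23 apples. It used 20 and bought 6 more. How many remain?"
print(direct_prompt(question))
print(step_by_step_prompt(question))
```

Asking for a fixed final line is a convenience, not a requirement: it makes the conclusion easy to pull out of a long chain later.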

When step-by-step actually helps

The honest rule is: chain-of-thought helps when the answer requires composing several pieces, and it does nothing, or slightly hurts, when it doesn't. Arithmetic word problems, multi-hop questions, code that must satisfy several constraints, logic puzzles: these reward writing out the work. Sentiment of a tweet, a fact you either know or don't, a single lookup: these gain nothing, and the extra text just costs you tokens and latency.

There is also a size effect worth being precise about. Reliable step-by-step reasoning shows up mainly in larger models; ask a tiny one to reason and it often just generates plausible-sounding nonsense and then a wrong answer. This is the kind of thing people loosely call an emergent ability — but treat that label with care. Much of the apparent "jump" is an artifact of harsh all-or-nothing scoring; on smoother metrics the gains often look gradual, not magical. The practical takeaway stands: don't expect chain-of-thought to rescue a small model.

A simple companion technique is self-consistency: instead of one chain, sample several with a little randomness, then take the answer the majority of chains agree on. Different reasoning paths that land on the same destination are more trustworthy than one lucky path. It costs more, but for high-stakes questions it is one of the cheapest reliability gains you can buy.
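A minimal sketch of self-consistency, assuming each chain ends with a line of the form "Answer: <value>" (a convention chosen for this example). The `sample_chain` callable stands in for a real model call with nonzero temperature; here it is replaced by canned text so the voting logic can be run on its own.

```python
import re
from collections import Counter
from typing import Callable, Optional, Tuple


def extract_answer(chain: str) -> Optional[str]:
    """Return the value from the last 'Answer: ...' line, or None if absent."""
    matches = re.findall(r"^Answer:\s*(.+?)\s*$", chain, flags=re.MULTILINE)
    return matches[-1] if matches else None


def self_consistency(
    sample_chain: Callable[[], str], n: int = 5
) -> Tuple[Optional[str], float]:
    """Sample n chains; return the majority answer and the share of chains behind it."""
    answers = [extract_answer(sample_chain()) for _ in range(n)]
    answers = [a for a in answers if a is not None]
    if not answers:
        return None, 0.0
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / n


# Canned chains standing in for three sampled model outputs (hypothetical).
canned = iter([
    "23 - 20 = 3\n3 + 6 = 9\nAnswer: 9",
    "23 + 6 = 29\n29 - 20 = 9\nAnswer: 9",
    "23 - 20 = 13\n13 + 6 = 19\nAnswer: 19",
])
answer, agreement = self_consistency(lambda: next(canned), n=3)
print(answer, agreement)
```

Returning the agreement share alongside the answer is a deliberate choice: a low share is a signal that the chains disagree and the question deserves a closer look.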

Reasoning models: training the habit in

In a normal model you have to coax the steps out with a prompt. The newer generation of reasoning models has the habit trained in: they automatically produce a long internal chain — often hidden from you, billed as separate "thinking" tokens — before answering. They are typically post-trained with reinforcement learning that rewards reasoning traces leading to correct, checkable answers (math, code, proofs), going well beyond ordinary RLHF that mainly tunes for human-preferred style.

This is a genuine shift, not just hype: on hard math and competitive coding the gains are large and repeatable. But it buys reliability with money and time. A reasoning model can spend thousands of hidden tokens before its first visible word, so it is slower and more expensive. Match the tool to the job — a reasoning model for a tricky proof or a gnarly refactor, a fast standard model for drafting an email or classifying tickets.
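To get a feel for the trade-off, here is a back-of-the-envelope sketch. Every number in it is a placeholder invented for illustration, and it assumes hidden thinking tokens are billed at the same rate as visible output tokens, which may not match any given provider.

```python
def call_cost(visible_tokens: int, thinking_tokens: int, price_per_1k: float) -> float:
    """Output cost of one call, assuming thinking tokens bill like visible ones."""
    return (visible_tokens + thinking_tokens) / 1000 * price_per_1k


# Placeholder figures, not real prices or measured token counts.
PRICE_PER_1K = 0.01
standard = call_cost(visible_tokens=200, thinking_tokens=0, price_per_1k=PRICE_PER_1K)
reasoning = call_cost(visible_tokens=200, thinking_tokens=4000, price_per_1k=PRICE_PER_1K)

multiplier = reasoning / standard
print(f"reasoning call costs about {multiplier:.0f}x the standard call")
```

The same visible answer can cost many times more once the hidden chain is counted, which is why routing easy tasks to a standard model pays off.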

The reasoning you see is not the reasoning that happened

Here is the most important caveat in this whole guide. A chain-of-thought is generated text, produced the same way every other token is. It is not a transcript of the computation that actually produced the answer. Models can reach a correct answer while writing reasoning that is partly wrong, and — more troubling — can write a clean, confident chain that rationalizes an answer they arrived at for entirely different reasons. The explanation is sometimes a story told after the fact.

This matters because a long, articulate chain feels authoritative, and that feeling is exactly the trap. The fluency of the steps does not certify the truth of the conclusion. A confidently reasoned hallucination — invented citations, a plausible but nonexistent API, a number that follows from a wrong premise stated three steps up — is still a hallucination, just dressed for a job interview.

Q: A shop had 23 apples. It used 20 for lunch and bought 6 more.
   How many apples remain?

Thinking:
  start = 23
  after lunch = 23 - 20 = 3
  after buying = 3 + 6 = 9
Answer: 9        <- correct, and the steps are checkable

(Now imagine step 1 read "start = 32". Every later step would
 look just as tidy, and the final answer would be wrong.)

The visible steps are a checkable scratchpad, but only if you actually check the first premise, not just admire the arithmetic after it.
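Checking the scratchpad can be partly mechanical. The sketch below assumes steps written in the simple "name = value" or "name = a op b = result" shape used in the apple example; that format and the `check_chain` helper are inventions for this illustration. It confirms that each premise appears in the question and that each line of arithmetic is right. It does not check that later steps reuse the earlier values, so it is a first filter, not a proof.

```python
import re
from typing import List


def check_chain(question: str, steps: List[str]) -> List[str]:
    """Return a list of problems found in the steps (empty list means none found)."""
    problems = []
    question_numbers = set(re.findall(r"\d+", question))
    for step in steps:
        parts = [p.strip() for p in step.split("=")]
        # A premise looks like 'start = 23': its value must come from the question.
        if len(parts) == 2 and parts[1] not in question_numbers:
            problems.append(f"premise not in question: {step}")
        # A computed step looks like 'after lunch = 23 - 20 = 3'.
        if len(parts) == 3:
            m = re.fullmatch(r"(\d+)\s*([+-])\s*(\d+)", parts[1])
            if m is None:
                problems.append(f"cannot parse: {step}")
                continue
            a, op, b = int(m.group(1)), m.group(2), int(m.group(3))
            expected = a + b if op == "+" else a - b
            if str(expected) != parts[2]:
                problems.append(f"bad arithmetic: {step}")
    return problems


question = "A shop had 23 apples. It used 20 for lunch and bought 6 more."
good = ["start = 23", "after lunch = 23 - 20 = 3", "after buying = 3 + 6 = 9"]
bad = ["start = 32", "after lunch = 32 - 20 = 12", "after buying = 12 + 6 = 18"]

print(check_chain(question, good))
print(check_chain(question, bad))
```

Note that the second chain's arithmetic is flawless; only the comparison against the question catches it.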

Using it well in practice

Reasoning is one technique in your prompt engineering toolkit, and it composes with the others you have met. Pair it with few-shot examples that show the reasoning style you want; anchor it with a clear system prompt that sets the role and the rules. And mind your context window: long chains and several worked examples eat tokens fast, which is real money and a real latency cost on every call.

  1. Ask first: does this task genuinely have multiple steps? If it is a single lookup or a one-line judgment, skip chain-of-thought; it only adds cost.
  2. If it is multi-step, either invoke a reasoning model or add a plain "reason step by step before answering" instruction to a standard one.
  3. For high stakes, sample several chains and take the majority answer (self-consistency); disagreement among them is itself a useful warning.
  4. Verify the conclusion against ground truth or a tool — never trust it just because the steps read smoothly.
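The four steps above can be sketched as one small function. This is an outline under stated assumptions, not a definitive implementation: `ask` and `verify` are hypothetical callables you would supply (a model call and a ground-truth or tool check), and the "Answer: <value>" final line is a convention chosen for the example.

```python
from collections import Counter
from typing import Callable, Optional


def answer_with_checks(
    question: str,
    is_multi_step: bool,
    ask: Callable[[str], str],
    verify: Callable[[str], bool],
    n_samples: int = 1,
) -> Optional[str]:
    """Follow the four steps; return an answer only if it passes verification."""
    # Steps 1-2: add the reasoning instruction only when the task is multi-step.
    prompt = question
    if is_multi_step:
        prompt += (
            "\nReason step by step before answering, "
            "then end with 'Answer: <value>'."
        )
    # Step 3: sample one or more replies and keep the majority final answer.
    finals = [
        ask(prompt).rsplit("Answer:", 1)[-1].strip() for _ in range(n_samples)
    ]
    winner, _ = Counter(finals).most_common(1)[0]
    # Step 4: smooth-reading steps are not evidence; only verification is.
    return winner if verify(winner) else None


# Canned replies standing in for a real model (hypothetical).
replies = iter(["3 + 6 = 9\nAnswer: 9", "Answer: 9", "Answer: 19"])
result = answer_with_checks(
    "How many apples remain?",
    is_multi_step=True,
    ask=lambda prompt: next(replies),
    verify=lambda answer: answer == "9",
    n_samples=3,
)
print(result)
```

Returning `None` on a failed check, rather than the unverified answer, keeps the caller from mistaking a fluent chain for a confirmed result.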

Step back and the throughline is clear: more computation, applied at the right moment, buys more reliable reasoning — an echo of the bitter lesson and of the scaling laws you will meet on the frontier rungs. But computation is not comprehension. The next guide takes the natural next step: when the model lacks a fact, stop hoping it will reason its way there and instead hand it the source, through retrieval.