From rewards to a sense of worth
In the last rung you met the observe-act-reward loop: an agent sees a state, picks an action, lands in a new state, and collects a reward. The catch is that rewards are short-sighted. A move that earns nothing right now — a pawn sacrifice, a detour around a wall — might be exactly what wins the game ten steps later. So the agent needs something deeper than the immediate reward: it needs a sense of how *good a situation is*, all things considered.
That sense of worth is the value function. The value of a state is the total reward you can expect to gather from that state onward, assuming you keep following some policy (your way of choosing actions). Because the future stretches on, we don't add up rewards naively — we apply a discount factor so that a reward arriving sooner counts a little more than the same reward arriving later. The discounted sum of all future rewards has its own name, the return, and value is simply the *expected return*.
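To see what discounting does to a stream of rewards, here is a minimal sketch that computes the return for a short, finite reward sequence. The function name `discounted_return`, the reward lists, and the value 0.9 for the discount factor are all illustrative assumptions, not something fixed by the text.

```python
def discounted_return(rewards, gamma):
    """Return r_0 + gamma*r_1 + gamma**2*r_2 + ... for a finite reward list."""
    total = 0.0
    # Fold from the last reward backwards: G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Nothing for two steps, then a payoff of 1 on the third step.
late_payoff = discounted_return([0.0, 0.0, 1.0], gamma=0.9)
# The same payoff arriving immediately.
early_payoff = discounted_return([1.0, 0.0, 0.0], gamma=0.9)
print(late_payoff, early_payoff)
```

With a discount factor of 0.9, the payoff that arrives two steps late is worth about 0.9 × 0.9 = 0.81, while the immediate one keeps its full value of 1. The value of a state is the average of such returns over everything that could happen from there under the policy.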
The Q-function: value with an action attached
Knowing a state is valuable is nice, but to *act* you need to compare your options. That's what the Q-function does. The Q-value Q(s, a) answers a sharper question: 'If I take action a right now in state s, and then act as well as possible afterward, how much total return should I expect?' The Q in Q-learning stands for *quality*: the quality of a state-action pair.
This small shift is enormously practical. If you hold a good Q-function, choosing how to act becomes trivial: look at the current state, score every available action, and pick the highest. There is no planning and no search through the future, because the Q-values have already folded all of that in. And when you act greedily like this, a state's value is just the Q-value of its best action, which ties the two functions neatly together.
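As a small illustration of "score every action and pick the highest", the sketch below stores Q-values in a nested dictionary. The state name, the action names, and the numbers are made up for the example, and the helper names are hypothetical.

```python
# Hypothetical Q-values for a single state; the numbers are invented.
q_table = {
    "doorway": {"left": 0.2, "right": 0.7, "wait": 0.1},
}


def greedy_action(q_table, state):
    """Pick the action with the highest Q-value in this state."""
    action_values = q_table[state]
    return max(action_values, key=action_values.get)


def greedy_state_value(q_table, state):
    """Value of a state under greedy behavior: the Q-value of its best action."""
    return max(q_table[state].values())


print(greedy_action(q_table, "doorway"))       # right
print(greedy_state_value(q_table, "doorway"))  # 0.7
```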
All of this rests on a quiet assumption from the Markov decision process you saw earlier: the state captures everything that matters about the future. Given the present state, the past adds nothing. That's what makes a single number Q(s, a) — with no memory of how you arrived — enough to act on.
The Bellman equation: value defined by its own future
Here's the trick that makes all of this computable. The value of where you are can be split into two pieces: the reward you get right now, plus the (discounted) value of wherever you land next. That self-referential identity — value defined in terms of value, one step away — is the Bellman equation. It turns an impossible-looking sum over an infinite future into a relationship between neighboring states.
$$Q(s, a) = r + \gamma \max_{a'} Q(s', a')$$
- r: the reward received now
- s': the state you land in next
- γ: the discount factor (0 < γ < 1)
- max over a': assume you act greedily afterward
Read it slowly: the best Q-value at (s, a) equals the immediate reward r, plus γ times the best Q-value available at the next state s'. If you knew the true Q-function, both sides would match perfectly. So the gap between the two sides, called the temporal-difference error, is a measurable signal of how wrong your current estimate is. Shrinking that gap, over and over, is the whole game.
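To make that gap concrete, here is the temporal-difference error written out with invented numbers plugged in. The symbol δ and the sample values are assumptions chosen only for illustration.

$$\delta = \underbrace{r + \gamma \max_{a'} Q(s', a')}_{\text{Bellman target}} - Q(s, a)$$

Suppose the current estimate is Q(s, a) = 0.5, the reward is r = 1, the discount factor is γ = 0.9, and the best Q-value at the next state is 2. Then:

$$\delta = 1 + 0.9 \times 2 - 0.5 = 2.3$$

A positive δ says the estimate was too low, a negative δ says it was too high, and δ = 0 means the two sides of the Bellman equation already agree for this transition.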
Q-learning: improving by surprise
Q-learning turns the Bellman idea into a practical recipe. Keep a table (or later, a network) of Q-value estimates, all starting as rough guesses. Then live in the world: take an action, watch the reward and the next state, and nudge your estimate a small step toward what the Bellman equation predicted. The size of that nudge is set by a learning rate. Each experience is a little correction; thousands of them sand the estimates down toward the truth.
- Observe the current state s and pick an action a (mostly your best guess, sometimes a random one — see below).
- Do it. Record the reward r you got and the new state s' you landed in.
- Compute the Bellman target: r + γ × (best Q-value at s').
- Move Q(s, a) a small step toward that target. Repeat for the rest of your life.
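The four steps above can be sketched as a short loop. Everything concrete here is an assumption made for the example: the toy environment (a five-cell corridor with a reward of 1 at the right end), the constants `GAMMA`, `ALPHA`, and `EPSILON`, the random tie-breaking, and the choice to treat the value of the goal state as zero because the episode ends there.

```python
import random

N_STATES = 5             # cells 0..4 in a row; cell 4 is the goal
ACTIONS = ["left", "right"]
GAMMA = 0.9              # discount factor
ALPHA = 0.5              # learning rate: size of each nudge
EPSILON = 0.2            # chance of picking a random action


def step(state, action):
    """Toy environment: move one cell; reward 1 only on reaching the goal."""
    if action == "right":
        next_state = min(state + 1, N_STATES - 1)
    else:
        next_state = max(state - 1, 0)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def choose_action(q, state, rng):
    """Mostly the best-rated action, sometimes a random one; ties broken randomly."""
    if rng.random() < EPSILON:
        return rng.choice(ACTIONS)
    best = max(q[state].values())
    return rng.choice([a for a in ACTIONS if q[state][a] == best])


def train(episodes=300, max_steps=100, seed=0):
    rng = random.Random(seed)
    # Rough initial guess: every Q-value starts at zero.
    q = {s: {a: 0.0 for a in ACTIONS} for s in range(N_STATES)}
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            action = choose_action(q, state, rng)            # 1. pick an action
            next_state, reward, done = step(state, action)   # 2. act and observe
            # 3. Bellman target; nothing follows the goal, so its value is 0.
            best_next = 0.0 if done else max(q[next_state].values())
            target = reward + GAMMA * best_next
            # 4. Move the estimate a fraction ALPHA of the way to the target.
            q[state][action] += ALPHA * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


q = train()
policy = {s: max(q[s], key=q[s].get) for s in range(N_STATES - 1)}
print(policy)
```

The environment function `step` is only ever called, never inspected: the learner sees transitions and rewards, and the table `q` is all it keeps.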
Notice what Q-learning never needs: a map. It doesn't know in advance which states its actions lead to or how rewards are doled out. It learns purely from lived transitions. That makes it model-free, the opposite of model-based RL, where the agent first builds a model of the world's dynamics and plans against it. Model-free methods are simpler and robust to messy worlds, but they pay for it in raw experience: they often need a great many episodes to learn what a good model could reason out.
Epsilon-greedy: making room for surprise
There's a trap hiding in 'always pick the highest Q-value.' Early on your estimates are wrong, so the action that *looks* best may just be the one you happened to try first. If you only ever exploit your current best guess, you never discover the genuinely better move sitting one door over. This is the exploration-vs-exploitation tension at the core of all learning by trial and error.
The simplest fix is also one of the most durable: epsilon-greedy. Most of the time (probability 1 − ε) take the action your Q-values rate highest — exploit. But a small fraction of the time (probability ε) ignore them and act at random — explore. A common trick is to start with ε large, so the agent roams freely while clueless, then shrink ε over time as its judgments sharpen, settling into mostly-greedy behavior.
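Here is a minimal sketch of epsilon-greedy selection with a shrinking ε. The exponential decay schedule, its parameters (`start`, `end`, `decay`), and the sample Q-values are assumptions; other schedules, such as linear decay, are equally plausible.

```python
import random


def epsilon_greedy(action_values, epsilon, rng=random):
    """With probability epsilon explore at random, otherwise exploit."""
    actions = list(action_values)
    if rng.random() < epsilon:
        return rng.choice(actions)                  # explore
    return max(actions, key=action_values.get)      # exploit


def decayed_epsilon(episode, start=1.0, end=0.05, decay=0.99):
    """Shrink epsilon exponentially per episode, never going below `end`."""
    return max(end, start * decay ** episode)


action_values = {"left": 0.2, "right": 0.7}  # invented Q-values for one state
for episode in (0, 100, 500):
    eps = decayed_epsilon(episode)
    print(episode, round(eps, 3), epsilon_greedy(action_values, eps))
```

Keeping a small floor on ε, rather than letting it reach zero, means the agent never stops exploring entirely. Whether that is desirable depends on the task.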
On-policy vs off-policy: who you learn about
Here's a subtlety worth slowing down for. While exploring, the agent sometimes acts randomly — but what is it actually *learning about*? This is the on-policy vs off-policy distinction. Q-learning is off-policy: it explores with epsilon-greedy, yet its Bellman target uses max — the value of the *best* next action, not the one exploration might force it to take. It behaves one way while learning the value of behaving optimally. The policy it improves and the policy it acts with are allowed to differ.
Its close cousin SARSA is on-policy: instead of max, it plugs in the value of the action it *actually* took next — random exploration and all. So SARSA learns the value of the cautious, occasionally-clumsy policy it really follows, while Q-learning learns the value of the bold optimal policy it aspires to. Near a cliff, SARSA tends to keep a safer margin (because it knows it will sometimes stumble), while Q-learning hugs the edge. Neither is 'right' — they answer different questions.
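The difference between the two comes down to one term in the target. The sketch below puts them side by side. The cliff-side Q-values, the action names, and the discount factor are invented for illustration.

```python
GAMMA = 0.9  # discount factor (illustrative value)


def q_learning_target(reward, next_action_values):
    """Off-policy: bootstrap from the best next action, whatever is taken."""
    return reward + GAMMA * max(next_action_values.values())


def sarsa_target(reward, next_action_values, next_action):
    """On-policy: bootstrap from the action the agent really takes next."""
    return reward + GAMMA * next_action_values[next_action]


# The next state sits by a cliff: "forward" is safe, "sideways" falls off.
next_action_values = {"forward": 1.0, "sideways": -10.0}
explored_action = "sideways"  # suppose exploration forced this move

print(q_learning_target(0.0, next_action_values))
print(sarsa_target(0.0, next_action_values, explored_action))
```

On this transition the Q-learning target is about 0.9, because it ignores the stumble, while the SARSA target is about -9, because it charges the current state for the fall. Averaged over many visits, that is how SARSA comes to rate states near the edge lower.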
Why care? Off-policy learning is what lets an agent learn from *old* experience — replays of past episodes, even another agent's logs — because it isn't tied to whatever policy generated the data. That property is exactly what makes deep Q-networks, which replace the Q-table with a neural network, practical: they store a buffer of past transitions and learn from them repeatedly. That's the bridge into the deep RL of the final rung.
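A buffer of past transitions can be sketched in a few lines. The class name `ReplayBuffer`, its methods, and the tuple layout are assumptions for this example rather than the interface of any particular library.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of past transitions; the oldest are dropped when full."""

    def __init__(self, capacity, seed=None):
        self.transitions = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.transitions.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Draw a random batch without replacement, capped at what is stored."""
        k = min(batch_size, len(self.transitions))
        return self.rng.sample(list(self.transitions), k)

    def __len__(self):
        return len(self.transitions)


buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buffer.add(state=t, action="right", reward=0.0, next_state=t + 1, done=False)

batch = buffer.sample(batch_size=2)
print(len(buffer), batch)
```

Each sampled transition can be fed through the same Bellman target as a fresh one. Because the target uses max rather than the action the old policy went on to take, it does not matter that the data came from an earlier, more exploratory version of the agent.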