Learning from Reward

A different way to learn

Everything you have learned on this ladder so far has had a teacher standing behind it. In supervised learning there is a dataset of examples, each carrying the correct label, and the model is nudged toward those answers by gradient descent. Reinforcement learning throws that teacher away. There is no answer key — only a world to act in and a faint signal that says, after the fact, whether things went better or worse.

Think of teaching a dog to fetch. You never hand it the equation for fetching. It tries something, you say "good" or nothing at all, and over many repetitions it shapes its own behaviour around what earns praise. Reinforcement learning formalises exactly this: an agent learns to behave by interacting with the world and chasing reward. It is learning by doing, and it is the branch of machine learning that most resembles how animals, and children, actually learn.

The agent–environment loop

Strip away the details and reinforcement learning is one loop running over and over. The agent–environment loop has two characters. The agent observes the current situation, picks something to do, and the environment — everything outside the agent — receives that choice and hands back a new situation plus a number. That number is the reward. Then the loop turns again. Tick by tick, the agent builds up experience about which choices in which situations tend to pay off.

Three words name the moving parts. The state is what the agent knows about the situation right now — the board position in a game, the joint angles of a robot arm, the pixels on a screen. The action is the choice it makes from the options available in that state. The reward is the scalar feedback the environment returns. State in, action out, reward and next state back: that is the whole conversation.

state = environment.observe()            # initial observation, taken once
loop:
  action  = agent.decide(state)        # the policy
  reward, next_state = environment.step(action)
  agent.learn(state, action, reward, next_state)
  state = next_state

The whole of RL is this loop: observe a state, choose an action, receive a reward and the next state, learn, repeat.
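To make the loop concrete, here is a minimal runnable sketch in Python. Everything in it is an assumption made for illustration: `CorridorEnv`, `RandomAgent`, and `run_episode` are hypothetical names, not part of any library, and `step` returns an extra `done` flag so the episode can end. The environment is a corridor of five cells; the agent starts at the left end and earns a reward of 1 on reaching the right end.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: cells 0..4, reward 1.0 on reaching cell 4."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def observe(self):
        return self.position

    def step(self, action):
        # action is -1 (left) or +1 (right); the walls clamp the position.
        self.position = max(0, min(self.length - 1, self.position + action))
        done = self.position == self.length - 1
        reward = 1.0 if done else 0.0
        return reward, self.position, done


class RandomAgent:
    """Hypothetical agent whose policy ignores the state and acts at random."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.experience = []

    def decide(self, state):
        return self.rng.choice([-1, +1])

    def learn(self, state, action, reward, next_state):
        # A real agent would update its policy here; this one only records.
        self.experience.append((state, action, reward, next_state))


def run_episode(env, agent, max_steps=1000):
    state = env.observe()
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.decide(state)
        reward, next_state, done = env.step(action)
        agent.learn(state, action, reward, next_state)
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward


env, agent = CorridorEnv(), RandomAgent(seed=0)
total = run_episode(env, agent)
print(total, len(agent.experience))
```

The agent here never improves, because its `learn` method only stores transitions. Replacing that method with a real update rule is what turns the loop into learning.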

The rule the agent uses to turn a state into an action is its policy, and the policy is what we are ultimately trying to learn. A policy can be a lookup table, or it can be a neural network mapping pixels to button presses. When the environment behaves like a Markov decision process (MDP), meaning the next state depends only on the current state and action and not on the whole history, the loop becomes mathematically tractable, which is why almost all RL theory is built on the MDP framework.

Reward is not the same as the goal

Here is the subtle part. The agent does not chase the immediate reward; it chases the total reward collected over the long run, called the return. A move that scores nothing now can be the move that wins the game ten steps later. To compare a reward today against a reward far in the future, RL uses a discount factor, a number between zero and one (usually close to one) that multiplies a reward once for every step of delay, so rewards shrink the further off they are. With bounded rewards and a discount factor below one, this keeps the sum finite, and it gently encourages the agent to prefer sooner payoffs without ignoring the future.
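A short Python sketch of the arithmetic may help. It computes the discounted return as r0 + gamma * r1 + gamma^2 * r2 + ..., with gamma as the discount factor. The value 0.9 and the two reward sequences are assumptions chosen for illustration, and `discounted_return` is a hypothetical helper name.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, each scaled by gamma raised to its delay in steps."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


gamma = 0.9  # assumed discount factor, for illustration only

grab_now = [1.0, 0.0, 0.0, 0.0]       # small reward immediately
wait_for_more = [0.0, 0.0, 0.0, 5.0]  # larger reward three steps later

print(discounted_return(grab_now, gamma))       # 1.0
print(discounted_return(wait_for_more, gamma))  # about 3.645, i.e. 0.9**3 * 5

# A reward of 1 on every step would sum without bound if undiscounted.
# Discounted, it approaches 1 / (1 - gamma), which is 10 for gamma = 0.9.
endless = discounted_return([1.0] * 1000, gamma)
print(round(endless, 6))
```

The delayed payoff still wins here, but only because it is large enough to survive three steps of discounting. With a smaller gamma the agent would prefer the immediate reward.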

This long-horizon credit problem is what makes RL hard and interesting. If you lose a chess game after forty moves, which move was the mistake? The reward arrives all at once at the end, yet it must somehow be spread back across every decision that led there. The whole machinery of value functions and the algorithms in the rungs ahead exists to solve this one puzzle: assigning credit and blame to actions whose consequences are delayed.
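One simple way to spread a delayed reward back over earlier decisions is to compute the discounted return from each step onward. The sketch below does that for a five-move game lost on the last move. It is an illustration under assumed values (a reward of -1 for the loss, a discount factor of 0.5), not the method the later rungs develop, and `returns_to_go` is a hypothetical helper name.

```python
def returns_to_go(rewards, gamma):
    """Discounted return from each step onward, computed back to front."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # The return at step t is its own reward plus the discounted
        # return of everything that follows.
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# A five-move game: nothing happens until the loss on the final move.
rewards = [0.0, 0.0, 0.0, 0.0, -1.0]
print(returns_to_go(rewards, gamma=0.5))
# [-0.0625, -0.125, -0.25, -0.5, -1.0]
```

Every move receives some share of the blame, with earlier moves blamed less. This does not say which move was actually the mistake. It only gives each decision a number to learn from, which is the starting point for the value functions mentioned above.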

Explore or exploit?

Because there is no teacher, the agent faces a dilemma a supervised model never meets. To earn reward it should pick the action it currently believes is best — that is exploitation. But its beliefs come only from what it has tried, so the genuinely best action might be one it has barely sampled. To find out, it must sometimes pick something that looks worse — that is exploration. This is the exploration–exploitation tradeoff, and balancing it is unavoidable.

You live this every time you order at a familiar restaurant. Do you get the dish you know is great (exploit), or try the special and risk disappointment for the chance of a new favourite (explore)? The cleanest miniature of this problem is the multi-armed bandit: several slot machines with unknown payouts, and a fixed budget of pulls. Pull only the machine that has paid best so far and you may never discover a better one; pull at random and you waste money learning what you already knew.
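The first failure mode, never discovering a better machine, is easy to reproduce. The sketch below is a deliberately simplified illustration: the two payouts are hypothetical and deterministic, estimates start at zero, and ties go to the lowest-numbered arm. `pure_greedy` is a hypothetical function name.

```python
def pure_greedy(payouts, pulls):
    """Always pull the arm with the highest estimated payout so far."""
    estimates = [0.0] * len(payouts)
    counts = [0] * len(payouts)
    for _ in range(pulls):
        # index(max(...)) breaks ties in favour of the lowest-numbered arm.
        arm = estimates.index(max(estimates))
        reward = payouts[arm]
        counts[arm] += 1
        # Running average of the rewards seen on this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts


# Hypothetical true payouts, unknown to the agent: arm 1 is twice as good.
estimates, counts = pure_greedy(payouts=[0.5, 1.0], pulls=100)
print(counts)     # [100, 0]
print(estimates)  # [0.5, 0.0]
```

The first pull happens to land on arm 0, its estimate rises above zero, and from then on the better arm is never tried.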

The simplest workable answer is epsilon-greedy: most of the time take the best-known action, but with a small probability epsilon pick at random instead. Start with a lot of randomness while you know little, and let epsilon decay as your estimates sharpen. It is crude, but it captures the whole idea — and it is enough to make the value-based methods in the next rungs actually work.
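Here is a minimal sketch of epsilon-greedy with a decaying epsilon, applied to a three-armed bandit. All the numbers are assumptions chosen for illustration: the arm means, the Gaussian noise, the starting and final epsilon, and the multiplicative decay rate. The function name `epsilon_greedy_bandit` is hypothetical too.

```python
import random


def epsilon_greedy_bandit(true_means, steps=5000, eps_start=1.0,
                          eps_end=0.05, decay=0.999, seed=0):
    """Estimate arm values while pulling arms with an epsilon-greedy rule."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    estimates = [0.0] * n_arms
    counts = [0] * n_arms
    epsilon = eps_start
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: pick any arm at random
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        # Noisy payout: the arm's true mean plus unit-variance Gaussian noise.
        reward = rng.gauss(true_means[arm], 1.0)
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        # Shrink epsilon each step, but never below the floor eps_end.
        epsilon = max(eps_end, epsilon * decay)
    return estimates, counts


estimates, counts = epsilon_greedy_bandit([0.2, 0.5, 1.5])
print([round(e, 2) for e in estimates])
print(counts)
```

Early pulls are spread across all arms while epsilon is high. As it decays, most pulls concentrate on the arm with the best estimate, while the floor `eps_end` keeps a small amount of exploration going.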

How RL really differs from supervised learning

It is tempting to file RL as just another flavour of supervised learning with a weird loss. It is not, and three differences run deep. First, the feedback is evaluative, not instructive: reward tells you how good your action was, never what the right action would have been. A label in supervised learning points straight at the answer; a reward only grades the one you tried.

Second, the data is not given — the agent generates its own. A supervised model trains on a fixed dataset that sits still while it learns. An RL agent's experience comes from its own choices, so a bad early policy collects bad data, which teaches a bad policy: a feedback loop with no equivalent in supervised land. Third, the examples are not independent. Each state flows from the last, so RL squarely breaks the comfortable assumption that your data points are independent and identically distributed.
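The third difference can be seen in a few lines of Python. The sketch below compares independently drawn samples with a random walk, used here as a stand-in for a trajectory in which each state is the previous state plus a small move. The random walk, the sample sizes, and the helper `lag1_correlation` are assumptions for illustration, not a model of any particular environment.

```python
import random


def lag1_correlation(xs):
    """Pearson correlation between each value and the one that follows it."""
    a, b = xs[:-1], xs[1:]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


rng = random.Random(0)

# Supervised-style data: every sample is drawn independently.
iid_samples = [rng.gauss(0.0, 1.0) for _ in range(2000)]

# RL-style data: each state is the previous state plus a random move.
state, trajectory = 0.0, []
for _ in range(2000):
    state += rng.gauss(0.0, 1.0)
    trajectory.append(state)

iid_corr = lag1_correlation(iid_samples)
walk_corr = lag1_correlation(trajectory)
print(round(iid_corr, 2), round(walk_corr, 2))
```

The independent samples show a lag-one correlation near zero, while consecutive states of the walk are almost perfectly correlated. Training methods that assume independent samples have to account for this when their data comes from a trajectory.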

What to keep, and what comes next

The agent acts, the environment responds with a reward and a new state, and this loop repeats — that is reinforcement learning in one sentence.
The goal is to maximise long-run return, not immediate reward; a discount factor weighs the future against the present.
Every step the agent must balance exploiting what it knows against exploring what it doesn't.
Reward is evaluative, not instructive; the agent makes its own data; and consecutive experiences are correlated. None of this is true of supervised learning.

One honest caveat before you climb on. RL is powerful but notoriously finicky: it can need enormous amounts of trial and error, and the celebrated results — mastering Go, controlling plasma in a reactor — lean on millions or billions of practice episodes, usually in a simulator where failure is cheap. It is not a general recipe for autonomy, and an agent optimising a reward is not pursuing goals in any human sense. Keep that grounded picture as you go.

From here the rung gets concrete. Next we turn the fuzzy idea of "how good is this state" into a precise value function, meet the Q-function that scores state–action pairs, and use the Bellman equation to tie a value to the values that follow it. Those tools unlock Q-learning, which, fused with deep networks, became the deep Q-network behind the game-playing breakthroughs, and they underpin the policy methods that follow. You now have the loop in your head; everything else is learning to close it well.