Deep RL & Self-Play

From tables to networks

In the earlier rungs you built reinforcement learning on a foundation of tables: a Q-function was literally a grid with one cell per (state, action) pair, updated by Q-learning until the numbers stopped moving. That works beautifully for gridworlds and small games. It collapses the instant the state is a screen of pixels or a board with more positions than there are atoms in the universe — you cannot store a cell for every one, and you would never visit most of them twice anyway.

The fix is the same idea that powered the deep learning rungs: stop storing the answer and start *approximating* it. Replace the table with a neural network that takes a state in and outputs estimated values. The network generalizes — states it has never seen get sensible estimates because they resemble states it has. This is the whole leap of *deep RL*: the agent's value or policy is now a function the network learns, not a lookup it memorizes.
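To make the swap concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the two-entry table, the hand-picked weights, and the `features` function. A tiny linear model stands in for a real neural network. It only shows the shape of the idea: the table has nothing to say about a state it never stored, while the approximator still returns an estimate close to those of its neighbours.

```python
# Tabular Q: one stored number per (state, action); unseen states have no entry.
q_table = {((0, 0), "right"): 0.5, ((0, 1), "right"): 0.6}


def table_lookup(state, action):
    """Return the stored value, or None if this pair was never visited."""
    return q_table.get((state, action))


# Approximate Q: a tiny linear model over state features (a stand-in for a network).
# These weights are made up for illustration; a real agent would learn them.
weights = {"right": [0.5, 0.0, 0.1], "left": [0.1, 0.0, -0.1]}


def features(state):
    """Turn a (row, col) grid state into a feature vector: bias, row, column."""
    row, col = state
    return [1.0, float(row), float(col)]


def q_approx(state, action):
    """Estimate Q(state, action) as a weighted sum of the state's features."""
    return sum(w * x for w, x in zip(weights[action], features(state)))


unseen = (0, 2)  # a state that is not in the table
print(table_lookup(unseen, "right"))  # None
print(q_approx(unseen, "right"))      # an estimate in line with nearby states
```

The weights happen to reproduce the two table entries, so the estimate for the unseen state extends the same trend rather than starting from nothing.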

The Deep Q-Network

The breakthrough that made this concrete was the [[deep-q-network|Deep Q-Network]] (DQN), which in 2013–2015 learned to play dozens of Atari games straight from raw pixels, using one architecture and one set of hyperparameters across all of them. A convolutional network read the screen; its outputs were the estimated Q-values for each joystick action; the agent picked actions with an epsilon-greedy rule to keep exploring. The training target was still the Bellman idea you already know — make Q(state, action) match the reward plus the discounted best next value.
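The two rules in that paragraph, epsilon-greedy action choice and the Bellman target, fit in a few lines. This is a sketch, not DQN's actual code: the names `epsilon`, `gamma` (the discount factor), `done` (an end-of-episode flag), and `rng` are assumptions introduced here, and plain lists of floats stand in for the network's outputs.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a random action, else the highest-valued one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


def bellman_target(reward, next_q_values, gamma, done):
    """Reward plus the discounted best next value; no bootstrapping at episode end."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


# Example: three joystick actions with estimated Q-values.
action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.1)
target = bellman_target(reward=1.0, next_q_values=[0.5, 2.0], gamma=0.9, done=False)
```

Training then nudges the network's estimate of Q(state, action) toward `target`, typically by minimizing a squared or Huber loss between the two.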

Naively bolting a network onto Q-learning is unstable: the network's own shifting estimates form the target it chases, so it can spiral. DQN's two famous tricks tamed this. *Experience replay* stores past transitions in a buffer and trains on random minibatches, breaking the correlations in a sequential stream. A separate, slowly-updated *target network* supplies the next-state value, so the target stops wobbling every step. Both are engineering, not magic — they make the moving target sit still long enough to learn.
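Both tricks are small pieces of bookkeeping, which the following sketch tries to show. It assumes a fixed-capacity buffer and a hard copy of the weights every `sync_every` steps; those names, the `ReplayBuffer` class, and the use of plain dicts as stand-ins for network weights are all illustrative choices, not DQN's real implementation.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of past transitions, sampled uniformly at random."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest transition when full.
        self.transitions = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.transitions.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Random minibatches break up the correlation between consecutive steps.
        return rng.sample(list(self.transitions), batch_size)


def maybe_sync_target(online_params, target_params, step, sync_every):
    """Copy the online weights into the target only every `sync_every` steps."""
    if step % sync_every == 0:
        target_params.clear()
        target_params.update(online_params)
    return target_params


buffer = ReplayBuffer(capacity=1000)
for t in range(50):
    buffer.add(state=t, action=0, reward=0.0, next_state=t + 1, done=False)
minibatch = buffer.sample(batch_size=8)
```

Between syncs the target network is frozen, so the value the online network is chasing holds still for `sync_every` updates at a time.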

AlphaGo and the engine of self-play

Where DQN reacted move-by-move, [[alphago|AlphaGo]] had to plan in a game whose branching factor had defeated the search-driven approach that won at chess for Deep Blue. Its design fused two pieces you have already met: deep networks that estimate a value (who is winning?) and a policy (which moves are worth considering?), feeding a tree search so it only explored promising lines. The networks turned a hopeless search into a focused one.
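One way to picture how the two networks steer the search is the move-selection rule applied at each node of the tree. The sketch below follows the general style of rule used in AlphaGo-family systems, often called PUCT, but treat the details as assumptions: the constant `c_puct`, the `+ 1` under the square root, and the dictionary layout are choices made here for illustration, not the published implementation.

```python
import math


def select_move(children, c_puct=1.5):
    """Pick the child with the best mix of estimated value and policy-guided exploration.

    children maps move -> {"prior": P, "visits": N, "value_sum": W}, where the
    prior comes from the policy network and value_sum accumulates value estimates.
    """
    total_visits = sum(child["visits"] for child in children.values())

    def score(child):
        # Mean value so far: "who is winning?" along this line.
        mean_value = child["value_sum"] / child["visits"] if child["visits"] else 0.0
        # Exploration bonus: large for moves the policy likes and search has neglected.
        bonus = c_puct * child["prior"] * math.sqrt(total_visits + 1) / (1 + child["visits"])
        return mean_value + bonus

    return max(children, key=lambda move: score(children[move]))


children = {
    "a": {"prior": 0.6, "visits": 10, "value_sum": -5.0},  # favoured, but going badly
    "b": {"prior": 0.3, "visits": 0, "value_sum": 0.0},    # plausible and unexplored
    "c": {"prior": 0.1, "visits": 0, "value_sum": 0.0},    # policy thinks it is poor
}
print(select_move(children))
```

Moves with a low prior rarely get visited at all, which is how the policy narrows the tree, while the running mean value pulls the search away from lines that keep evaluating badly.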

The most important idea, though, is [[self-play|self-play]]. The first AlphaGo bootstrapped from human expert games, but its successor, AlphaGo Zero, started from random play and learned by playing *against copies of itself*, millions of times. Each game is free training data with a built-in opponent of exactly matched strength, so the difficulty ramps automatically as the agent improves — a natural curriculum that no human could hand-design. From only the rules, with no human games at all, it surpassed every prior version.
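The loop itself is simple enough to sketch on a toy game. What follows is not AlphaGo Zero's algorithm, which trains deep networks with tree search; it is a tabular stand-in on a small Nim-like game (take one or two stones, whoever takes the last stone wins), chosen only to show the structure. The game, the function names, and the hyperparameters `epsilon` and `lr` are all assumptions made for this example.

```python
import random


def legal_moves(stones):
    """You may take one or two stones, but never more than remain."""
    return [m for m in (1, 2) if m <= stones]


def self_play_game(q, start_stones, epsilon, rng):
    """Play one game in which both sides choose moves from the same table q."""
    stones, player, history = start_stones, 0, []
    while stones > 0:
        moves = legal_moves(stones)
        if rng.random() < epsilon:
            move = rng.choice(moves)  # occasional exploration
        else:
            move = max(moves, key=lambda m: q.get((stones, m), 0.0))
        history.append((player, stones, move))
        stones -= move
        player = 1 - player
    winner = 1 - player  # the player who just moved took the last stone
    return history, winner


def train(games=5000, start_stones=7, epsilon=0.2, lr=0.1, seed=0):
    """Improve q using nothing but the outcomes of games against itself."""
    rng = random.Random(seed)
    q = {}
    for _ in range(games):
        history, winner = self_play_game(q, start_stones, epsilon, rng)
        for player, stones, move in history:
            outcome = 1.0 if player == winner else -1.0
            old = q.get((stones, move), 0.0)
            q[(stones, move)] = old + lr * (outcome - old)  # move toward the result
    return q


q = train()
```

Because both sides read from the same table, every improvement the agent makes is immediately an improvement in its opponent, which is the automatic curriculum the paragraph above describes.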

Where deep RL struggles

The first and harshest limit is sample efficiency. Because the agent learns from a scalar reward rather than a labeled target, it must try things, mostly fail, and slowly correlate actions with delayed outcomes. That can take millions to billions of trials. In a simulator that is fine; in the physical world, where each trial is a real robot arm or a real dollar, it is often prohibitive. A child picks up a video game in far fewer attempts than DQN needs, which is one reason claims that deep RL is 'how humans learn' overstate things.

The second is reward design. The agent optimizes exactly the number you wrote down, not the intention behind it. Specify it loosely and you get reward hacking: a boat-racing agent that endlessly circles to collect bonus pellets instead of finishing the race, because circling scored higher. Reward shaping — adding hints to guide early learning — helps, but each hint is a new chance to encode the wrong goal. Getting a reward exactly right is genuinely hard, and it is the same difficulty that sits under the alignment problem for larger systems.

def intended_reward(race_finished):       # what you meant
    return 1 if race_finished else 0
def proxy_reward(pellets_collected):      # what it optimizes
    return pellets_collected              # +1 per pellet collected
# agent: spin in circles forever, never finish, score = huge

A toy of the classic reward-hacking trap: the proxy reward and the true goal quietly diverge.

Deep RL is also brittle and high-variance: small changes to hyperparameters, random seeds, or the reward can swing results from superhuman to useless, and published gains can be hard to reproduce. Practitioners answer with several families of methods: policy-gradient and actor-critic approaches like PPO, model-based methods that learn a model of the environment to cut real trials, and imitation learning to warm-start from human demonstrations.

What it really means — and what it doesn't

It is tempting to read AlphaGo as a step toward general intelligence or even superintelligence. Be precise: AlphaGo is a stunning *specialist*. It cannot be asked to play chess, let alone fold laundry. Its successor AlphaZero did master Go, chess, and shogi, but it was trained from scratch for each game, and StarCraft required a separate system, AlphaStar; in no case was one trained mind doing it all. The generality lived in the *method* (networks plus search plus self-play), not in any single agent. Mastery of one closed game is not a sign that open-ended human-level competence is near.

Still, the lesson echoes loudly. Self-play is an instance of *the bitter lesson*: general methods that scale with computation tend to beat methods hand-engineered with human knowledge. And the value-and-policy machinery did not stay in the arcade: reinforcement learning from human feedback (RLHF), which fine-tunes a model against a reward learned from human preferences, is now a standard way to shape large language models to be helpful. Deep RL is a sharp, powerful tool with a narrow blade. Knowing exactly where the blade cuts (fast simulators, crisp rewards, deep pockets of compute) is the difference between using it well and being dazzled by the demo.