Policy Gradients & Actor-Critic

Two roads to a good agent

By now you have met the value road. In Q-learning you learned a Q-function — a number for every state-action pair saying how good it is — and then acted by picking the highest-scoring action. The policy, the agent's rule for choosing actions, was never trained on its own; it fell out of the values as a kind of greedy by-product. That works beautifully for small, discrete action sets like the moves in a grid world.

But picture a robot arm that must choose a torque — any real number — for each of seven joints. Now "take the action with the highest Q-value" means searching over a continuous, seven-dimensional space at every single step. That search is awkward, slow, and often impossible to do exactly. The value road quietly assumed you could always find the best action cheaply once you knew its score. When that assumption breaks, we want a different idea entirely.

The other road is to learn the policy directly. Instead of scoring every action and reading off a winner, we make the policy itself a function with adjustable knobs — usually a neural network that takes a state and spits out a probability for each action (or, for continuous actions, the mean and spread of a distribution to sample from). Then we ask the question that runs through all of this rung: which way should I nudge those knobs to collect more reward?
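To make "a function with adjustable knobs" concrete, here is a minimal sketch in plain Python. It assumes a tiny linear policy instead of a neural network; the names `policy_probs`, `sample_action`, `sample_gaussian_action`, and the weights in `theta` are hypothetical, chosen only for illustration.

```python
import math
import random

def softmax(logits):
    """Turn arbitrary real-valued scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_probs(theta, state):
    """Hypothetical linear policy: theta[a][i] links state feature i to action a."""
    logits = [sum(w * s for w, s in zip(weights, state)) for weights in theta]
    return softmax(logits)

def sample_action(probs, rng=random):
    """Discrete case: draw an action index according to the policy's probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def sample_gaussian_action(means, stds, rng=random):
    """Continuous case: draw one real number per joint from a mean and a spread."""
    return [rng.gauss(m, s) for m, s in zip(means, stds)]

theta = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.0]]  # the "knobs": 3 actions, 2 features
state = [1.0, 2.0]
probs = policy_probs(theta, state)   # one probability per action
action = sample_action(probs)        # stochastic choice, not an argmax
```

Changing any entry of `theta` shifts the probabilities smoothly, which is what makes it possible to ask which direction of change would earn more reward.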

The policy gradient: louder when it pays off

Here is the core trick of the [[policy-gradient|policy gradient]], and it is wonderfully intuitive. Let the agent run a whole episode using its current policy, sampling actions from its own probabilities. When the episode ends, look at the total return it earned. Then go back and adjust the network so that the actions taken in a good episode become *more* likely, and the actions taken in a bad episode become *less* likely. Good runs get reinforced; bad runs get suppressed. Do this over thousands of episodes and the behavior drifts toward whatever earns reward.

Mechanically this is the same gradient machinery behind every neural network in earlier rungs, except that we climb rather than descend: gradient ascent on expected return, usually implemented as gradient descent on its negative. The loss is built so that its gradient pushes up the log-probability of each action, scaled by how much return followed it. Big return after an action means a big push to repeat it; negative return means a push the other way. Because we follow actual sampled trajectories, this is a deeply stochastic process: every update is a noisy estimate from one batch of lived experience.

for each episode:
  run policy, record (state, action, reward) at each step
  R = total discounted return of the episode
  for each step t:
    θ ← θ + α · R · ∇θ log P(action_t | state_t)   # α: learning rate
# large R → a large push toward the episode's actions; negative R → a push away

REINFORCE, the simplest policy gradient: weight each action's push by the return of the episode it belonged to.
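The pseudocode can be turned into a small runnable sketch. Everything about the task here is an assumption made for illustration: a three-step episode with no state, where action 1 pays a reward of 1 and action 0 pays nothing, and a policy that is just two logits passed through a softmax. The learning rate, discount, and seed are arbitrary choices.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def discounted_return(rewards, gamma):
    """Total discounted return of one episode: r_0 + gamma*r_1 + gamma^2*r_2 + ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def run_episode(theta, rng, steps=3):
    """Toy task (assumed): action 1 pays 1, action 0 pays 0, no state at all."""
    probs = softmax(theta)
    actions = [rng.choices([0, 1], weights=probs, k=1)[0] for _ in range(steps)]
    rewards = [float(a) for a in actions]
    return actions, rewards

def reinforce(episodes=2000, alpha=0.1, gamma=0.9, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]  # start with a 50/50 policy
    for _ in range(episodes):
        actions, rewards = run_episode(theta, rng)
        R = discounted_return(rewards, gamma)
        probs = softmax(theta)  # the policy that generated this episode
        grad = [0.0, 0.0]
        for a in actions:
            for k in range(len(theta)):
                # Gradient of log softmax: 1 for the chosen action, minus its probability.
                grad[k] += (1.0 if k == a else 0.0) - probs[k]
        for k in range(len(theta)):
            theta[k] += alpha * R * grad[k]  # ascent step, scaled by the episode return
    return softmax(theta)

final_probs = reinforce()  # probability of the paying action should end up high
```

Every action in an episode shares the same multiplier `R`, which is exactly the blunt, whole-episode credit assignment this method is known for.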

Two honest weaknesses come straight out of this design. First, you usually need to finish an episode (or a long chunk) before you can update, so it is sample-hungry and slow. Second, the return of a whole episode is a blunt verdict: if you played thirty moves and won, this naive method credits *all thirty* equally, even the three terrible ones that the good moves rescued. That coarse credit makes the gradient estimate extremely noisy — high variance, in the jargon. Taming that noise is exactly what the next idea is for.

Actor-critic: two networks, one team

The fix for high variance is to stop judging actions by the raw episode return and start judging them against an expectation. We keep the policy network — call it the actor, because it chooses what to do. Then we add a second network, a critic, that learns a value function: an estimate of how much return we typically expect from a given state. This is the same value idea from the Q-learning guide, now playing a supporting role rather than running the show.

Now the actor's update gets sharper. Instead of asking "was the whole episode good?", it asks "did this action do *better than the critic expected* from here?" That difference, actual outcome minus the critic's prediction, is the advantage. An action with positive advantage beat expectations and gets reinforced; negative advantage means it underperformed and gets pushed down. Subtracting the critic's estimate acts as a baseline: because the baseline depends only on the state, not on which action was sampled, it cancels out much of the noise without changing the expected direction of the update, so the gradient steadies dramatically.
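As a rough illustration of "actual outcome minus the critic's prediction", the sketch below computes the return that followed each step and subtracts a value estimate for that step. The critic's numbers in `values` are made up for the example; in practice they would come from the critic network.

```python
def discounted_returns_to_go(rewards, gamma):
    """G_t = r_t + gamma * r_{t+1} + ... for every step t, computed back to front."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def advantages(rewards, values, gamma):
    """Advantage at each step: return actually obtained from here minus the critic's guess."""
    returns = discounted_returns_to_go(rewards, gamma)
    return [g - v for g, v in zip(returns, values)]

rewards = [0.0, 0.0, 1.0]      # the episode only pays off at the end
values = [0.5, 1.2, 0.4]       # hypothetical critic estimates V(s_t)
adv = advantages(rewards, values, gamma=1.0)
# adv is approximately [0.5, -0.2, 0.6]:
# steps 0 and 2 beat the critic's expectation, step 1 fell short of it.
```

Note that all three steps were followed by the same return of 1, yet they receive different advantages. That is the finer-grained credit the critic buys.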

One more honest detail: actor-critic is typically [[on-policy-off-policy|on-policy]]. The critic is judging the actor's *current* behavior, so once the actor changes, old experience is stale and usually gets thrown away. That is the opposite of the deep Q-network you may have met, which is off-policy and can replay old memories for ages. On-policy learning is more stable to reason about but thirstier for fresh data — a trade you accept to gain a directly trainable policy.

PPO: don't take steps you'll regret

Policy gradients have a nasty failure mode. Because each update is estimated from noisy, sampled experience, one over-eager step can shove the policy somewhere terrible. And unlike supervised learning, you cannot just step back — the agent now collects its *next* batch of data using the broken policy, so a single bad update can poison everything that follows. The agent can collapse and never recover. The whole art of stable policy learning is taking steps big enough to make progress but small enough to never fall off a cliff.

[[proximal-policy-optimization|Proximal Policy Optimization]], or PPO, is the pragmatic answer that won the field. It is an actor-critic method with one disciplined rule: never let a single update change the policy too much. PPO looks at the ratio between the new action probabilities and the old ones, and *clips* that ratio: if an update tries to make some action far more or far less likely than before, the clipped objective stops paying out any further gain for doing so, and the incentive to keep pushing disappears. The policy is free to improve, but only within a small region close to where it already was, a cheap approximation of a trust region. Stay proximal; that is the whole name.
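The clipping rule fits in a few lines. This is a sketch of the per-sample clipped objective, assuming a clip width `epsilon` of 0.2 (a commonly used value, not something this guide prescribes); the function name is hypothetical.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-sample PPO-style objective.

    ratio     = new probability of the action / old probability of the action
    advantage = how much better than expected the action turned out
    Takes the more pessimistic of the unclipped and clipped terms.
    """
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)

# Good action, policy already made it 50% more likely: the gain is capped at 1.2.
capped_up = clipped_surrogate(1.5, +1.0)
# Bad action, policy already made it half as likely: the gain is capped at -0.8.
capped_down = clipped_surrogate(0.5, -1.0)
# Bad action that became MORE likely: no cap, the full penalty of -1.5 applies.
uncapped = clipped_surrogate(1.5, -1.0)
```

The asymmetry is deliberate: moves that would over-reward the policy are flattened, while moves in the wrong direction are still penalized in full.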

1. Run the current policy to collect a batch of experience, recording states, actions, and rewards.
2. Have the critic estimate the value of each state and compute the advantage of each action taken.
3. Update the actor toward higher-advantage actions, but clip the probability ratio so no step strays too far.
4. Update the critic to predict returns more accurately; discard the now-stale batch and loop.
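The loop above can be sketched end to end on a deliberately tiny problem. Everything specific here is an assumption for illustration: a single-state task with two actions that pay out with probability 0.2 and 0.8, a two-logit softmax actor, a critic that is one number, and arbitrary hyperparameters. It is not a full PPO implementation (no neural networks, no multi-step returns, no minibatching).

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ppo_bandit(iterations=200, batch_size=64, epochs=4,
               alpha=0.05, epsilon=0.2, value_lr=0.1, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]    # actor: two logits
    value = 0.0           # critic: one number, because the toy task has one state
    payout = [0.2, 0.8]   # hypothetical chance that each action pays reward 1
    for _ in range(iterations):
        # Step 1: collect a batch with the current policy and freeze its probabilities.
        old_probs = softmax(theta)
        batch = []
        for _sample in range(batch_size):
            a = rng.choices([0, 1], weights=old_probs, k=1)[0]
            r = 1.0 if rng.random() < payout[a] else 0.0
            batch.append((a, r))
        # Step 2: advantage = observed reward minus the critic's estimate.
        scored = [(a, r - value) for a, r in batch]
        # Step 3: a few passes of gradient ascent on the clipped objective.
        for _epoch in range(epochs):
            probs = softmax(theta)
            grad = [0.0, 0.0]
            for a, adv in scored:
                ratio = probs[a] / old_probs[a]
                clipped = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
                # Only the unclipped term has a gradient; skip samples where
                # the clipped term is the smaller (active) one.
                if ratio * adv <= clipped * adv:
                    for k in range(2):
                        indicator = 1.0 if k == a else 0.0
                        grad[k] += adv * ratio * (indicator - probs[k])
            for k in range(2):
                theta[k] += alpha * grad[k] / batch_size
        # Step 4: move the critic toward the returns just observed, then drop the batch.
        mean_reward = sum(r for _, r in batch) / batch_size
        value += value_lr * (mean_reward - value)
    return softmax(theta), value

final_probs, final_value = ppo_bandit()
```

In this sketch the batch is reused for a few passes before being discarded; the clip is what keeps that reuse from dragging the policy far from the one that gathered the data.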

Where this shows up — and what it isn't

Policy methods are why reinforcement learning reaches beyond board games into messy continuous control: robot locomotion, drone flight, dexterous hands. Game-playing systems in the AlphaGo lineage combine policy networks with search and self-play, and the original AlphaGo refined its policy network with policy-gradient updates. You will also recognize PPO in a place you might not expect: it has been a workhorse optimizer inside RLHF, a technique used to align large language models to human preferences. The "policy" there is the language model itself, and the "reward" is a learned model of what people prefer.

Stay honest about the limits. None of this is the agent "understanding" its task; it is gradient descent on a reward signal, and it will chase that signal with zero common sense. If the reward is even slightly misspecified, a policy learner will happily exploit the gap — racking up points in ways you never intended. That failure has a name you will meet again, reward hacking, and it is the practical, concrete root of most talk about RL "alignment." The problem is not a scheming intelligence; it is a literal optimizer doing exactly what you measured rather than what you meant.

Step back and the map is clean. Value methods learn what is good and act greedily; policy methods learn how to act directly; actor-critic fuses the two; and PPO makes the fusion stable enough to ship. With this, you have the conceptual core that powers nearly all of deep RL — the final guide in this rung can now show these pieces assembled into the systems that beat humans at their own games.