From a curve to a landscape
A one-variable function y = f(x) draws a curve: one input, one output. Now let two inputs in — say z = f(x, y) — and the graph lifts off the page into a surface, like a rolling hill suspended over the flat (x, y) floor. The height z is how high the ground stands above the point (x, y). Mountains, valleys, ridges, saddles: a function of two variables is literally a landscape.
On a curve there was just one rate of change: the slope. But standing on a hillside, 'the slope' is ambiguous — slope in which direction? Walk east and the ground may rise steeply; turn and walk north and it might be dead flat, or even fall away. The single number from before splits into a whole fan of slopes, one for every direction you could step. To get a grip on this, we start by measuring just two of them.
Partial derivatives: freeze everything but one knob
The trick is beautifully simple: to measure the slope along the x-direction, hold y completely still and let only x vary. With y frozen to a constant, f(x, y) becomes an ordinary one-variable function of x, and you already know how to differentiate that. The result is the partial derivative of f with respect to x, written ∂f/∂x with a curly d (∂) in place of the ordinary d, and spelled out in the example below as partial f / partial x. It answers: as I take one small step east while keeping my north-south position fixed, how fast does the height change?
f(x, y) = x^2 + 3*x*y + y^2
partial f / partial x : treat y as a constant
-> 2x + 3y (the y^2 term has no x, so it vanishes)
partial f / partial y : treat x as a constant
-> 3x + 2y (the x^2 term has no y, so it vanishes)
Geometrically, partial f / partial x is the slope of the surface measured along a slice cut parallel to the x-axis: you sliced the hill with a vertical plane and read the slope of the curve exposed on the cut. There is nothing new to learn about the differentiation itself: every derivative rule you have practiced (power rule, product rule, chain rule) applies unchanged. The only discipline is remembering which letters are 'live' and which are temporarily frozen.
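One way to convince yourself of the two partials above is to check them numerically. The sketch below is only an illustration: the helper names partial_x and partial_y, the step size h, and the test point (1, 2) are assumptions chosen for this example, not part of the lesson. It nudges one input while holding the other fixed and compares the measured slope with the formulas 2x + 3y and 3x + 2y.

```python
def f(x, y):
    """The example surface from the text: f(x, y) = x^2 + 3xy + y^2."""
    return x**2 + 3 * x * y + y**2


def partial_x(func, x, y, h=1e-6):
    """Estimate the slope in the x-direction: vary x only, keep y frozen."""
    return (func(x + h, y) - func(x - h, y)) / (2 * h)


def partial_y(func, x, y, h=1e-6):
    """Estimate the slope in the y-direction: vary y only, keep x frozen."""
    return (func(x, y + h) - func(x, y - h)) / (2 * h)


# Hypothetical test point, chosen only for illustration.
x0, y0 = 1.0, 2.0

numeric_fx = partial_x(f, x0, y0)
numeric_fy = partial_y(f, x0, y0)

# The hand-derived formulas from the worked example.
exact_fx = 2 * x0 + 3 * y0
exact_fy = 3 * x0 + 2 * y0

print("df/dx: numeric", numeric_fx, "formula", exact_fx)
print("df/dy: numeric", numeric_fy, "formula", exact_fy)
```

The numeric estimates should agree with the formulas up to a tiny rounding error, which is the "freeze one knob" idea carried out literally.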
The gradient: bundle the slopes into one arrow
You now have two numbers at every point: the east-west slope and the north-south slope. The gradient simply collects them into a single vector, written grad f or, with the upside-down triangle, ∇f: grad f = ( partial f / partial x , partial f / partial y ). It is not just bookkeeping. That little arrow carries a remarkable piece of information about the landscape.
Here is the punchline: at any point, the gradient arrow points in the direction of steepest ascent — the single compass bearing in which the surface climbs fastest — and its length tells you how steep that steepest climb is. Step exactly along the gradient and you gain height most quickly; step exactly opposite it and you descend fastest; step at a right angle to it and you stay level, which is why the gradient is always perpendicular to the contour lines from earlier. One arrow encodes both the best direction and the rate.
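The steepest-ascent claim can be tested by brute force. The sketch below is a rough check under stated assumptions: it reuses the earlier example f(x, y) = x^2 + 3xy + y^2, picks the point (1, 2) arbitrarily, and uses a hypothetical helper slope_along that measures the slope in a given compass direction with a small finite-difference step. It scans directions in 0.1-degree increments and compares the steepest one found with the gradient's direction and length.

```python
import math


def f(x, y):
    """Example surface: f(x, y) = x^2 + 3xy + y^2."""
    return x**2 + 3 * x * y + y**2


def slope_along(func, x, y, angle, h=1e-6):
    """Measured slope when stepping in the unit direction (cos angle, sin angle)."""
    dx, dy = math.cos(angle), math.sin(angle)
    return (func(x + h * dx, y + h * dy) - func(x - h * dx, y - h * dy)) / (2 * h)


# Arbitrary point chosen for illustration.
x0, y0 = 1.0, 2.0

# Gradient from the hand-derived partials: (2x + 3y, 3x + 2y).
gx, gy = 2 * x0 + 3 * y0, 3 * x0 + 2 * y0
grad_angle = math.atan2(gy, gx)   # compass bearing of the gradient
grad_length = math.hypot(gx, gy)  # claimed steepness of the steepest climb

# Try 3600 directions around the compass and keep the steepest one.
angles = [math.radians(0.1 * k) for k in range(3600)]
best_angle = max(angles, key=lambda a: slope_along(f, x0, y0, a))
best_slope = slope_along(f, x0, y0, best_angle)

# Slope at a right angle to the gradient (should be essentially zero).
level_slope = slope_along(f, x0, y0, grad_angle + math.pi / 2)

print("steepest direction found:", math.degrees(best_angle), "degrees")
print("gradient direction:      ", math.degrees(grad_angle), "degrees")
print("steepest slope found:", best_slope, "gradient length:", grad_length)
print("slope perpendicular to gradient:", level_slope)
```

Within the resolution of the scan, the steepest direction matches the gradient's bearing, the steepest slope matches the gradient's length, and the slope at a right angle to the gradient is essentially zero.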
f(x, y) = x^2 + y^2 (a bowl, lowest at the origin)
grad f = ( 2x , 2y ) (the two partials, bundled)
at (3, 4): grad f = (6, 8)
direction = straight away from the center (steepest uphill)
length = sqrt(6^2 + 8^2) = sqrt(100) = 10 (the steepness there)
Why machines love the gradient
Most of the time we want to go down, not up: find the lowest point of a cost, an error, a loss. Since the gradient points to the steepest uphill, its opposite points to the steepest downhill. So a blindfolded hiker can find a valley by a stupidly simple rule — feel which way is steepest down (the negative gradient), take a small step that way, and repeat. This is gradient descent, and it is the workhorse behind training nearly every modern machine-learning model.
- Start anywhere on the landscape — pick a guess for the inputs (in machine learning, the model's initial weights).
- Compute the gradient there — the partials of the loss with respect to every input. This is the uphill direction.
- Step against it: new position = old position - (step size) * gradient. The small step size (the 'learning rate') keeps you from overshooting.
- Repeat until the gradient shrinks toward zero — flat ground, meaning you have settled into a valley (a minimum, at least a local one).
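The four steps above can be sketched in a few lines. This is a minimal illustration, not a production optimizer: it uses the bowl f(x, y) = x^2 + y^2 from the earlier example, and the function name gradient_descent, the starting point (3, 4), the learning rate 0.1, the stopping tolerance, and the step cap are all assumptions picked for the demo.

```python
import math


def grad_bowl(x, y):
    """Gradient of the bowl f(x, y) = x^2 + y^2, i.e. (2x, 2y)."""
    return (2 * x, 2 * y)


def gradient_descent(grad, start, learning_rate=0.1, tolerance=1e-6, max_steps=10_000):
    """Repeatedly step against the gradient until the ground is nearly flat.

    Returns the final position and the number of steps taken.
    """
    x, y = start
    for step in range(max_steps):
        gx, gy = grad(x, y)                   # uphill direction at the current spot
        if math.hypot(gx, gy) < tolerance:    # gradient near zero: flat ground
            return (x, y), step
        x -= learning_rate * gx               # new position = old - step size * gradient
        y -= learning_rate * gy
    return (x, y), max_steps                  # gave up before reaching flat ground


# Start at the point used in the worked example above.
minimum, steps_taken = gradient_descent(grad_bowl, start=(3.0, 4.0))
print("settled at", minimum, "after", steps_taken, "steps")
```

On this bowl each step shrinks both coordinates by the same factor, so the hiker walks straight toward the origin. Raising the learning rate past 1.0 here makes every step overshoot by more than it gains, and the position runs away instead of settling, which is the overshooting the step size is meant to prevent.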