Gradient, Directional Derivative & the Hessian

From partial derivatives to a single arrow

Volume I gave you the partial derivative: freeze every variable but one and differentiate as if you were back in single-variable calculus. For a function f(x, y) the partials d f/d x and d f/d y measure the slope of the surface as you walk due east and due north. The trouble is that east and north are just two of infinitely many directions you could walk, and a hillside does not care which way your map happens to point. The natural next question is the honest one: starting at a point, how steep is the ground in an ARBITRARY direction? Answering that pulls the two partials together into one object that points the way the function most wants to go.

Collect the partials into a vector and you have the [[gradient-steepest-ascent|gradient]], written nabla f = (d f/d x, d f/d y, ...). Here nabla is the del operator — read it as the instruction 'take all the first partials and stack them as a vector'. In Volume I you met nabla f as a list of numbers; the leap now is to take its direction and length seriously as geometry. The gradient is not just bookkeeping for the partials. It is an arrow that lives at each point of the domain, and where it points and how long it is will turn out to carry the entire first-order story of the function.

To make 'steepness in an arbitrary direction' precise, pick a unit vector u — a direction with length exactly one — and ask for the rate of change of f as you step along u. That rate is the [[calc-directional-derivative|directional derivative]], written D_u f. You can define it the way Volume I defined any derivative: as a limit, D_u f = lim as h -> 0 of (f(point + h u) - f(point)) / h. This is exactly the slope you would read off if you sliced the surface with a vertical plane running along the direction u and looked at the curve where the plane cuts the hill. Walk east and u = (1, 0) recovers d f/d x; walk north and u = (0, 1) recovers d f/d y. The partials are simply two special directional derivatives, and now we have the rest.
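The limit definition can be tried out numerically. What follows is a minimal sketch, assuming NumPy is available: the function f(x, y) = x^2 y + y^3, the helper name directional_derivative, and the step size h are all illustrative choices, not part of the text. It uses a symmetric (central) difference rather than the one-sided quotient above; for a differentiable f the two have the same limit.

```python
import numpy as np

def f(p):
    """Illustrative function f(x, y) = x^2 * y + y^3 (an assumption for this sketch)."""
    x, y = p
    return x**2 * y + y**3

def directional_derivative(func, point, u, h=1e-6):
    """Approximate D_u f at `point` with a central difference along the direction u."""
    point = np.asarray(point, dtype=float)
    u = np.asarray(u, dtype=float)
    u = u / np.linalg.norm(u)          # rescale so the direction has length one
    return (func(point + h * u) - func(point - h * u)) / (2.0 * h)

p = np.array([1.0, 2.0])

# Walking east should recover df/dx = 2xy, which is 4 at (1, 2).
east = directional_derivative(f, p, [1.0, 0.0])

# Walking north should recover df/dy = x^2 + 3y^2, which is 13 at (1, 2).
north = directional_derivative(f, p, [0.0, 1.0])

# Any other direction works the same way, for example the north-east diagonal.
diagonal = directional_derivative(f, p, [1.0, 1.0])

print(east, north, diagonal)
```

The two axis directions reproduce the partials, and the diagonal is one of the 'rest' that the partials alone do not list.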

The gradient points the steepest way uphill

Here is the central identity, and it is one of the most useful equations in all of applied mathematics: for a differentiable function the directional derivative is just the dot product of the gradient with your chosen direction, D_u f = nabla f . u. That single formula is the whole engine. Walking along the axis directions reproduces the partials; walking along any other u, you just project the gradient onto u. Computing infinitely many slopes collapses to one arrow and one dot product. (One honest caveat: this clean formula assumes f is genuinely differentiable at the point, not merely that the partials exist — a function can have both partials yet still tear or kink along some diagonal, in which case the dot-product rule fails. The earlier guide on differentiability in several variables is where that distinction lives.)
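Both halves of that paragraph, the identity and its caveat, can be checked with a few lines of NumPy. This is a sketch under stated assumptions: smooth is the illustrative function x^2 y + y^3, and kinked is the standard counterexample x^2 y / (x^2 + y^2) with value 0 at the origin, which has both partials at the origin but is not differentiable there. The names slope_along, smooth, and kinked are hypothetical.

```python
import numpy as np

def slope_along(func, point, u, h=1e-6):
    """One-sided difference quotient (f(point + h u) - f(point)) / h."""
    point = np.asarray(point, dtype=float)
    u = np.asarray(u, dtype=float)
    return (func(point + h * u) - func(point)) / h

def smooth(p):
    """Differentiable everywhere: x^2 * y + y^3."""
    x, y = p
    return x**2 * y + y**3

def kinked(p):
    """Partials exist at the origin, but the function is not differentiable there."""
    x, y = p
    if x == 0.0 and y == 0.0:
        return 0.0
    return x**2 * y / (x**2 + y**2)

u = np.array([3.0, 4.0]) / 5.0          # a unit direction that is not an axis

# Differentiable case: the dot product with the gradient predicts the slope.
p = np.array([1.0, 2.0])
grad_smooth = np.array([2.0 * p[0] * p[1], p[0]**2 + 3.0 * p[1]**2])   # (4, 13)
smooth_formula = float(grad_smooth @ u)
smooth_measured = slope_along(smooth, p, u)

# Non-differentiable case: both partials at the origin are zero,
# so the dot product predicts zero slope in every direction ...
origin = np.zeros(2)
grad_kinked = np.array([slope_along(kinked, origin, [1.0, 0.0]),
                        slope_along(kinked, origin, [0.0, 1.0])])
kinked_formula = float(grad_kinked @ u)
# ... but the slope actually measured along u is not zero.
kinked_measured = slope_along(kinked, origin, u)

print(smooth_formula, smooth_measured)
print(kinked_formula, kinked_measured)
```

For the smooth function the two numbers agree to within the finite-difference error. For the kinked one the formula predicts 0 while the measured slope is about 0.288, which is the failure the caveat warns about.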

Now squeeze geometry out of the dot product. Write nabla f . u = |nabla f| |u| cos(theta), where theta is the angle between the gradient and your direction. Since u is a unit vector, |u| = 1, so D_u f = |nabla f| cos(theta). The only thing you control is theta, and cos(theta) is largest, equal to 1, exactly when theta = 0, that is, when you walk in the SAME direction as the gradient. So wherever the gradient is nonzero it points in the direction of steepest ascent, and the steepest slope available at that point is precisely |nabla f|, the gradient's length. Turn around (theta = 180 degrees, cos = -1) and you get steepest DESCENT, with slope -|nabla f|: the fastest way down the hill. This is exactly why the workhorse algorithm of machine learning is called gradient descent: to go downhill as fast as possible, step opposite the gradient.
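The cos(theta) argument can be confirmed by brute force: sample many unit directions, compute the slope in each, and see which one wins. A small sketch, assuming NumPy and an illustrative gradient vector (4, 13), which is the gradient of the hypothetical function x^2 y + y^3 at the point (1, 2):

```python
import numpy as np

# Illustrative gradient vector; any nonzero vector would do.
grad = np.array([4.0, 13.0])

# Unit directions sampled every 0.1 degrees around the circle.
angles = np.linspace(0.0, 2.0 * np.pi, 3601)
directions = np.column_stack([np.cos(angles), np.sin(angles)])

# D_u f = grad . u for every sampled direction u.
slopes = directions @ grad

steepest_up = directions[np.argmax(slopes)]
steepest_down = directions[np.argmin(slopes)]
unit_grad = grad / np.linalg.norm(grad)

print("best direction found:", steepest_up, " unit gradient:", unit_grad)
print("largest slope found:", slopes.max(), " |grad|:", np.linalg.norm(grad))
print("worst direction found:", steepest_down)
```

Up to the sampling resolution, the winning direction is the unit gradient, the largest slope is the gradient's length, and the losing direction is the exact opposite.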

And there is a third direction worth its own line: when theta = 90 degrees, cos(theta) = 0, so D_u f = 0. Walking exactly perpendicular to the gradient, the function does not change at all to first order — you are walking along the hillside at constant height, neither climbing nor descending. That set of directions of zero change is the tangent to a contour line, which leads straight to the next idea.

Perpendicular to the level sets

A level set (a contour, an isoline, an equipotential) is the set of points where f keeps a single constant value: the contour lines on a topographic map, each one tracing a fixed elevation. Pin down one fact and a great deal of geometry follows: at every point where it is nonzero, the gradient is perpendicular to the level set through that point. The argument is the very thing we just noticed. If you move along a level set, f stays constant by definition, so its rate of change in that direction is zero, so D_u f = nabla f . u = 0 for any direction u tangent to the contour. A nonzero gradient whose dot product with every tangent direction vanishes must be orthogonal to all of them; it sticks straight out of the contour. The gradient is the contour's normal.
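To see the perpendicularity on a concrete contour, here is a short sketch assuming NumPy and an illustrative function f(x, y) = x^2 + 4y^2, whose level set f = 4 is the ellipse traced by (2 cos t, sin t). The function and the parametrization are assumptions chosen so that the tangent direction is easy to write down.

```python
import numpy as np

# Points on the contour x^2 + 4y^2 = 4, parametrized by t.
t = np.linspace(0.0, 2.0 * np.pi, 50)
points = np.column_stack([2.0 * np.cos(t), np.sin(t)])

# Tangent to the contour: the derivative of the parametrization with respect to t.
tangents = np.column_stack([-2.0 * np.sin(t), np.cos(t)])

# Gradient of f(x, y) = x^2 + 4y^2 is (2x, 8y), evaluated at each contour point.
gradients = np.column_stack([2.0 * points[:, 0], 8.0 * points[:, 1]])

# f really is constant along the curve, and gradient . tangent vanishes everywhere.
levels = points[:, 0]**2 + 4.0 * points[:, 1]**2
dots = np.sum(gradients * tangents, axis=1)

print("f along the contour:", levels.min(), levels.max())
print("largest |gradient . tangent|:", np.abs(dots).max())
```

The dot products come out as zero up to rounding, at every sampled point of the ellipse.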

This is the secret behind the whole language of contour maps, and it is worth letting it sink in. Where contour lines (drawn at equal height intervals) crowd close together the gradient is long: the ground is steep, and a small horizontal step crosses many elevation levels. Where they spread far apart the gradient is short and the ground is gentle. The gradient always crosses the contours at a right angle, which is why a marble released from rest on a hill starts rolling perpendicular to the contour lines, and why a river, seeking steepest descent, cuts across them rather than running along them. The same fact upgrades to higher dimensions verbatim: for a function f(x, y, z) the level sets are surfaces, and nabla f is the normal vector to the surface through each point. That makes the gradient the cleanest way there is to find the tangent plane to a surface defined implicitly by f(x, y, z) = constant, which is exactly what the implicit function theorem (the next guide) builds on.
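The tangent-plane recipe is short enough to sketch. Assume NumPy and an illustrative surface, the sphere F(x, y, z) = x^2 + y^2 + z^2 = 14 through the point (1, 2, 3); the names F, grad_F, and plane_residual are hypothetical. The plane through p0 with normal nabla F(p0) is the set of points p with nabla F(p0) . (p - p0) = 0, and a nearby point of the surface should miss that plane only at second order in its distance from p0.

```python
import numpy as np

def F(p):
    """Illustrative implicit surface: the sphere x^2 + y^2 + z^2 = constant."""
    x, y, z = p
    return x**2 + y**2 + z**2

def grad_F(p):
    """Gradient of F, which is (2x, 2y, 2z)."""
    return 2.0 * np.asarray(p, dtype=float)

p0 = np.array([1.0, 2.0, 3.0])      # lies on the level surface F = 14
normal = grad_F(p0)                 # normal vector to the surface at p0

def plane_residual(p):
    """normal . (p - p0); this is zero exactly on the tangent plane."""
    return float(normal @ (np.asarray(p, dtype=float) - p0))

# Build a unit direction perpendicular to the normal (so it lies in the plane).
tangent_step = np.cross(normal, [1.0, 0.0, 0.0])
tangent_step = tangent_step / np.linalg.norm(tangent_step)

# Step a small distance eps along it, then pull the point back onto the sphere.
eps = 1e-3
nearby = p0 + eps * tangent_step
nearby_on_sphere = nearby * np.sqrt(14.0) / np.linalg.norm(nearby)

print("distance moved:", eps)
print("F at nearby surface point:", F(nearby_on_sphere))
print("tangent-plane residual:", plane_residual(nearby_on_sphere))
```

The residual is of order eps^2, far smaller than the step itself, which is what 'tangent' means here.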

Second order: the Hessian and the bowl

The gradient is the first derivative; to understand a point fully you need the second. Recall the single-variable Taylor expansion from Volume I: f(a + h) is approximately f(a) + f'(a) h + (1/2) f''(a) h^2, the value plus a linear tilt plus a quadratic curve. The multivariable version keeps exactly this shape, but each derivative grows up. The constant stays a number. The first-order term becomes the gradient dotted into the step: nabla f . h. And the quadratic term needs a new creature, because in several variables the second derivative is not one number but a whole grid of them — every partial of every partial. That grid is the [[calc-hessian-matrix|Hessian matrix]] H, the matrix of second partial derivatives, H = [d^2f/dx^2, d^2f/dxdy; d^2f/dydx, d^2f/dy^2] in two variables.

Single variable     f(a + h) = f(a)  +  f'(a) h        +  (1/2) f''(a) h^2  + ...

Several variables   f(a + h) = f(a)  +  (nabla f . h)   +  (1/2) h^T H h     + ...
                               value      gradient term       Hessian term
                                          (1st derivative)    (2nd derivative)

Hessian   H = [ f_xx   f_xy ;        gradient   nabla f = ( f_x , f_y )
                f_yx   f_yy ]        ( H is symmetric:  f_xy = f_yx  when f is C^2 )

h^T H h  is a number: it measures the curvature of f felt along the step h.

The first- and second-order Taylor expansions placed side by side: the gradient replaces f', and the Hessian replaces f''. The quadratic term h^T H h is the step h fed through the Hessian to produce a single number — the curvature seen looking in direction h.

Now read the second-order Taylor expansion f(a + h) is approximately f(a) + nabla f . h + (1/2) h^T H h and notice how cleanly each piece does its job. The constant says where you are; the gradient term tilts the approximating surface (it is the tangent plane you already know from linear approximation); and the Hessian term h^T H h bends it into a bowl, a dome, or a saddle. The notation h^T H h means: take your step h, multiply by the Hessian, then dot with h again — the result is a single number, the curvature the function shows when you look along h. One subtlety worth flagging honestly: for a twice-continuously-differentiable function the mixed partials are equal, d^2f/dxdy = d^2f/dydx (Clairaut's theorem), so the Hessian is symmetric. That symmetry is not decoration — it is the reason the Hessian has the clean spectrum of curvatures we are about to use.
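A quick numerical comparison shows what the Hessian term buys. The sketch below assumes NumPy and an illustrative function f(x, y) = e^x sin(y), with its gradient and Hessian written out by hand; the base point a and the step h are arbitrary choices.

```python
import numpy as np

def f(p):
    """Illustrative function f(x, y) = exp(x) * sin(y)."""
    x, y = p
    return np.exp(x) * np.sin(y)

def gradient(p):
    """First partials (f_x, f_y)."""
    x, y = p
    return np.array([np.exp(x) * np.sin(y), np.exp(x) * np.cos(y)])

def hessian(p):
    """Second partials [[f_xx, f_xy], [f_yx, f_yy]]; symmetric, as Clairaut promises."""
    x, y = p
    ex = np.exp(x)
    return np.array([[ex * np.sin(y),  ex * np.cos(y)],
                     [ex * np.cos(y), -ex * np.sin(y)]])

a = np.array([0.0, 1.0])       # base point
h = np.array([0.1, -0.05])     # step away from it

exact = f(a + h)
linear = f(a) + gradient(a) @ h                     # value + gradient term
quadratic = linear + 0.5 * h @ hessian(a) @ h       # ... + Hessian term

linear_error = abs(exact - linear)
quadratic_error = abs(exact - quadratic)

print("error of tangent-plane approximation:", linear_error)
print("error after adding (1/2) h^T H h:   ", quadratic_error)
```

For this step the Hessian term cuts the error by a factor of roughly four; shrinking h widens the gap, since the linear error scales like |h|^2 and the quadratic one like |h|^3.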

Reading curvature: peak, pit, or pass

Here is where the Hessian earns its keep. At a stationary point (a point where nabla f = 0, the multivariable analogue of a place where f'(x) = 0) the gradient term vanishes, and the local shape is decided by the quadratic term (1/2) h^T H h, as long as that term does not itself vanish in some direction. So the question 'is this a maximum, a minimum, or neither?' becomes a question about the Hessian: is h^T H h always positive, always negative, or does it change sign as you swivel the direction h around? That property of a symmetric matrix is its definiteness. If h^T H h > 0 for every nonzero h, the Hessian is positive-definite: the surface curves upward in every direction, and the point is a bowl, a local minimum. If h^T H h < 0 for every nonzero h, the Hessian is negative-definite: the surface curves down every way, and the point is a local maximum. If h^T H h is positive in some directions and negative in others, the point is a [[saddle-point|saddle point]]: uphill one way, downhill the cross way, like a mountain pass or the seat of a saddle.
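'Swivelling the direction h around' can be done literally. The sketch below, assuming NumPy, evaluates h^T H h on unit vectors sampled around the circle for three illustrative 2-by-2 Hessians; the helper name curvature_range and the example matrices are hypothetical.

```python
import numpy as np

def curvature_range(H, samples=720):
    """Smallest and largest value of h^T H h over sampled unit vectors h."""
    angles = np.linspace(0.0, 2.0 * np.pi, samples, endpoint=False)
    hs = np.column_stack([np.cos(angles), np.sin(angles)])
    # For each row h of hs, sum over j and k of h_j * H_jk * h_k.
    values = np.einsum("ij,jk,ik->i", hs, H, hs)
    return values.min(), values.max()

bowl = np.array([[2.0, 0.0], [0.0, 6.0]])       # Hessian of x^2 + 3y^2
dome = -bowl                                    # Hessian of -(x^2 + 3y^2)
saddle = np.array([[2.0, 0.0], [0.0, -2.0]])    # Hessian of x^2 - y^2

for name, H in [("bowl", bowl), ("dome", dome), ("saddle", saddle)]:
    lo, hi = curvature_range(H)
    print(name, "curvature ranges from", lo, "to", hi)
```

The bowl's curvature stays positive in every direction, the dome's stays negative, and the saddle's changes sign. Sampling illustrates the definition but is not a proof of definiteness.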

How do you test definiteness without checking infinitely many directions h? Because the Hessian is symmetric, it has real eigenvalues, and the eigenvalues ARE the curvatures along the special perpendicular directions where the bowl is purest. The sign rule could not be cleaner: all eigenvalues positive means positive-definite (a minimum), all negative means negative-definite (a maximum), mixed signs means a saddle. For a 2-by-2 Hessian this reduces to the familiar second-derivative test: compute the determinant D = f_xx f_yy - (f_xy)^2. If D > 0 and f_xx > 0 you have a minimum, if D > 0 and f_xx < 0 a maximum, if D < 0 a saddle, and if D = 0 the test is inconclusive. That determinant is just the product of the two eigenvalues. So D < 0 means they have opposite signs, which is exactly the saddle condition in disguise, while D > 0 means they share a sign, and the sign of f_xx tells you which one.
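Both tests fit in a few lines, and running them side by side shows that they agree. A sketch assuming NumPy; the function names, the tolerance tol, and the example matrices are illustrative. np.linalg.eigvalsh is used because it is meant for symmetric matrices and returns real eigenvalues in ascending order.

```python
import numpy as np

def classify_by_eigenvalues(H, tol=1e-12):
    """Classify a stationary point from the eigenvalue signs of its symmetric Hessian."""
    eig = np.linalg.eigvalsh(H)
    if eig.min() < -tol and eig.max() > tol:
        return "saddle"
    if eig.min() > tol:
        return "minimum"
    if eig.max() < -tol:
        return "maximum"
    return "inconclusive"          # a zero eigenvalue and no sign change

def classify_2x2(H, tol=1e-12):
    """Second-derivative test with D = f_xx * f_yy - f_xy^2."""
    fxx, fxy, fyy = H[0, 0], H[0, 1], H[1, 1]
    D = fxx * fyy - fxy**2
    if D > tol:
        return "minimum" if fxx > 0 else "maximum"
    if D < -tol:
        return "saddle"
    return "inconclusive"

examples = {
    "bowl":   np.array([[2.0, 1.0], [1.0, 3.0]]),
    "dome":   np.array([[-2.0, 1.0], [1.0, -3.0]]),
    "pass":   np.array([[1.0, 3.0], [3.0, 1.0]]),
    "trough": np.array([[1.0, 1.0], [1.0, 1.0]]),    # D = 0
}

for name, H in examples.items():
    print(name, classify_by_eigenvalues(H), classify_2x2(H))
```

The eigenvalue version works in any number of variables; the determinant shortcut is specific to the 2-by-2 case.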

Putting the three together

Step back and see how tightly these three objects interlock, because that coherence is the real prize of the chapter. The gradient is the function's first derivative — one arrow that encodes every directional slope through a dot product, points the steepest way up, and stands perpendicular to the level sets. The Hessian is the function's second derivative — one symmetric matrix that encodes every directional curvature through h^T H h and, at a stationary point, decides peak versus pit versus pass. The directional derivative is the bridge: it is how you read a single number — a slope, or with the Hessian a curvature — out of these whole-derivative objects in any direction you please. First derivative, second derivative, and the recipe for sampling them: the entire local geometry of a multivariable function in three companion ideas.

1. Compute the gradient nabla f, the vector of first partials. Its direction is steepest ascent, its length |nabla f| is the steepest slope, and it is normal to the level set through the point.
2. For the slope in any chosen unit direction u, take the dot product D_u f = nabla f . u; it equals |nabla f| cos(theta), maximal along the gradient and zero across it.
3. To classify a stationary point (where nabla f = 0), form the Hessian H of second partials and test its definiteness via the eigenvalue signs, or for 2-by-2 via D = f_xx f_yy - (f_xy)^2.
4. Read the answer: all-positive curvature is a minimum, all-negative a maximum, mixed signs a saddle; a zero eigenvalue makes the test inconclusive, so look to higher-order terms.
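The four steps above can be strung together in code. This is a sketch, not a definitive implementation: it assumes NumPy, uses an illustrative function f(x, y) = x^3 - 3x + y^2 whose stationary points (1, 0) and (-1, 0) are found by hand, and estimates the gradient and Hessian by central differences instead of symbolic partials. All helper names and step sizes are hypothetical.

```python
import numpy as np

def f(p):
    """Illustrative function with stationary points at (1, 0) and (-1, 0)."""
    x, y = p
    return x**3 - 3.0 * x + y**2

def numerical_gradient(func, p, h=1e-5):
    """Central-difference estimate of the vector of first partials."""
    p = np.asarray(p, dtype=float)
    grad = np.zeros(p.size)
    for i in range(p.size):
        e = np.zeros(p.size)
        e[i] = h
        grad[i] = (func(p + e) - func(p - e)) / (2.0 * h)
    return grad

def numerical_hessian(func, p, h=1e-4):
    """Central-difference estimate of the matrix of second partials."""
    p = np.asarray(p, dtype=float)
    n = p.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n)
            ej = np.zeros(n)
            ei[i] = h
            ej[j] = h
            H[i, j] = (func(p + ei + ej) - func(p + ei - ej)
                       - func(p - ei + ej) + func(p - ei - ej)) / (4.0 * h * h)
    return 0.5 * (H + H.T)          # symmetrize away rounding noise

def classify(H, tol=1e-6):
    """Steps 3 and 4: read the eigenvalue signs."""
    eig = np.linalg.eigvalsh(H)
    if eig.min() < -tol and eig.max() > tol:
        return "saddle"
    if eig.min() > tol:
        return "minimum"
    if eig.max() < -tol:
        return "maximum"
    return "inconclusive"

# Steps 1 and 2 at an ordinary point: gradient, then the slope walking east.
slope_east = numerical_gradient(f, [0.5, 1.0]) @ np.array([1.0, 0.0])

# Steps 3 and 4 at each candidate, after confirming the gradient vanishes there.
candidates = [(1.0, 0.0), (-1.0, 0.0)]
report = {}
for c in candidates:
    if np.linalg.norm(numerical_gradient(f, c)) < 1e-6:
        report[c] = classify(numerical_hessian(f, c))

print(slope_east)
print(report)
```

Here (1, 0) comes out as a minimum and (-1, 0) as a saddle, matching the hand computation H = [6x, 0; 0, 2]. Finite differences are a convenience for the sketch; exact or automatic derivatives are preferable when available.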

This trio is not the end of the story but its launching pad. The very next guide takes the gradient-as-normal idea and the local linear picture and shows when you can solve f(x, y) = constant locally for one variable in terms of the other: the implicit and inverse function theorems. The rung after that, multivariable optimization, lives entirely here: setting nabla f = 0 to find candidates, reading the Hessian to sort them, and following -nabla f downhill in gradient descent and steepest descent to actually reach a minimum. Everything from training a neural network to shaping a wing to fitting data by least squares is, underneath, a machine that computes gradients and inspects Hessians. You have just built that machine's two gears and the shaft that connects them.
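As a small preview of 'following -nabla f downhill', here is a minimal gradient descent sketch. It assumes NumPy, an illustrative bowl f(x, y) = (x - 1)^2 + 2(y + 2)^2 with its minimum at (1, -2), and a fixed step size of 0.1 chosen by hand. Practical optimizers pick the step more carefully, so treat this as an illustration of the idea only.

```python
import numpy as np

def f(p):
    """Illustrative bowl with its minimum at (1, -2)."""
    x, y = p
    return (x - 1.0)**2 + 2.0 * (y + 2.0)**2

def grad_f(p):
    """Gradient of f, written out by hand."""
    x, y = p
    return np.array([2.0 * (x - 1.0), 4.0 * (y + 2.0)])

def gradient_descent(grad, start, step=0.1, iterations=200):
    """Repeatedly step opposite the gradient with a fixed step size."""
    p = np.asarray(start, dtype=float)
    for _ in range(iterations):
        p = p - step * grad(p)
    return p

minimum = gradient_descent(grad_f, start=[5.0, 5.0])

# The Hessian of f is constant here; positive eigenvalues confirm a bowl.
hessian_at_minimum = np.array([[2.0, 0.0], [0.0, 4.0]])

print("reached:", minimum)
print("gradient norm there:", np.linalg.norm(grad_f(minimum)))
print("Hessian eigenvalues:", np.linalg.eigvalsh(hessian_at_minimum))
```

The gradient does the walking and the Hessian certifies the destination: the iterate settles where nabla f = 0, and the all-positive eigenvalues say that point is a minimum.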