The Total Derivative & the Jacobian

From slope to best linear map

In Volume I the derivative f'(a) was the slope of the tangent line, a single number. But there is a deeper reading hiding inside that number: f'(a) is the multiplier in the best straight-line approximation of f near a, namely f(a + h) is approximately f(a) + f'(a) h. The honest statement is that the error f(a + h) - f(a) - f'(a) h shrinks faster than h itself as h goes to 0 — that is what "best" means. This linear-approximation viewpoint, not the slope picture, is the one that survives the jump to many variables.
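A quick numerical check can make "shrinks faster than h" concrete. The sketch below uses f(x) = sin x at a = 1, an illustrative choice that is not from the text, and prints the error divided by h for a few shrinking steps. If the linear model really is the best one, that ratio should itself head toward 0.

```python
import math

def linearization_error(f, fprime, a, h):
    """Return f(a + h) - f(a) - f'(a) h, the error of the tangent-line model."""
    return f(a + h) - f(a) - fprime(a) * h

# Illustrative choice (an assumption, not from the text): f = sin, a = 1.
a = 1.0
ratios = []
for h in (1e-1, 1e-2, 1e-3):
    err = linearization_error(math.sin, math.cos, a, h)
    ratios.append(abs(err) / h)
    print(f"h = {h:g}   error = {err:.3e}   |error|/h = {ratios[-1]:.3e}")
```

The error itself behaves like a multiple of h squared here, which is why dividing by h still leaves something that vanishes.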

Now let f take a point in the plane and return a number, f(x, y). A line no longer fits — near a point the graph is a surface, and the right approximating object is a flat tangent plane. "Best linear approximation" now means: there is a linear map L (something that takes the displacement vector h = (h1, h2) and returns a number) such that f(a + h) is approximately f(a) + L(h), with the error again vanishing faster than the length of h. That single linear map L is the total derivative. The slope has become a whole linear machine.

Why partial derivatives alone are not enough

A partial derivative like df/dx measures the rate of change of f as you walk in one axis direction only, freezing the others. The previous guide revisited these as the entries of the gradient nabla f = (df/dx, df/dy). It is tempting to declare a function differentiable the moment both partials exist. That is false, and the gap matters: partials probe only the two coordinate directions, while differentiability is a promise about every direction of approach at once.

The classic warning is f(x, y) = xy / (x^2 + y^2) with f(0, 0) defined to be 0. Along the x-axis f is identically 0, so df/dx at the origin is 0; along the y-axis the same, so df/dy is 0 too. Both partials exist and equal 0. Yet along the diagonal line y = x the function takes the value x*x / (2x^2) = 1/2 for every nonzero x, so it does not approach f(0, 0) = 0 as you slide in toward the origin, and f is not continuous there. Differentiability implies continuity (the linear model forces f(a + h) to approach f(a) as h shrinks), so a discontinuous function cannot have a tangent plane: f is not differentiable at the origin, even though its partials are perfectly well defined.
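The counterexample is easy to probe numerically. The following sketch evaluates the difference quotients at the origin along each axis and then samples the function along the diagonal y = x; the step sizes are arbitrary illustrative choices.

```python
def f(x, y):
    """The counterexample: xy / (x^2 + y^2), with f(0, 0) defined as 0."""
    if x == 0.0 and y == 0.0:
        return 0.0
    return x * y / (x * x + y * y)

h = 1e-6
# Difference quotients at the origin along each coordinate axis.
df_dx = (f(h, 0.0) - f(0.0, 0.0)) / h
df_dy = (f(0.0, h) - f(0.0, 0.0)) / h
print("partials at origin:", df_dx, df_dy)

# Values along the diagonal y = x as the point slides toward the origin.
diagonal_values = [f(t, t) for t in (1e-1, 1e-3, 1e-6)]
print("values on y = x:", diagonal_values)
```

Both quotients come out as 0 while the diagonal samples stay at 1/2, which is the discontinuity described above.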

The total differential: the gradient does the work

When f(x, y) is differentiable, the mysterious linear map L is not mysterious at all — it has to be built from the partials. The only linear map that matches f's one-direction rates is L(h1, h2) = (df/dx) h1 + (df/dy) h2, which is exactly the dot product of the gradient with the displacement: L(h) = nabla f · h. Writing the small displacements as dx and dy, this is the total differential df = (df/dx) dx + (df/dy) dy. It is the multivariable echo of dy = f'(x) dx.

Picture it concretely. Suppose a metal plate has temperature T(x, y) with nabla T = (3, -2) degrees per centimetre at a point. Step 0.1 cm in x and 0.05 cm in y. The total differential estimates the temperature change as dT = 3(0.1) + (-2)(0.05) = 0.3 - 0.1 = 0.2 degrees. Each partial contributes its own rate times its own step, and they simply add — no cross term, because the linear approximation, by design, ignores how the directions interact. That interaction lives in the second-order terms, which the next guides reach through the Hessian and the second-order Taylor expansion.
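The plate example can be reproduced in a few lines. The gradient and step come from the text; the temperature field T below is purely hypothetical, chosen only so that its gradient at the origin is (3, -2) and so that it contains a cross term the linear estimate cannot see.

```python
def total_differential(grad, step):
    """Dot product of the gradient with the displacement: the first-order change."""
    return sum(g * s for g, s in zip(grad, step))

grad_T = (3.0, -2.0)      # degrees per centimetre, from the text
step = (0.1, 0.05)        # centimetres, from the text
dT = total_differential(grad_T, step)
print(f"estimated change:  {dT:.4f} degrees")

# Hypothetical temperature field (an assumption for illustration only),
# chosen so that its gradient at the origin is (3, -2).
def T(x, y):
    return 20.0 + 3.0 * x - 2.0 * y + 0.5 * x * y

actual = T(*step) - T(0.0, 0.0)
print(f"actual change:     {actual:.4f} degrees")
print(f"dropped remainder: {actual - dT:.4f} degrees")
```

For this made-up field the remainder is exactly the cross term 0.5 times 0.1 times 0.05 = 0.0025, a second-order quantity that the total differential ignores by design.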

Two honest cautions. First, the total differential is an approximation, exact only in the limit of infinitesimal steps; for finite dx, dy there is an error, and it is precisely the higher-order remainder you are dropping. Second, this is also the engine of error propagation in the lab: if you know rough uncertainties dx and dy in your measurements, then to first order |df| is bounded by |df/dx||dx| + |df/dy||dy|, which gives a first estimate of the uncertainty in f. The estimate is linear and local, so it is trustworthy only while dx and dy are small enough for the linear model to hold.
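As a sketch of the error-propagation bound, consider a hypothetical measurement that is not from the text: a rectangle with sides x = 2.0 and y = 3.0 centimetres, uncertainties 0.01 and 0.02, and f(x, y) = xy its area. The helper name linear_uncertainty is likewise an illustrative invention.

```python
def linear_uncertainty(partials, uncertainties):
    """First-order bound: sum of |df/dx_i| * |dx_i| over the inputs."""
    return sum(abs(p) * abs(u) for p, u in zip(partials, uncertainties))

# Hypothetical measurement (an assumption): sides of a rectangle in cm.
x, y = 2.0, 3.0
dx, dy = 0.01, 0.02
partials = (y, x)            # for f = x * y: df/dx = y, df/dy = x
bound = linear_uncertainty(partials, (dx, dy))

# Worst-case exact change when both sides err upward, for comparison.
worst = (x + dx) * (y + dy) - x * y
print(f"area = {x * y:.2f} cm^2")
print(f"first-order bound = {bound:.4f} cm^2, exact worst case = {worst:.4f} cm^2")
```

The small gap between the two numbers is the product dx times dy, the second-order term the linear estimate leaves out.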

The Jacobian: packing a vector map's partials into a matrix

Step up once more: let the output be a vector too. A map F sends a point (x, y) to a pair (u, v) = (f1(x, y), f2(x, y)) — think of a coordinate change, or a physical transformation that bends one region of the plane onto another. Each component f1, f2 is an ordinary scalar function with its own gradient, hence its own total differential. Stack those gradients as the rows of a matrix and you have built the Jacobian matrix of F, the single object that is the total derivative of a vector-valued map.

F(x, y) = ( f1(x, y), f2(x, y) ).  Its Jacobian matrix:

  J = [ df1/dx, df1/dy ;
        df2/dx, df2/dy ]

Row i = gradient of the i-th output component.
Column j = how every output responds to input x_j.

Worked example:  F(x, y) = ( x^2 - y^2 , 2xy )   (squaring a complex number)

  df1/dx = 2x    df1/dy = -2y
  df2/dx = 2y    df2/dy =  2x

  J = [ 2x, -2y ;
        2y,  2x ]

Local linear model of the map near a:
  F(a + h) is approximately F(a) + J(a) h     (J(a) h is matrix times column vector h)

The Jacobian's rows are gradients, and the local model "F(a + h) is approximately F(a) + J(a) h" is the exact vector analogue of "f(a + h) is approximately f(a) + f'(a) h".
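The worked example can be checked against finite differences. In the sketch below the base point a = (1, 2), the displacement h, and the helper names are illustrative assumptions; the map and its analytic Jacobian are the ones from the worked example.

```python
def F(x, y):
    """The worked example: (x^2 - y^2, 2xy)."""
    return (x * x - y * y, 2.0 * x * y)

def jacobian_analytic(x, y):
    """Rows are the gradients of the two output components."""
    return [[2.0 * x, -2.0 * y],
            [2.0 * y,  2.0 * x]]

def jacobian_numeric(F, x, y, eps=1e-6):
    """Central-difference estimate; column j is the response to nudging input j."""
    fx_plus, fx_minus = F(x + eps, y), F(x - eps, y)
    fy_plus, fy_minus = F(x, y + eps), F(x, y - eps)
    return [[(fx_plus[i] - fx_minus[i]) / (2 * eps),
             (fy_plus[i] - fy_minus[i]) / (2 * eps)] for i in range(2)]

a = (1.0, 2.0)            # illustrative base point (an assumption)
h = (0.01, -0.02)         # illustrative small displacement (an assumption)
J = jacobian_analytic(*a)
J_num = jacobian_numeric(F, *a)

# Local linear model F(a) + J(a) h versus the true value F(a + h).
Fa = F(*a)
model = tuple(Fa[i] + J[i][0] * h[0] + J[i][1] * h[1] for i in range(2))
true = F(a[0] + h[0], a[1] + h[1])
print("analytic J:", J)
print("numeric  J:", J_num)
print("model:", model, " true:", true)
```

The model and the true value should differ only by terms of second order in h, which for this displacement are a few parts in ten thousand.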

Read the matrix two ways. Row i is the gradient of the i-th output, telling you how that one output responds to all the inputs. Column j collects all the df_i/dx_j for a fixed input x_j: how the whole output vector reacts when you nudge that one input. A gradient is just the special one-output case: a 1-by-n Jacobian. The velocity vector of a parametrised curve is the one-input case: an m-by-1 Jacobian. The Jacobian is the genuinely general derivative, and everything you already know is a slice of it.

Why matrices: composition becomes multiplication

Packing the partials into a matrix is not mere tidiness; it is what makes the chain rule beautiful. If you do one map F then another map G, the local linear model of the composition G of F is the composition of their linear models, and composing linear maps is exactly matrix multiplication. So the multivariable chain rule reads J_{G of F}(a) = J_G(F(a)) · J_F(a): the outer map's Jacobian, evaluated at F(a), times the inner map's Jacobian, evaluated at a, in that order, since matrix multiplication does not commute. The single-variable (g of f)' = g'(f(x)) f'(x) is precisely this with one-by-one matrices, where matrix multiplication is ordinary multiplication of numbers; the matrix form is the same statement grown up.
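The chain rule can be tested on a small composition. Here F is the squaring map from the worked example, while G(u, v) = (u + v, uv) is a hypothetical second map introduced only for illustration. The sketch multiplies the Jacobian of G, evaluated at F(a), by the Jacobian of F at a, and compares the product with a finite-difference Jacobian of the composition.

```python
def F(x, y):
    return (x * x - y * y, 2.0 * x * y)

def J_F(x, y):
    return [[2.0 * x, -2.0 * y], [2.0 * y, 2.0 * x]]

# Hypothetical second map (an assumption for illustration): G(u, v) = (u + v, u v).
def G(u, v):
    return (u + v, u * v)

def J_G(u, v):
    return [[1.0, 1.0], [v, u]]

def matmul(A, B):
    """Matrix product of two matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def jacobian_numeric(H, x, y, eps=1e-6):
    """Central-difference Jacobian of a map H from the plane to the plane."""
    cols = []
    for dx, dy in ((eps, 0.0), (0.0, eps)):
        plus, minus = H(x + dx, y + dy), H(x - dx, y - dy)
        cols.append([(p - m) / (2 * eps) for p, m in zip(plus, minus)])
    # cols[j][i] holds d(output i)/d(input j); transpose into rows.
    return [[cols[j][i] for j in range(2)] for i in range(2)]

a = (1.0, 2.0)                         # illustrative base point
chain = matmul(J_G(*F(*a)), J_F(*a))   # J_G evaluated at F(a), times J_F at a
direct = jacobian_numeric(lambda x, y: G(*F(x, y)), *a)
print("J_G(F(a)) J_F(a):  ", chain)
print("finite differences:", direct)
```

Note where each factor is evaluated: the outer Jacobian at F(a), the inner one at a. Swapping the order of the product, or evaluating J_G at a instead, generally gives a different matrix.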

When F maps from n inputs to n outputs, its Jacobian is square, and its determinant earns its own name: the Jacobian determinant, often written det(J) or d(u,v)/d(x,y). Geometrically it is the local area (or volume) scaling factor of the map: a tiny square of area dA near a point is carried to a tiny parallelogram of area |det J| dA. The sign tells you about orientation — a negative Jacobian determinant means the map flips the region over, like a mirror. This is the very factor that appears when you change variables in a multiple integral.
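The area-scaling claim can be measured directly. The sketch below pushes the corners of a tiny square through the squaring map F(x, y) = (x^2 - y^2, 2xy), computes the signed area of the image quadrilateral with the shoelace formula, and compares the ratio with det J. The base point, side length, and the mirror map are illustrative assumptions.

```python
def det2(J):
    return J[0][0] * J[1][1] - J[0][1] * J[1][0]

def signed_area(points):
    """Shoelace formula; positive for counter-clockwise vertex order."""
    total = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:] + points[:1]):
        total += x0 * y1 - x1 * y0
    return 0.5 * total

def F(x, y):
    return (x * x - y * y, 2.0 * x * y)

def J_F(x, y):
    return [[2.0 * x, -2.0 * y], [2.0 * y, 2.0 * x]]

def mirror(x, y):
    """Reflection across the x-axis; its Jacobian is [[1, 0], [0, -1]], det = -1."""
    return (x, -y)

a, s = (1.0, 2.0), 1e-3    # illustrative base point and side length
square = [(a[0], a[1]), (a[0] + s, a[1]), (a[0] + s, a[1] + s), (a[0], a[1] + s)]

ratio_F = signed_area([F(*p) for p in square]) / signed_area(square)
ratio_mirror = signed_area([mirror(*p) for p in square]) / signed_area(square)
print("det J_F(a) =", det2(J_F(*a)), "  measured area ratio =", ratio_F)
print("mirror area ratio =", ratio_mirror)
```

The measured ratio for F should sit close to det J = 4(x^2 + y^2) = 20 at this point, with the small discrepancy shrinking as s does. The mirror map gives a ratio near -1: same area, reversed orientation.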

This single number also decides whether a map is locally reversible. The inverse function theorem says that if F is continuously differentiable and det J is nonzero at a point a, then F is invertible on a small neighbourhood of a, and the Jacobian of the inverse at F(a) is just the matrix inverse of J(a), the multidimensional version of (f inverse)' = 1 / f'. Its sibling, the implicit function theorem, uses a nonvanishing sub-Jacobian to guarantee you can solve an equation locally for some variables in terms of the rest, even when no formula exists. Both are local theorems with genuine hypotheses, the topic of the guides ahead.
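For the squaring map F(x, y) = (x^2 - y^2, 2xy) a local inverse happens to be available in closed form, which allows a direct check of the claim about the inverse's Jacobian. The sketch assumes a base point with x > 0, where the principal complex square root from Python's cmath module undoes the squaring; the helper names are illustrative.

```python
import cmath

def J_F(x, y):
    return [[2.0 * x, -2.0 * y], [2.0 * y, 2.0 * x]]

def inverse_2x2(J):
    """Matrix inverse of a 2-by-2 matrix; requires a nonzero determinant."""
    det = J[0][0] * J[1][1] - J[0][1] * J[1][0]
    if det == 0.0:
        raise ValueError("det J is zero: the theorem gives no local inverse here")
    return [[ J[1][1] / det, -J[0][1] / det],
            [-J[1][0] / det,  J[0][0] / det]]

def F_inverse(u, v):
    """Local inverse of F near a point with x > 0: principal complex square root."""
    z = cmath.sqrt(complex(u, v))
    return (z.real, z.imag)

def jacobian_numeric(H, u, v, eps=1e-6):
    """Central-difference Jacobian of a map H from the plane to the plane."""
    cols = []
    for du, dv in ((eps, 0.0), (0.0, eps)):
        plus, minus = H(u + du, v + dv), H(u - du, v - dv)
        cols.append([(p - m) / (2 * eps) for p, m in zip(plus, minus)])
    return [[cols[j][i] for j in range(2)] for i in range(2)]

a = (1.0, 2.0)                                   # illustrative base point
b = (a[0] ** 2 - a[1] ** 2, 2.0 * a[0] * a[1])   # F(a) = (-3, 4)
predicted = inverse_2x2(J_F(*a))                 # matrix inverse of J_F(a)
measured = jacobian_numeric(F_inverse, *b)       # Jacobian of the inverse at F(a)
print("inverse of J_F(a):       ", predicted)
print("Jacobian of F_inverse(b):", measured)
```

At the origin det J is 0 and inverse_2x2 raises, matching the fact that squaring is not invertible on any neighbourhood of 0.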

Putting it together

Decide the shapes: F goes from n inputs to m outputs, so its Jacobian is an m-by-n matrix (rows count outputs, columns count inputs).
Fill entry (i, j) with df_i/dx_j — compute each partial derivative by holding the other inputs fixed, exactly as in Volume I.
Evaluate J at the base point a to get a concrete matrix of numbers; the local model is F(a + h) is approximately F(a) + J(a) h.
If the map is square, take det J: |det J| is the local area-scaling factor, its sign records orientation, and a nonzero value means the map is locally invertible. Chaining maps multiplies their Jacobians.
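The recipe above can be sketched for a general map from n inputs to m outputs. Both example maps below are illustrative assumptions rather than examples from the text: a map from 2 inputs to 3 outputs to show the m-by-n shape, and the polar-to-Cartesian map as a square case. The partials are estimated by central differences instead of being computed by hand.

```python
import math

def jacobian(F, a, eps=1e-6):
    """Central-difference Jacobian of F at a: m rows (outputs), n columns (inputs)."""
    n = len(a)
    columns = []
    for j in range(n):
        plus = list(a)
        minus = list(a)
        plus[j] += eps
        minus[j] -= eps
        f_plus, f_minus = F(plus), F(minus)
        columns.append([(p - q) / (2 * eps) for p, q in zip(f_plus, f_minus)])
    m = len(columns[0])
    # columns[j][i] is d(output i)/d(input j); transpose so rows count outputs.
    return [[columns[j][i] for j in range(n)] for i in range(m)]

# Shapes and entries: a hypothetical map from 2 inputs to 3 outputs, so J is 3-by-2.
def surface(p):
    x, y = p
    return [x, y, x * x + y * y]

J_surface = jacobian(surface, [1.0, 2.0])
print("shape of J_surface:", len(J_surface), "by", len(J_surface[0]))

# Square case: polar-to-Cartesian map evaluated at (r, theta) = (2, pi/6).
def polar(p):
    r, theta = p
    return [r * math.cos(theta), r * math.sin(theta)]

J_polar = jacobian(polar, [2.0, math.pi / 6])
det_polar = J_polar[0][0] * J_polar[1][1] - J_polar[0][1] * J_polar[1][0]
print("det J_polar:", det_polar)
```

For the polar map the determinant works out analytically to r, so the printed value should be close to 2 at this base point.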

Hold the whole arc in one view. The slope generalised to the total derivative, a best linear map; that map's coordinates, the partial derivatives, line up as the rows of the Jacobian; the Jacobian's determinant measures how the map stretches and orients space. From here the change-of-variables theorem for integrals, the chain rule for any composition, and the inverse and implicit function theorems are all just the Jacobian seen from different angles. One matrix, carrying all the first-order information about a map, is the organising idea of the rest of this rung.