The Multivariable Chain Rule

One walker, many routes for change

Recall the single-variable chain rule from Volume I: if y depends on u and u depends on x, then dy/dx = (dy/du)(du/dx). It says change travels down a chain, and the rates multiply — a gear train, where turning the input spins the output by the product of the gear ratios. That picture is complete when there is exactly one path from x to y. The whole story of this guide is what happens when there is more than one path, because in several variables there almost always is.

Here is the scene that makes it concrete. Let T(x, y) be the temperature at each point of a room — a function of two variables. Now you walk a path, so your position (x(t), y(t)) depends on time t. As you stroll, the temperature you feel, call it T(x(t), y(t)), is now a plain function of t alone. Question: how fast does the felt temperature change, dT/dt? Two things are happening at once. As t ticks forward you slide a little in x, and the temperature responds through its sensitivity in x, namely the partial derivative partial T / partial x. But you also slide a little in y, and the temperature responds through partial T / partial y. Both nudges happen in the same instant, and both feed into the one number dT/dt.

So the answer cannot be a single product; it must be a sum of products, one product for each variable that carries change from t to T: dT/dt = (partial T / partial x)(dx/dt) + (partial T / partial y)(dy/dt). Each term reads exactly like the single-variable chain rule — sensitivity of T to that variable, times the speed of that variable — and we simply add the contributions because change can reach T by either route. That single line is the whole multivariable chain rule in its commonest form. Everything else is bookkeeping for more elaborate webs of dependence.
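A quick numerical check can make the sum of products tangible. The sketch below assumes a hypothetical temperature field T(x, y) = 20 + xy and a hypothetical path x(t) = t, y(t) = t^2 (neither comes from the text above; they are chosen only because the derivatives are easy to read). It compares the chain-rule sum with a finite-difference estimate of how fast the felt temperature changes.

```python
# Hypothetical temperature field and path, chosen only for illustration.
def T(x, y):
    return 20.0 + x * y

def x_of(t):
    return t

def y_of(t):
    return t * t

def chain_rule_rate(t):
    """dT/dt as a sum of two products, one per variable carrying change."""
    x, y = x_of(t), y_of(t)
    dT_dx = y          # partial T / partial x for T = 20 + x*y
    dT_dy = x          # partial T / partial y
    dx_dt = 1.0        # derivative of x(t) = t
    dy_dt = 2.0 * t    # derivative of y(t) = t^2
    return dT_dx * dx_dt + dT_dy * dy_dt

def finite_difference_rate(t, h=1e-6):
    """Central-difference estimate of d/dt of the felt temperature."""
    felt = lambda s: T(x_of(s), y_of(s))
    return (felt(t + h) - felt(t - h)) / (2.0 * h)

t0 = 2.0
print(chain_rule_rate(t0))         # x-route plus y-route
print(finite_difference_rate(t0))  # should agree to several decimal places
```

With this choice the felt temperature is 20 + t^3, so both numbers should sit close to 3t^2 evaluated at t0.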

Why a sum: it is just the gradient meeting the velocity

Where does the sum-of-products really come from? From the deepest fact of this rung: near a point, a differentiable function is well approximated by a linear map, the total derivative. For T at a point, that linear approximation says a small displacement (dx, dy) changes T by approximately dT = (partial T / partial x) dx + (partial T / partial y) dy. This object dT is the total differential: it bundles both partial sensitivities into one honest first-order estimate of the change. The multivariable chain rule is nothing more than this estimate, divided through by dt and made exact in the limit.
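To see what "first-order estimate" means in practice, here is a minimal sketch that compares the total differential with the true change. The function T(x, y) = x^2 y, the base point (1, 2), and the displacement are illustrative assumptions. If the estimate is truly first order, halving the displacement should cut the gap by roughly a factor of four.

```python
# Illustrative field; partials are worked out by hand for T = x^2 * y.
def T(x, y):
    return x * x * y

x0, y0 = 1.0, 2.0          # assumed base point
dT_dx = 2.0 * x0 * y0      # partial T / partial x at the base point
dT_dy = x0 * x0            # partial T / partial y at the base point

def gap(scale, dx=0.01, dy=-0.02):
    """Distance between the true change in T and the total differential."""
    dx, dy = scale * dx, scale * dy
    linear_estimate = dT_dx * dx + dT_dy * dy
    actual_change = T(x0 + dx, y0 + dy) - T(x0, y0)
    return abs(actual_change - linear_estimate)

gap_full = gap(1.0)
gap_half = gap(0.5)
print(gap_full, gap_half, gap_full / gap_half)  # ratio near 4 signals a second-order gap
```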

There is an even cleaner way to see the same line. Collect the two partials into the gradient vector nabla T = (partial T / partial x, partial T / partial y), and collect the two speeds into the velocity vector v = (dx/dt, dy/dt). Then dT/dt is exactly the dot product nabla T . v. The chain rule is the gradient dotted with the velocity: how steeply the landscape rises, projected onto the direction you are actually moving. This is also why the directional derivative — the rate of climb in a chosen direction — is nabla T dotted with a unit direction. Same machinery, dressed for a different question.
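The directional derivative mentioned above can be computed the same way, as a dot product. This sketch assumes the illustrative field T(x, y) = x^2 y, the point (1, 2), and the direction (3, 4); it normalizes the direction, dots it with the gradient, and checks the result against a finite difference taken along that direction.

```python
import math

# Illustrative field and its hand-computed gradient.
def T(x, y):
    return x * x * y

def grad_T(x, y):
    return (2.0 * x * y, x * x)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

point = (1.0, 2.0)       # assumed evaluation point
direction = (3.0, 4.0)   # assumed direction, not yet unit length

length = math.hypot(direction[0], direction[1])
unit = (direction[0] / length, direction[1] / length)

# Gradient dotted with the unit direction.
directional = dot(grad_T(*point), unit)

# Central difference along the same unit direction, as an independent check.
h = 1e-6
forward = T(point[0] + h * unit[0], point[1] + h * unit[1])
backward = T(point[0] - h * unit[0], point[1] - h * unit[1])
directional_fd = (forward - backward) / (2.0 * h)

print(directional, directional_fd)
```

Replacing the unit direction with a velocity vector (dx/dt, dy/dt) turns the same dot product into dT/dt along a path.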

The tree diagram: bookkeeping you can trust

Once the web of dependence has more than two layers, it becomes dangerously easy to drop a term. The tree diagram is a small drawing that guarantees you write down exactly the right sum — it is bookkeeping, not new theory. You draw the final quantity at the top, branch down to each variable it depends on directly, then branch again to whatever those depend on, and so on, until you reach the variable you are differentiating with respect to. Every line you draw is a partial (or ordinary) derivative of the thing above it with respect to the thing below it.

Then two rules harvest the answer, and they are unforgettable once stated. Multiply along each path from top to bottom. Add over all the distinct paths. That is the entire algorithm: every route from the final quantity down to the variable contributes one product of the derivatives strung along it, and the total derivative is the sum of those products. The temperature example has two paths from T down to t (through x, and through y), giving the two-term sum we already wrote. Branch counting replaces memory: you never have to recall how many terms a given setup produces — you read it off the tree.

        w = f(x, y, z)
       /      |       \        <- branches: partial w/partial x, etc.
      x       y        z
     / \     / \      / \
    s   t   s   t    s   t     <- branches: partial x/partial s, etc.

  partial w / partial s
    = (partial w/partial x)(partial x/partial s)
    + (partial w/partial y)(partial y/partial s)
    + (partial w/partial z)(partial z/partial s)

  three top-to-bottom paths reach s through x, y, z -> three terms

A two-layer tree: w depends on x, y, z, each of which depends on s and t. Multiply along each path, add the three paths that reach s. The three paths that reach t build partial w / partial t the same way.
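The multiply-along-paths, add-over-paths rule is mechanical enough to write as a short recursive function. The sketch below is one possible encoding, not a standard library routine: the tree is a dictionary from each node to its (child, derivative value) branches, and the functions w = xy + z, x = s + t, y = st, z = s^2 are hypothetical choices made only so the numbers can be checked by hand.

```python
def path_sum(tree, node, target):
    """Sum, over every path from node down to target, the product of branch derivatives."""
    if node == target:
        return 1.0                      # empty product: we have arrived
    total = 0.0
    for child, derivative in tree.get(node, []):
        # Multiply along this branch, then add over all branches.
        total += derivative * path_sum(tree, child, target)
    return total

# Hypothetical setup: w = x*y + z, with x = s + t, y = s*t, z = s^2.
s, t = 1.0, 2.0
x, y, z = s + t, s * t, s * s

# Each branch carries the derivative of the upper node with respect to the lower one.
tree = {
    "w": [("x", y), ("y", x), ("z", 1.0)],
    "x": [("s", 1.0), ("t", 1.0)],
    "y": [("s", t), ("t", s)],
    "z": [("s", 2.0 * s), ("t", 0.0)],
}

dw_ds = path_sum(tree, "w", "s")
dw_dt = path_sum(tree, "w", "t")
print(dw_ds, dw_dt)
```

Substituting first gives w = s^2 t + s t^2 + s^2, whose partials 2st + t^2 + 2s and s^2 + 2st offer a hand check on the two printed values.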

The matrix view: chain rule as multiplying Jacobians

The tree is wonderful for a single output, but when several outputs depend on several inputs the cleanest statement is matrix multiplication. Stack the partial derivatives of a vector map into its Jacobian matrix: one row per output component, one column per input variable. For a map taking (x, y) to (u, v), the Jacobian is [partial u/partial x, partial u/partial y; partial v/partial x, partial v/partial y]. This is just the total derivative of the map written as a grid — the best linear approximation, now for inputs and outputs that are both vectors.

Now the chain rule reads with startling economy: the Jacobian of a composition is the product of the Jacobians. If g maps inputs to middle variables and f maps middle variables to outputs, then the Jacobian of f-after-g at a point p is J_f times J_g, an honest matrix multiplication, with J_g evaluated at p and J_f evaluated at the middle point g(p). This is the exact several-variable echo of dy/dx = (dy/du)(du/dx): the scalars have grown into matrices, and ordinary multiplication has grown into matrix multiplication, but the shape of the law is identical. Every entry of the product, written out, is precisely one of the sum-over-paths formulas the tree would hand you: matrix multiplication IS the tree's add-the-paths rule, organized into rows and columns.

Order matters now. Matrix multiplication does not commute, so it must be J_f times J_g — the outer map's Jacobian on the left, the inner on the right — matching the order in which you compose the maps. The single-variable rule let you write the factors in any order because numbers commute; that freedom is gone the moment outputs and inputs are vectors. Keep the matrices in composition order and the dimensions line up automatically: an output-by-middle matrix times a middle-by-input matrix gives an output-by-input matrix, exactly the Jacobian of the whole composition.
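Here is a small sketch of the Jacobian product, with matrices stored as plain lists of rows. The maps g(s, t) = (s + t, st, s^2) and f(x, y, z) = (xy + z, x - z) are hypothetical examples, and their Jacobians are written out by hand. Note where each Jacobian is evaluated: J_g at the input point, J_f at the middle point g(s, t).

```python
def matmul(A, B):
    """Multiply an m-by-k matrix by a k-by-n matrix, both given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

# Hypothetical inner map g: (s, t) -> (x, y, z), and its 3-by-2 Jacobian.
def g(s, t):
    return (s + t, s * t, s * s)

def J_g(s, t):
    return [[1.0, 1.0],
            [t, s],
            [2.0 * s, 0.0]]

# Hypothetical outer map f: (x, y, z) -> (x*y + z, x - z), and its 2-by-3 Jacobian.
def f(x, y, z):
    return (x * y + z, x - z)

def J_f(x, y, z):
    return [[y, x, 1.0],
            [1.0, 0.0, -1.0]]

s0, t0 = 1.0, 2.0
middle = g(s0, t0)

# Outer Jacobian (at the middle point) on the left, inner Jacobian on the right.
J_composite = matmul(J_f(*middle), J_g(s0, t0))
print(J_composite)  # a 2-by-2 matrix: outputs by inputs
```

Swapping the factors, matmul(J_g(s0, t0), J_f(*middle)), still runs because the shapes happen to be compatible, but it yields a 3-by-3 matrix that is not the Jacobian of the composition.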

The special cases that appear constantly

A handful of shapes recur so often they are worth recognizing on sight. First, the one-ultimate-variable case we opened with: w = f(x, y) with x, y both depending on t gives the ordinary derivative dw/dt = (partial w/partial x)(dx/dt) + (partial w/partial y)(dy/dt). This is the workhorse behind related rates, behind the rate of change of any field along a moving particle, and behind energy methods in physics. Second, the change-of-coordinates case, indispensable for polar, cylindrical, and spherical work: rewriting a function in new variables r, theta means each old partial becomes a chain-rule sum, e.g. partial f/partial r = (partial f/partial x)(partial x/partial r) + (partial f/partial y)(partial y/partial r), which is how the Laplacian and gradient get their curvilinear forms.
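The change-of-coordinates case can be checked numerically in a few lines. This sketch assumes the illustrative function f(x, y) = xy and the standard polar substitution x = r cos(theta), y = r sin(theta); it builds partial f/partial r from the chain-rule sum and compares it with a finite difference taken in r at fixed theta.

```python
import math

# Illustrative function in Cartesian variables.
def f(x, y):
    return x * y

def df_dr_chain(r, theta):
    """partial f / partial r via the chain rule, for x = r cos(theta), y = r sin(theta)."""
    x, y = r * math.cos(theta), r * math.sin(theta)
    df_dx, df_dy = y, x                  # partials of f = x*y
    dx_dr, dy_dr = math.cos(theta), math.sin(theta)
    return df_dx * dx_dr + df_dy * dy_dr

def df_dr_fd(r, theta, h=1e-6):
    """Central difference in r with theta held fixed."""
    in_polar = lambda rr: f(rr * math.cos(theta), rr * math.sin(theta))
    return (in_polar(r + h) - in_polar(r - h)) / (2.0 * h)

r0, theta0 = 2.0, math.pi / 6
print(df_dr_chain(r0, theta0), df_dr_fd(r0, theta0))
```

In polar form this f is r^2 sin(theta) cos(theta), so both values should be close to 2r sin(theta) cos(theta).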

Third, and a famous trap: the case where a variable appears both directly and through other variables. Suppose w = f(x, y, t) but x and y also depend on t. Then the rate of change of w as t varies is dw/dt = (partial f/partial x)(dx/dt) + (partial f/partial y)(dy/dt) + partial f/partial t. The last term is the explicit, hold-everything-else-fixed sensitivity to t, and it is easy to forget — the tree saves you, because t appears as its own bottom branch directly under f as well as under x and under y. The notation strains here: dw/dt (the total rate, all routes) is genuinely different from partial f/partial t (just the explicit route). Confusing the two is one of the most common and most consequential errors in all of applied calculus.
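The trap is easy to demonstrate with numbers. The sketch below assumes a hypothetical f(x, y, t) = xy + t^2 with x = cos t and y = sin t, and computes three things: the full rate with all three routes, the same sum with the explicit route forgotten, and a finite-difference estimate of the true rate for comparison.

```python
import math

# Hypothetical function with explicit t-dependence, plus a path for x and y.
def f(x, y, t):
    return x * y + t * t

def w(t):
    return f(math.cos(t), math.sin(t), t)

def total_rate(t, include_explicit=True):
    """dw/dt as a sum over routes; the explicit route can be switched off to show the error."""
    x, y = math.cos(t), math.sin(t)
    via_x = y * (-math.sin(t))   # (partial f/partial x)(dx/dt)
    via_y = x * math.cos(t)      # (partial f/partial y)(dy/dt)
    explicit = 2.0 * t           # partial f/partial t, holding x and y fixed
    return via_x + via_y + (explicit if include_explicit else 0.0)

def rate_fd(t, h=1e-6):
    """Central-difference estimate of the true total rate dw/dt."""
    return (w(t + h) - w(t - h)) / (2.0 * h)

t0 = 1.0
print(total_rate(t0))                          # all three routes
print(total_rate(t0, include_explicit=False))  # explicit route forgotten
print(rate_fd(t0))                             # should match the first line
```

The gap between the first two printed values is exactly partial f/partial t, which is the difference between the total rate and the explicit-only sensitivity.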

Fourth, implicit differentiation reborn. Recall from Volume I how you differentiated an equation like x^2 + y^2 = 1 to find dy/dx without solving for y. The chain rule explains why that worked and generalizes it: if a relation F(x, y) = 0 silently defines y as a function of x, differentiate both sides with respect to x by the chain rule, giving partial F/partial x + (partial F/partial y)(dy/dx) = 0, and solve to get dy/dx = -(partial F/partial x)/(partial F/partial y), valid wherever partial F/partial y is nonzero. That this is even legitimate, that the relation really does define y(x) near a point, is the promise of the implicit function theorem, which the next ideas in this rung make precise; the chain rule is the engine that then computes the slope.
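For the circle, the implicit slope formula can be compared against the explicit upper branch. This is a minimal sketch: the point (0.6, 0.8) and the helper names are illustrative assumptions, and the explicit branch y = sqrt(1 - x^2) is used only as a cross-check that happens to be available for this particular relation.

```python
import math

# The relation F(x, y) = 0 for the unit circle, with hand-computed partials.
def F(x, y):
    return x * x + y * y - 1.0

def implicit_slope(x, y):
    """dy/dx = -(partial F/partial x) / (partial F/partial y), where the denominator is nonzero."""
    F_x = 2.0 * x
    F_y = 2.0 * y
    if F_y == 0.0:
        raise ValueError("partial F/partial y vanishes; the formula does not apply here")
    return -F_x / F_y

def explicit_slope_fd(x, h=1e-6):
    """Central-difference slope of the upper branch y = sqrt(1 - x^2)."""
    upper = lambda u: math.sqrt(1.0 - u * u)
    return (upper(x + h) - upper(x - h)) / (2.0 * h)

x0, y0 = 0.6, 0.8   # assumed point on the upper half of the circle
print(implicit_slope(x0, y0), explicit_slope_fd(x0))
```

At (1, 0) the function raises instead of dividing by zero, which matches the vertical tangent there: near that point the circle does not define y as a function of x.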

A worked walk, end to end

Let us run the temperature walk with real numbers to see the routes add. Take T(x, y) = x^2 y, and walk the path x(t) = cos t, y(t) = sin t — a unit circle, traced counterclockwise. We want dT/dt: how fast the felt temperature changes as we circle. The pieces are partial T/partial x = 2xy and partial T/partial y = x^2, while dx/dt = -sin t and dy/dt = cos t. The walk below assembles them.

Draw the tree: T sits on top, branches down to x and to y, and each of those branches down to the single ultimate variable t. Two top-to-bottom paths reach t — exactly two terms.
Multiply along each path: the x-route gives (2xy)(-sin t); the y-route gives (x^2)(cos t).
Add the paths: dT/dt = -2xy sin t + x^2 cos t.
Substitute the path x = cos t, y = sin t: dT/dt = (-2 cos t sin t)(sin t) + (cos^2 t)(cos t) = -2 cos t sin^2 t + cos^3 t.
Sanity check by the other road: substitute the path FIRST, getting T = cos^2 t sin t, then differentiate this single-variable function directly. You get the very same -2 cos t sin^2 t + cos^3 t. Two routes, one answer.
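The two roads in the walk above can also be compared numerically. The sketch below assembles dT/dt from the tree (partials times speeds, then added), evaluates the closed-form answer from the substitution step, and differentiates the substituted function cos^2 t sin t by a finite difference. The function names and sample times are illustrative choices.

```python
import math

def rate_from_tree(t):
    """dT/dt for T = x^2 * y on the unit circle, built route by route."""
    x, y = math.cos(t), math.sin(t)
    x_route = (2.0 * x * y) * (-math.sin(t))   # (partial T/partial x)(dx/dt)
    y_route = (x * x) * math.cos(t)            # (partial T/partial y)(dy/dt)
    return x_route + y_route

def rate_closed_form(t):
    """The simplified answer: -2 cos t sin^2 t + cos^3 t."""
    return -2.0 * math.cos(t) * math.sin(t) ** 2 + math.cos(t) ** 3

def rate_substitute_first(t, h=1e-6):
    """Substitute the path first, then differentiate the one-variable function numerically."""
    felt = lambda s: math.cos(s) ** 2 * math.sin(s)
    return (felt(t + h) - felt(t - h)) / (2.0 * h)

for t in (0.0, 0.7, 2.5):
    print(t, rate_from_tree(t), rate_closed_form(t), rate_substitute_first(t))
```

Each row should show three nearly identical numbers; at t = 0 all of them are close to 1, since only the cos^3 t term survives there.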

That sanity check is worth pausing on, because it reveals the honest status of the chain rule. You can ALWAYS substitute the inner functions first and then differentiate as a single-variable problem — that is not cheating, it is the definition, and it must give the same answer. The chain rule is not a different truth; it is a labor-saving organizer that lets you differentiate without first carrying out the substitution, which is priceless when the substitution is ugly or when you only have numerical values of the partials at a point, not formulas. Keep both roads in mind: the chain rule for speed and structure, direct substitution as the ground truth you can always fall back to.