The Change-of-Variables Formula

From the cdf method to a shortcut

In the previous guide you learned the cdf method: to find the distribution of Y = g(X), you write P(Y <= y), translate it into a statement about X, read off F_X, and then differentiate to get the density. That recipe never fails. But it can feel like a lot of bookkeeping when g is something tame like Y = 2X + 1 or Y = X^2, and after a few examples you start to notice the same pattern repeating. The change-of-variables formula is that pattern, packaged so you can skip straight to the answer.

The formula applies in a specific friendly case: X is continuous with density f_X, and g is a smooth, strictly monotone (one-to-one) function on the support of X. "Strictly monotone" means g is always increasing or always decreasing, so that each y comes from exactly one x. Many of the transformations you care about — scaling, shifting, taking a log, taking an exponential — are like this on a suitable interval. When g folds two x-values onto one y (as Y = X^2 does, sending both +2 and -2 to 4), the simple formula does not apply as stated, and we handle that at the end.

The formula, and why the derivative shows up

Here is the statement. If Y = g(X) with g smooth and strictly monotone, let x = h(y) be the inverse function, so h undoes g. Then the change-of-variables formula says the density of Y is f_Y(y) = f_X(h(y)) * |h'(y)|. In words: to get the height of the new density at y, take the old density at the matching point x = h(y), and then multiply by the absolute value of the derivative of the inverse. That extra factor |h'(y)| is the whole story; everything subtle lives there.

Why is that factor there? Because a density is probability per unit length, and a transformation changes how much length you are spreading the probability over. Picture a thin sliver of x-values of width |dx|; it holds probability about f_X(x) * |dx|. The transformation maps that sliver to a sliver of y-values of width |dy|, and those two slivers must hold the same probability, because no probability is created or destroyed by relabeling. So f_Y(y) * |dy| = f_X(x) * |dx|, which rearranges into f_Y(y) = f_X(x) * |dx/dy|. The absolute values are there because widths are positive even when g is decreasing and dx and dy have opposite signs. The derivative |dx/dy| = |h'(y)| is precisely the local stretch factor: how much x-length corresponds to one unit of y-length.
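The sliver argument can be checked numerically. The sketch below is only an illustration under assumed choices that are not from the text: X is taken to be exponential with rate 1 and g(x) = e^x, and the names f_X, g, h and h_prime are hypothetical helpers. It compares the density obtained by dividing the sliver's probability by its image width with the density given by the formula.

```python
import math

# Illustrative assumption (not from the text): X ~ Exponential(1), Y = g(X) = e^X.
def f_X(x):
    return math.exp(-x) if x >= 0 else 0.0

def g(x):
    return math.exp(x)

def h(y):
    # Inverse of g: solves y = e^x for x.
    return math.log(y)

def h_prime(y):
    # Derivative of the inverse.
    return 1.0 / y

x, dx = 0.7, 1e-6
y = g(x)
dy = g(x + dx) - g(x)              # width of the image sliver in y

sliver_prob = f_X(x) * dx          # approximate probability held by the x-sliver
density_from_sliver = sliver_prob / abs(dy)
density_from_formula = f_X(h(y)) * abs(h_prime(y))

print(density_from_sliver, density_from_formula)
```

The two numbers should agree to several decimal places, and the agreement should improve as dx shrinks.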

A worked example you can check by hand

Let X be uniform on [0, 1], so f_X(x) = 1 for 0 <= x <= 1 and 0 elsewhere. Define Y = -ln(X) / 2 — a smooth, strictly decreasing transformation that maps the interval (0, 1] onto [0, infinity). We expect Y to be an exponential-type variable; let us confirm it and pin down the rate. First the inverse: solving y = -ln(x)/2 for x gives x = h(y) = e^(-2y). Its derivative is h'(y) = -2 * e^(-2y), so |h'(y)| = 2 * e^(-2y).

Now plug into the formula. Since f_X equals 1 on its support, f_Y(y) = f_X(e^(-2y)) * |h'(y)| = 1 * 2 * e^(-2y) = 2 * e^(-2y) for y >= 0. That is exactly the density of an exponential variable with rate lambda = 2. So squeezing a uniform variable through Y = -ln(X)/2 produces an exponential distribution — a fact you will use again when you study how to simulate random variables. You can sanity-check the answer without the formula at all: it should integrate to 1, and indeed the integral of 2 * e^(-2y) from 0 to infinity equals 1.
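A quick simulation can back up the result. This is a minimal sketch using only the standard library; the sample size, the seed and the test point y0 are arbitrary choices. It draws uniforms, applies Y = -ln(X)/2, and compares the sample mean with 1/lambda = 0.5 and the empirical cdf with 1 - e^(-2y).

```python
import math
import random

random.seed(0)                      # fixed seed so the run is repeatable
n = 200_000

# 1.0 - random.random() lies in (0, 1], which avoids taking log(0).
samples = [-math.log(1.0 - random.random()) / 2 for _ in range(n)]

sample_mean = sum(samples) / n      # an exponential with rate 2 has mean 1/2

y0 = 0.5
empirical_cdf = sum(s <= y0 for s in samples) / n
exact_cdf = 1 - math.exp(-2 * y0)   # cdf of an exponential with rate 2

print(sample_mean, empirical_cdf, exact_cdf)
```

Agreement to about two decimal places is all a simulation of this size should be expected to show.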

The general recipe, step by step:
1. Check that g is smooth and strictly monotone on the support of X (if it folds, split the support into monotone pieces; see the last section).
2. Invert: solve y = g(x) for x to get x = h(y), and note the new range of y values.
3. Differentiate the inverse to get h'(y), then take the absolute value |h'(y)|.
4. Write f_Y(y) = f_X(h(y)) * |h'(y)|, valid on the new range and zero outside it.
5. Sanity-check by confirming f_Y integrates to 1 over its range.
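The recipe above can be sketched as a small helper. The names transformed_density and integrate are hypothetical, introduced only for this illustration. The caller is assumed to have done the monotonicity check and to supply the inverse h, its derivative and the new range by hand. The helper is applied to the uniform example, with the infinite range truncated at 20 for the numerical integral.

```python
import math

def transformed_density(f_X, h, h_prime, y_low, y_high):
    """Build f_Y(y) = f_X(h(y)) * |h'(y)| on [y_low, y_high], zero outside."""
    def f_Y(y):
        if y < y_low or y > y_high:
            return 0.0
        return f_X(h(y)) * abs(h_prime(y))
    return f_Y

def integrate(f, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    step = (b - a) / n
    return sum(f(a + (i + 0.5) * step) for i in range(n)) * step

def uniform_density(x):
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

# Worked example: X uniform on [0, 1], Y = -ln(X)/2, so h(y) = e^(-2y).
f_Y = transformed_density(
    uniform_density,
    h=lambda y: math.exp(-2 * y),
    h_prime=lambda y: -2 * math.exp(-2 * y),
    y_low=0.0,
    y_high=math.inf,
)

total = integrate(f_Y, 0.0, 20.0)   # the tail beyond 20 is negligible
print(f_Y(1.0), total)
```

Passing the inverse and its derivative explicitly keeps the sketch honest: the formula does not find h for you, and a wrong h silently gives a wrong density, which is why the final integration check is worth running.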

The linear case, and a warning about the support

The simplest and most common transformation is linear: Y = aX + b with a not equal to 0. The inverse is x = (y - b)/a, whose derivative is 1/a, so |h'(y)| = 1/|a|. The formula gives f_Y(y) = f_X((y - b)/a) * (1/|a|). This single line explains a fact you have leaned on for a while: a constant shift b just slides the density sideways without changing its shape, while a scale factor a stretches it horizontally by |a| and — to keep the total area equal to 1 — squashes its height by 1/|a|. Stretch wider, get shorter; that trade-off is forced by probability conservation.
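As a small illustration of the linear rule, the sketch below wraps it in a hypothetical helper linear_density and applies it to an assumed example that is not from the text: X uniform on [0, 1] with a = -2 and b = 1. The negative slope shows why the absolute value matters.

```python
def linear_density(f_X, a, b):
    """Density of Y = a*X + b from the density of X (requires a != 0)."""
    if a == 0:
        raise ValueError("a must be nonzero")
    def f_Y(y):
        return f_X((y - b) / a) / abs(a)
    return f_Y

def uniform_density(x):
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

# Y = -2X + 1 maps [0, 1] onto [-1, 1]: twice as wide, so half as tall.
f_Y = linear_density(uniform_density, a=-2.0, b=1.0)

print(f_Y(0.0), f_Y(1.5), f_Y(-1.5))
```

Inside [-1, 1] the height is 1/2, and the width 2 times the height 1/2 gives total area 1, as probability conservation requires.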

Apply this to X ~ Normal(mu, sigma^2) and Y = (X - mu)/sigma. Here a = 1/sigma and b = -mu/sigma, so h(y) = mu + sigma*y and |h'(y)| = sigma. Plugging in, f_Y(y) = f_X(mu + sigma*y) * sigma = (1/(sigma*sqrt(2*pi))) * e^(-y^2/2) * sigma = (1/sqrt(2*pi)) * e^(-y^2/2), which is the standard normal density. That is the change-of-variables formula quietly justifying the z-score standardization you met earlier. It is not magic, just the linear rule applied to a Gaussian.
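The standardization claim can be checked numerically. In this sketch the values mu = 3 and sigma = 2 and the evaluation points are arbitrary illustrative choices. It applies the linear rule to a normal density and compares the result with the standard normal density.

```python
import math

def normal_density(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def standard_normal_density(y):
    return math.exp(-0.5 * y * y) / math.sqrt(2 * math.pi)

mu, sigma = 3.0, 2.0                 # arbitrary illustrative parameters
a, b = 1 / sigma, -mu / sigma        # Y = a*X + b = (X - mu) / sigma

def f_Y(y):
    # Linear rule: f_Y(y) = f_X((y - b) / a) * (1 / |a|)
    return normal_density((y - b) / a, mu, sigma) / abs(a)

points = [-2.0, -1.0, 0.0, 0.5, 1.5]
max_gap = max(abs(f_Y(y) - standard_normal_density(y)) for y in points)
print(max_gap)
```

The gap should be at the level of floating-point rounding, since the two expressions are algebraically identical.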

When g is not one-to-one, and the leap to many dimensions

What about Y = X^2, where both +x and -x land on the same y? The clean formula does not apply directly, because each y > 0 has two pre-images. The honest fix is to split the domain into pieces where g is one-to-one, apply the formula on each piece, and add the contributions. For Y = X^2 the two branches are x = sqrt(y) and x = -sqrt(y), and they give f_Y(y) = [f_X(sqrt(y)) + f_X(-sqrt(y))] * (1/(2*sqrt(y))) for y > 0. The factor 1/(2*sqrt(y)) is just |h'(y)|, which happens to be the same on both branches; the sum over branches is the new ingredient. This is exactly the kind of folding case where the cdf method from the previous guide is often the safer route, and the two methods must agree.
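To see the two methods agree, the sketch below takes X to be standard normal. That choice is an assumption for illustration, since the text leaves f_X general. It computes f_Y once from the two-branch formula and once by numerically differentiating the cdf F_Y(y) = F_X(sqrt(y)) - F_X(-sqrt(y)).

```python
import math

def phi(x):
    # Standard normal density (assumed distribution of X for this example).
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def Phi(x):
    # Standard normal cdf, written with the error function.
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def f_Y_branches(y):
    # Sum over the two branches x = +sqrt(y) and x = -sqrt(y).
    if y <= 0:
        return 0.0
    r = math.sqrt(y)
    return (phi(r) + phi(-r)) / (2 * r)

def F_Y(y):
    # cdf method: P(X^2 <= y) = P(-sqrt(y) <= X <= sqrt(y)).
    if y <= 0:
        return 0.0
    r = math.sqrt(y)
    return Phi(r) - Phi(-r)

def f_Y_cdf_method(y, eps=1e-5):
    # Central-difference approximation of the derivative of F_Y.
    return (F_Y(y + eps) - F_Y(y - eps)) / (2 * eps)

gaps = [abs(f_Y_branches(y) - f_Y_cdf_method(y)) for y in (0.5, 1.0, 2.0)]
print(gaps)
```

The gaps should be tiny, limited only by the finite-difference step.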

The real reason this formula matters so much, though, is that it generalizes to several variables — and there the cdf method becomes genuinely painful. If you transform a pair (X1, X2) into (Y1, Y2), the single derivative |h'(y)| is replaced by the absolute value of the determinant of a matrix of partial derivatives, the Jacobian. The idea is identical: the Jacobian determinant measures how much the transformation stretches or squashes a tiny patch of area (or volume), and probability conservation forces the density to scale by that factor. We meet this multivariate transformation rule properly later; for now just hold onto the picture that the Jacobian is the multi-dimensional version of |dx/dy|.
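As a preview of the two-variable picture, here is a minimal sketch under assumed choices that are not from the text: X1 and X2 are independent standard normals, and the map is (Y1, Y2) = (X1 + X2, X1 - X2). The density from the Jacobian rule is compared with the known answer for this map, namely that Y1 and Y2 are independent normals with mean 0 and variance 2.

```python
import math

def f_X(x1, x2):
    # Joint density of two independent standard normals (assumed example).
    return math.exp(-0.5 * (x1 * x1 + x2 * x2)) / (2 * math.pi)

def inverse_map(y1, y2):
    # Solves y1 = x1 + x2, y2 = x1 - x2 for (x1, x2).
    return (y1 + y2) / 2, (y1 - y2) / 2

# Matrix of partial derivatives of the inverse map, d(x1, x2)/d(y1, y2).
J = [[0.5, 0.5],
     [0.5, -0.5]]
abs_det_J = abs(J[0][0] * J[1][1] - J[0][1] * J[1][0])

def f_Y(y1, y2):
    x1, x2 = inverse_map(y1, y2)
    return f_X(x1, x2) * abs_det_J

def f_Y_known(y1, y2):
    # Product of two Normal(0, variance 2) densities.
    def n02(y):
        return math.exp(-y * y / 4) / math.sqrt(4 * math.pi)
    return n02(y1) * n02(y2)

points = [(0.0, 0.0), (1.0, -0.5), (-2.0, 1.5)]
max_gap = max(abs(f_Y(*p) - f_Y_known(*p)) for p in points)
print(abs_det_J, max_gap)
```

The map doubles every patch of area, so the density is halved: the factor 1/2 here plays exactly the role that |dx/dy| plays in one dimension.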

One last connection. The whole point of a change of variables is to describe the pushforward: the distribution of Y = g(X), obtained when g "pushes" the law of X forward. A famous payoff is the Box-Muller transform, which feeds two independent uniforms through a clever two-variable map and pushes them forward into two independent standard normals; the Jacobian is what makes that work out to exactly the Gaussian density. Keep in mind the honest boundary of the tool: it is a continuous-variable formula. Probability mass on discrete variables does not get stretched the way density does; you would just relabel and re-sum the masses. And remember throughout that a density is not a probability: the |h'(y)| factor only makes sense because we are converting between densities, not probabilities.
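For the curious, the Box-Muller map can be sketched in a few lines. This uses the standard form of the transform, Z1 = sqrt(-2 ln U1) * cos(2 pi U2) and Z2 = sqrt(-2 ln U1) * sin(2 pi U2); the seed and sample size are arbitrary. It checks only sample means, variances and the covariance, which is a sanity check and not a proof of normality.

```python
import math
import random

random.seed(1)                       # fixed seed so the run is repeatable
n = 100_000

z1s, z2s = [], []
for _ in range(n):
    u1 = 1.0 - random.random()       # in (0, 1], avoids log(0)
    u2 = random.random()
    r = math.sqrt(-2.0 * math.log(u1))
    theta = 2.0 * math.pi * u2
    z1s.append(r * math.cos(theta))
    z2s.append(r * math.sin(theta))

def mean(values):
    return sum(values) / len(values)

mean1, mean2 = mean(z1s), mean(z2s)
var1 = mean([z * z for z in z1s]) - mean1 ** 2
var2 = mean([z * z for z in z2s]) - mean2 ** 2
cov = mean([a * b for a, b in zip(z1s, z2s)]) - mean1 * mean2

print(mean1, var1, mean2, var2, cov)
```

Means near 0, variances near 1 and covariance near 0 are what two independent standard normals should produce.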