Joint and Marginal Distributions

Why one variable is rarely the whole story

In the rungs behind you, a single random variable X carried its own complete description — a pmf if discrete, a density if continuous, and a cdf either way. But the interesting questions in the world almost always involve two or more numbers read off the *same* experiment. Height and weight of the same person. The high and the low temperature on the same day. The number of customers who walk in and the number who buy. When two quantities come from one experiment, asking about each alone throws away the most useful thing of all: how they move together.

The object that keeps every bit of that information is the joint distribution of the pair (X, Y). Think of it as a single experiment that produces *two* numbers at once, and the joint distribution as the full rulebook for that pair. This guide is the foundation for the entire rung: independence, covariance, correlation, and the variance of a sum are all just questions you ask *of* the joint distribution. Get this picture clean and the rest of the rung becomes reading the same object from different angles.

The joint pmf: a table, not a list

Start discrete, where you can see everything. For a pair (X, Y) of discrete variables, the joint pmf is the function p(x, y) = P(X = x and Y = y) — the probability that X lands on x *and* Y lands on y at the same time. Where a single variable's pmf was a one-row list of point-masses, a joint pmf is a two-dimensional table: rows for the values of X, columns for the values of Y, and one probability sitting in each cell. Like any probability assignment, the entries are nonnegative and they must add up to 1 across the whole table.

Here is a tiny concrete one. Flip a fair coin twice; let X be the number of heads on the first flip (0 or 1) and Y the total number of heads over both flips (0, 1, or 2). Each of the four outcomes HH, HT, TH, TT has probability 1/4, and we just sort them into the right cell: HH goes to (1, 2), HT to (1, 1), TH to (0, 1), and TT to (0, 0). The table below shows where each quarter lands. Notice that p(0, 2) = 0: you cannot get a total of two heads if the first flip was a tail. Likewise p(1, 0) = 0, since a head on the first flip rules out a total of zero. The joint distribution records each impossibility as an honest zero in the cell.

p(x, y)           |  Y=0    Y=1    Y=2   |  P(X=x) (row sum)
------------------+----------------------+------------------
X=0 (tail)        |  1/4    1/4     0    |   1/2
X=1 (head)        |   0     1/4    1/4   |   1/2
------------------+----------------------+------------------
P(Y=y) (col sum)  |  1/4    1/2    1/4   |    1   (grand total)

The joint pmf of (X, Y) for two coin flips. The interior is the joint distribution; the margins of the table are literally the marginal distributions of X and Y.
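As a quick check on the table, the sorting of outcomes into cells can be sketched in Python. This is only an illustration: the names `joint` and `p` are made up for the example, and `Fraction` is used so the quarters stay exact.

```python
from fractions import Fraction
from itertools import product

# Enumerate the four equally likely outcomes of two fair coin flips.
joint = {}
for first, second in product("HT", repeat=2):
    x = 1 if first == "H" else 0           # X: heads on the first flip
    y = x + (1 if second == "H" else 0)    # Y: total heads over both flips
    joint[(x, y)] = joint.get((x, y), Fraction(0)) + Fraction(1, 4)

def p(x, y):
    """Joint pmf; cells that never occur are absent from the dict, so return 0."""
    return joint.get((x, y), Fraction(0))

for x in (0, 1):
    print(x, [str(p(x, y)) for y in (0, 1, 2)])
# 0 ['1/4', '1/4', '0']
# 1 ['0', '1/4', '1/4']
```

The two printed rows match the interior of the table, including the two impossible cells.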

Marginals: squeezing the table flat

Suppose you only care about X and want to forget Y entirely. The distribution of X on its own is called its marginal distribution, and the name is wonderfully literal: it comes from writing the row sums in the *margin* of the table. To get P(X = x), you add up the whole row for that x — you sum over every possible value of Y, because X = x can happen alongside Y being anything. In our table, P(X = 0) = 1/4 + 1/4 + 0 = 1/2 and P(X = 1) = 0 + 1/4 + 1/4 = 1/2. The column sums give the marginal of Y the same way: P(Y = 0) = 1/4, P(Y = 1) = 1/2, P(Y = 2) = 1/4.
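The row and column sums are easy to reproduce in code. The sketch below stores the coin table as a dictionary keyed by (x, y); the helper name `marginal` and its `axis` argument are assumptions made for this example, not notation from the guide.

```python
from fractions import Fraction

q = Fraction(1, 4)
# Joint pmf of the two-flip example, keyed by (x, y).
joint = {(0, 0): q, (0, 1): q, (0, 2): Fraction(0),
         (1, 0): Fraction(0), (1, 1): q, (1, 2): q}

def marginal(joint, axis):
    """Sum out the other variable: axis=0 keeps X, axis=1 keeps Y."""
    out = {}
    for key, prob in joint.items():
        out[key[axis]] = out.get(key[axis], Fraction(0)) + prob
    return out

p_x = marginal(joint, 0)   # row sums
p_y = marginal(joint, 1)   # column sums
```

Here `p_x` holds 1/2 for each value of X, and `p_y` holds 1/4, 1/2, 1/4 for Y = 0, 1, 2, the same numbers as the margins of the table.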

The rule "sum out the variable you do not want" is the whole idea of a marginal, and it is the discrete cousin of integration. For continuous variables with a joint density f(x, y), you get the marginal density of X by integrating Y away: f_X(x) = the integral of f(x, y) over all y. Sometimes this is called *marginalizing out* Y. Either way the geometry is the same picture: you are collapsing a two-dimensional landscape of probability down onto one axis, letting the probability along the other axis pile up wherever it lands.
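Written out in symbols, the continuous rule looks like this. The worked density f(x, y) = x + y on the unit square is an illustrative choice for this sketch, not an example taken from the guide.

```latex
% Marginal density of X: integrate the joint density over all y.
f_X(x) = \int_{-\infty}^{\infty} f(x, y)\, dy

% Illustrative example (assumed): f(x, y) = x + y for 0 \le x, y \le 1, and 0 elsewhere.
f_X(x) = \int_{0}^{1} (x + y)\, dy = x + \tfrac{1}{2}, \qquad 0 \le x \le 1

% Sanity check: the marginal is itself a density.
\int_{0}^{1} \left( x + \tfrac{1}{2} \right) dx = \tfrac{1}{2} + \tfrac{1}{2} = 1
```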

Conditionals: slicing instead of squeezing

A marginal squeezes the whole table flat. A conditional distribution does the opposite: it picks out a single slice and zooms in. The question "given that Y = 1, how is X distributed?" means: look only at the column Y = 1, then rescale that column so its entries add to 1 again. This is exactly the conditional probability you already know, applied value by value: P(X = x given Y = y) = P(X = x and Y = y) / P(Y = y), which in table language is just "the cell divided by its column sum."

1. Fix the condition: we are told Y = 1, so look only at the Y = 1 column. Its cells are p(0,1) = 1/4 and p(1,1) = 1/4.
2. Find the column total, which is the marginal P(Y = 1) = 1/4 + 1/4 = 1/2. This is the new "whole world" we are living in.
3. Rescale each cell by that total: P(X = 0 given Y = 1) = (1/4)/(1/2) = 1/2 and P(X = 1 given Y = 1) = (1/4)/(1/2) = 1/2.
4. Check it is a genuine distribution: the rescaled column adds to 1/2 + 1/2 = 1. Given one head in total, the first flip was equally likely heads or tails.
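The slice-and-rescale recipe above can be sketched as a small function. The name `conditional_x_given_y` is hypothetical, and the explicit error for a zero column total is a design choice of this sketch rather than something the guide prescribes.

```python
from fractions import Fraction

q = Fraction(1, 4)
# Joint pmf of the two-flip example, keyed by (x, y).
joint = {(0, 0): q, (0, 1): q, (0, 2): Fraction(0),
         (1, 0): Fraction(0), (1, 1): q, (1, 2): q}

def conditional_x_given_y(joint, y):
    """Return P(X = x given Y = y): the Y = y column divided by its total."""
    column = {x: prob for (x, yy), prob in joint.items() if yy == y}
    total = sum(column.values())           # the marginal P(Y = y)
    if total == 0:
        raise ValueError("cannot condition on an event of probability zero")
    return {x: prob / total for x, prob in column.items()}

given_one = conditional_x_given_y(joint, 1)
```

For the coin table, `given_one` assigns 1/2 to each value of X, while conditioning on Y = 0 puts all the probability on X = 0.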

The same recipe works for continuous variables, where it gives the conditional density f(x given y) = f(x, y) / f_Y(y) — the joint density along the slice y, divided by the marginal value at that y so the slice integrates to 1. Conditioning is the engine of prediction: knowing Y reshapes what you expect of X, and the average of that reshaped distribution is the conditional expectation E[X given Y], a tool you will lean on heavily when this rung reaches the law of total expectation and the variance of a sum.
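For the discrete coin example, the conditional expectation can be computed directly from the table. The sketch below uses made-up helper names, and it also checks numerically that weighting E[X given Y = y] by P(Y = y) gives back E[X], which is the law of total expectation mentioned above.

```python
from fractions import Fraction

q = Fraction(1, 4)
# Joint pmf of the two-flip example, keyed by (x, y).
joint = {(0, 0): q, (0, 1): q, (0, 2): Fraction(0),
         (1, 0): Fraction(0), (1, 1): q, (1, 2): q}

def prob_y(joint, y):
    """Marginal P(Y = y): the column total."""
    return sum(prob for (_, yy), prob in joint.items() if yy == y)

def cond_expectation_x(joint, y):
    """E[X given Y = y]: average of x over the rescaled Y = y column."""
    weighted = sum(x * prob for (x, yy), prob in joint.items() if yy == y)
    return weighted / prob_y(joint, y)

e_x_given = {y: cond_expectation_x(joint, y) for y in (0, 1, 2)}

# Average the conditional expectations with weights P(Y = y).
e_x_total = sum(e_x_given[y] * prob_y(joint, y) for y in (0, 1, 2))

# Direct computation of E[X] from the joint table, for comparison.
e_x_direct = sum(x * prob for (x, _), prob in joint.items())
```

Both `e_x_total` and `e_x_direct` come out to 1/2, and the conditional expectations are 0, 1/2, and 1 for Y = 0, 1, 2.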

How the three views fit together

It pays to see the joint, the marginal, and the conditional as three views of one object rather than three separate ideas. The joint p(x, y) is the master table. A marginal *flattens* it (sum out a variable). A conditional *slices* it (fix one variable, rescale). And they are tied together by the relation joint = conditional × marginal: p(x, y) = P(X = x given Y = y) · P(Y = y). That single identity is just the multiplication rule for probabilities wearing new clothes, and it lets you build a joint distribution out of "the distribution of Y, then the distribution of X given Y" — the natural way to model a chain of cause and effect.
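To see the multiplication rule build a table, the sketch below reconstructs the coin joint in the order the experiment actually happens: first flip X, then total Y given X. This uses the same identity with the roles of X and Y swapped, p(x, y) = P(Y = y given X = x) · P(X = x); the function name `p_y_given_x` is an assumption for the example.

```python
from fractions import Fraction

half = Fraction(1, 2)
p_x = {0: half, 1: half}                   # first flip: tail (0) or head (1)

def p_y_given_x(y, x):
    """Y is x plus one more fair flip, so Y is x or x + 1, each with probability 1/2."""
    return half if y in (x, x + 1) else Fraction(0)

# joint = conditional * marginal, cell by cell.
joint = {(x, y): p_y_given_x(y, x) * p_x[x]
         for x in (0, 1) for y in (0, 1, 2)}
```

The resulting dictionary matches the table cell for cell, zeros included, which is one way to confirm that the chain description and the table describe the same experiment.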

This also previews the headline of the next guide. Sometimes knowing Y tells you nothing new about X — every slice of the table has the same *shape* as the marginal. In that special case the conditional equals the marginal, P(X = x given Y = y) = P(X = x), and the joint factors cleanly as p(x, y) = P(X = x) · P(Y = y). That is exactly independence of random variables, the subject of guide 2. Our coin example is *not* independent: given Y = 0, X must be 0, so learning Y did change what we knew about X. Independence is the rare, clean case where the slices never change shape.
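The factoring test can be checked mechanically. In this sketch, `is_independent` is a hypothetical helper, and the second table (`two_flips`, the first flip paired with the second flip rather than with the total) is an illustrative contrast that is not part of the guide's example.

```python
from fractions import Fraction

def is_independent(joint):
    """True when every cell equals the product of its row and column marginals."""
    xs = {x for x, _ in joint}
    ys = {y for _, y in joint}
    p_x = {x: sum(joint.get((x, y), Fraction(0)) for y in ys) for x in xs}
    p_y = {y: sum(joint.get((x, y), Fraction(0)) for x in xs) for y in ys}
    return all(joint.get((x, y), Fraction(0)) == p_x[x] * p_y[y]
               for x in xs for y in ys)

q = Fraction(1, 4)
# The guide's table: X = first flip, Y = total heads (missing cells are zero).
coin = {(0, 0): q, (0, 1): q, (1, 1): q, (1, 2): q}
# Illustrative contrast (assumed): first flip paired with second flip.
two_flips = {(a, b): q for a in (0, 1) for b in (0, 1)}

print(is_independent(coin))       # False
print(is_independent(two_flips))  # True
```

The coin table fails at the very first cell: p(0, 0) = 1/4, but P(X = 0) · P(Y = 0) = 1/2 · 1/4 = 1/8.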

Honest fine print and the road ahead

Two cautions worth carrying forward. First, the continuous warning from the previous rung still bites here, doubled: a joint density f(x, y) is not a probability and can exceed 1; probability is *volume* under the surface over a region, not the height at a point, and any single exact point (x, y) has probability zero. Second, conditioning on Y = y when Y is continuous is delicate, since P(Y = y) = 0 and you cannot literally divide by it — the conditional density is the well-defined limit that repairs this, but it is a genuine subtlety, not an obvious move. Both points are the same old lesson: in the continuous world, probability lives in areas and volumes, never at points.
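A small worked case makes the first caution concrete. The uniform density on a half-unit square below is an assumed example chosen for illustration, not one from the guide.

```latex
% Illustrative density (assumed): uniform on the square 0 \le x, y \le 1/2.
f(x, y) = 4 \quad \text{for } 0 \le x \le \tfrac{1}{2},\ 0 \le y \le \tfrac{1}{2}, \qquad f(x, y) = 0 \text{ elsewhere}

% The height is 4, well above 1, yet the total volume is 1.
\int_{0}^{1/2} \int_{0}^{1/2} 4 \, dx \, dy = 4 \cdot \tfrac{1}{4} = 1

% Probability of a region is volume over that region, not height.
P\left( X \le \tfrac{1}{4},\ Y \le \tfrac{1}{4} \right) = \int_{0}^{1/4} \int_{0}^{1/4} 4 \, dx \, dy = 4 \cdot \tfrac{1}{16} = \tfrac{1}{4}
```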

With the joint, marginal, and conditional firmly in hand, the rest of the rung becomes a tour of questions you ask of one master table. Guide 2 asks when the table *factors*, which is independence. Guides 3 and 4 ask how to *measure* the dependence with one number, through covariance and the correlation coefficient, and warn that a zero there does not prove independence. Guide 5 uses conditioning to compute the variance of a sum and to state the laws of total expectation and total variance. Every one of them lives inside the picture you just built: the joint distribution, read by flattening or by slicing.