Monte Carlo Simulation

The big idea: hard math, easy averaging

Suppose you must compute the average of some quantity g(X) where X is random — say the expected payoff of a complicated bet, or an integral that no formula will crack. The previous guide showed how a Bayesian posterior is often known only up to a messy constant, and that integrals against it are exactly the things we cannot do by hand. Monte Carlo simulation answers with a wonderfully blunt strategy: instead of solving the integral, draw many random samples and average. If you cannot do the calculus, let the dice do it for you.

Here is the engine, and it is one you already own from earlier rungs. By the law of the unconscious statistician, the thing we want is an expectation: E[g(X)] is the true average of g(X) over the randomness of X. By the [[strong-law-of-large-numbers|strong law of large numbers]], if you draw independent samples X_1, ..., X_N from the distribution of X and average g over them, that sample average converges to the true expectation as N grows. So the recipe is: sample, apply g, average. The estimator is the empirical mean (g(X_1) + ... + g(X_N)) / N, and it homes in on E[g(X)] almost surely.

Why "Monte Carlo"? The name is a wartime in-joke: Stanislaw Ulam and John von Neumann, working at Los Alamos in the 1940s on neutron diffusion, needed a code word for their secret method of solving physics by random sampling, and Ulam's uncle liked to gamble at the Monte Carlo casino. The flippant name stuck to one of the most useful computational ideas of the century. Every time a quantity is an expectation in disguise, Monte Carlo can attack it.

A concrete picture: estimating pi by throwing darts

The classic toy makes the whole idea visible. Draw a square of side 2 centred at the origin, and inscribe a circle of radius 1. The square has area 4 and the circle has area pi, so the fraction of the square covered by the circle is pi / 4. Now throw darts uniformly at random into the square: each dart's chance of landing inside the circle is exactly that fraction, pi / 4. This is just geometric probability turned into an experiment we can run.

So generate N points (x, y) with x and y each uniform on [-1, 1], count how many satisfy x^2 + y^2 <= 1, and form the fraction p-hat = (hits) / N. By the law of large numbers p-hat tends to pi / 4, so 4 * p-hat is a Monte Carlo estimate of pi. Throw 1000 darts and you might get something like 3.10 or 3.18; throw a million and you will reliably see 3.14-ish. Notice what just happened: a question about the constant pi — pure deterministic mathematics — got answered by counting random events. That reversal is the whole magic of the method.

fraction inside circle  =  (area of circle) / (area of square)  =  pi / 4

Monte Carlo estimator   :  p-hat   = (hits) / N
                           pi-hat  = 4 * p-hat   ->  pi   as N -> infinity

The dart-throwing estimate of pi: a geometric ratio becomes a count, and the count becomes a number.

How fast does it converge? The square-root law

Knowing the estimate is correct "in the limit" is not enough — we need to know how close it is after a finite N. Here the [[prob-central-limit-theorem|central limit theorem]] answers precisely. The Monte Carlo estimator is an average of N independent copies of g(X), so if g(X) has variance sigma^2, the estimator has variance sigma^2 / N. Its standard error — the typical size of the error — is therefore sigma / sqrt(N). The error shrinks, but only like one over the square root of N.

That square-root law is the central fact of practical Monte Carlo, and it is sobering. To halve the error you must quadruple N; to add one more decimal digit of accuracy (a tenfold error reduction) you need a hundredfold more samples. This is why our dart estimate of pi wobbles around 3.1 even at N = 1000: with the binomial variance here, the standard error is roughly 0.05, so being off by 0.04 is utterly ordinary. The flip side is genuinely remarkable, though: that sigma / sqrt(N) rate does not contain the number of dimensions at all. Classical grid-based integration needs exponentially many points as dimensions grow, but Monte Carlo's error rate is the same in 2 dimensions or 2000. That dimension-independence is why it dominates high-dimensional problems.

Where do the random numbers come from?

Monte Carlo needs a fountain of samples, but a computer is deterministic and cannot truly flip coins. The standard trick is to start from one workhorse: a pseudo-random number generator that produces a long stream of values that look uniform on [0, 1] and statistically independent, even though they are generated by a fixed formula from a seed. (Because the seed and the formula are fixed, the stream is exactly reproducible — a feature, since it lets you re-run a simulation and debug it.) From that uniform stream, every other distribution is manufactured.

The bridge from uniform to anything else is the [[inverse-transform-method|inverse transform method]], which falls straight out of the probability integral transform you met earlier. If F is the cumulative distribution function you want and U is uniform on [0, 1], then F-inverse(U) — feeding a uniform number through the inverse CDF — has exactly the distribution F. For an exponential distribution with rate lambda, F(x) = 1 - e^(-lambda x), and inverting gives the tidy rule X = -ln(U) / lambda: a single uniform draw becomes an exponential draw. For distributions whose CDF cannot be inverted in closed form — the normal is the famous case — there are dedicated tricks like the Box-Muller transform, which turns two uniforms into two independent standard normals.

Making it sharper, and the bridge to statistics

Because the error is sigma / sqrt(N), there are two levers: raise N or lower sigma. Raising N just costs computer time; the cleverer art is [[variance-reduction|variance reduction]] — redesigning the estimator so the same N buys a smaller sigma. A simple example is antithetic sampling: for each uniform U also use its mirror 1 - U; when g is monotone these paired estimates are negatively correlated, so their average has lower variance than two independent draws. Other schemes (control variates, importance sampling, stratification) all chase the same goal: keep the unbiasedness, shrink the spread. A well-chosen variance-reduction scheme can be worth a hundredfold increase in N for free.

Write your quantity as an expectation E[g(X)] for some random X and function g — the modelling step.
Draw N independent samples X_1, ..., X_N from the distribution of X (uniforms via the generator, then the inverse transform or a dedicated method).
Average: the estimate is (g(X_1) + ... + g(X_N)) / N, which converges to E[g(X)] by the law of large numbers.
Report uncertainty: estimate sigma from the sample and quote the standard error sigma / sqrt(N) (a confidence band by the CLT). Never hand over a Monte Carlo number with no error bar.

Monte Carlo also quietly reverses the usual flow of probability and so opens the door to statistics. Normally we know a model and compute outcomes; here we run the model forward many times to learn properties we could not derive. The very same move powers the [[bootstrap|bootstrap]], which the From-Probability-to-Statistics guide develops: treat your one dataset as a population, resample from it thousands of times, and read off the variability of a statistic directly. And when even drawing independent samples from a target is impossible — the typical fate of a complicated Bayesian posterior — the next guide introduces Markov chain Monte Carlo, which builds a clever random walk whose long-run visiting frequencies match the distribution you cannot sample directly. Plain Monte Carlo is the foundation that all of those rest upon.