Statistics 1935

The Design of Experiments

Ronald A. Fisher

Randomise the experiment, and let the data risk proving you wrong.

Choose your version

In depth · the introduction

A statistician, a summer afternoon, and a single cup of tea became the moment science learned how to run a fair experiment.

The idea, unpacked

Before Fisher, experiments were often a muddle: change something, see what happens, then argue about whether the result was real. Fisher sharpened the question. Instead of asking 'is my idea true?', he asked 'could pure chance have produced a result this striking?' You begin by stating a deliberately dull claim — the null hypothesis, that there is no real effect and it is all luck — and then build an experiment that gives the facts a fair chance to overthrow it.

His second move was randomisation. By deciding the order and layout of an experiment by chance — a coin, a shuffle — you turn luck from your enemy into your measuring stick. Now you can calculate exactly how often chance alone would fool you, and choose to believe an effect only when it beats those odds.

Where it came from

The scene is the Rothamsted agricultural research station in England in the 1920s. A colleague, the algae scientist Muriel Bristol, insisted that a cup of tea tasted different depending on whether the milk or the tea was poured in first. Fisher thought it nonsense; another scientist present, William Roach, said simply: let's put her to the test.

Fisher had spent years at Rothamsted wrestling messy crop-trial data into shape, inventing the analysis of variance and the rules of good experimental design as he went. In that tea party he saw an entire philosophy of experiment in miniature — and in 1935 he opened his book The Design of Experiments with it.

Why it mattered

It handed every science a shared, honest procedure for telling a real signal from a lucky fluke, and a shared language — null hypothesis, significance, p-value — to argue in. Above all it placed randomisation at the heart of credible evidence. The randomised controlled trial, the reason we can trust that a medicine truly works rather than merely seeming to, is Fisher's cup of tea grown all the way up.

A sharpened coin-toss

Suppose a friend claims they can call a coin before it lands. One correct call proves nothing — anyone is right half the time. But ten correct calls in a row? Chance would manage that only about once in a thousand tries, so now you start to believe them. Fisher's tea test is exactly this, made precise: with eight cups there are seventy ways to split them into two fours, so a perfect sorting happens by luck only one time in seventy — rare enough to take seriously. Be the lady yourself below.

What came before and after

Fisher built on the older mathematics of probability — including bayes-1763 — but turned it in a new, practical direction: not updating beliefs, but designing experiments. His significance testing was soon challenged by Jerzy Neyman and Egon Pearson, whose rival 'error-rate' framework still competes with it today, and by the Bayesians. Yet the randomised experiment he championed became the gold standard of evidence, from medicine to economics. When you read that a finding is 'statistically significant', you are reading Fisher.

The original document

Original source text

R. A. Fisher · The Design of Experiments · Oliver and Boyd, Edinburgh · 1935

Chapter II · The Principles of Experimentation, Illustrated by a Psycho-Physical Experiment

A lady declares that by tasting a cup of tea made with milk she can discriminate whether the milk or the tea infusion was first added to the cup.

Fisher sets out the test. Eight cups are prepared — four with the milk poured first, four with the tea poured first — and presented to the lady in random order; she is told that there are four of each kind, and asked to divide them into the two groups of four. The whole logic of the experiment then turns on a piece of counting.

There are C(8,4) = 70 ways of choosing four cups out of eight. If she cannot in fact discriminate, every one of those 70 divisions is equally likely, so the chance that she names exactly the right four is just 1 in 70 — about 0.014. Anything short of a perfect score is something pure guessing would readily produce.

The null hypothesis (p. 18)

the null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation. Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis.

The cups are presented in random order not as a courtesy but as the foundation of the test: it is the experimenter's own act of randomisation that makes the 70 arrangements equally probable, and so manufactures the very yardstick against which the result is judged. The same act scatters uncontrolled nuisances — a cup gone cold, the order of tasting — at random across the comparison instead of letting them line up with it.

[ … ]

What the book builds from the cup

From this miniature Fisher develops the toolkit of experimental design: replication (to measure the experiment's own error), local control and the Latin square (to sweep out known gradients), and the factorial experiment (varying several factors at once to read off their interactions), with the analysis of variance as the arithmetic that partitions the result. The complete book runs to some 250 pages and is available in full at the source below.

Galton Laboratory, University College London · 1935