Statistics · 1935

The Design of Experiments

Ronald Fisher

Randomise your experiment, and give the data a chance to prove you wrong.

In depth · the introduction

A statistician, a summer afternoon, and a cup of tea: together they became the moment science learned how to run a fair experiment.

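Taking the idea apart

Before Fisher, experiments were often a tangle: change something, see what happens, then argue endlessly over whether the result was real. Fisher sharpened the question. He asked not "Is my idea right?" but "Could pure luck have produced a result this striking?" You begin by stating a deliberately dull claim, the null hypothesis: there is no real effect, and everything is chance. Then you build an experiment that gives the facts a fair chance to overturn it.

His second move was randomisation. Let chance decide the order and layout of the experiment (toss a coin, shuffle a deck) and luck turns from an enemy into a yardstick. You can now calculate exactly how often luck alone would fool you, and you accept an effect only when it beats those odds.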

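Where it came from

The setting is Rothamsted Experimental Station in England in the 1920s. A colleague, the algae researcher Muriel Bristol, insisted that a cup of tea tastes different depending on whether the milk or the tea is poured first. Fisher thought it nonsense; another scientist present, William Roach, simply said: then let us test her.

Fisher spent years at Rothamsted wrestling messy crop-trial data into shape, inventing along the way the analysis of variance and a set of rules for sound experimental design. In that tea party he saw a whole philosophy of experimentation in miniature, and in 1935 he used it as the opening example of The Design of Experiments.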

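Why it matters

It gave every science a shared, honest procedure for telling a real signal from a lucky coincidence, and a shared vocabulary (null hypothesis, significance, p-value) in which to argue. Most important of all, it put randomisation at the centre of credible evidence. The randomised controlled trial, the reason we can trust that a drug truly works rather than merely seems to, is Fisher's cup of tea fully grown up.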

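A sharpened coin

Suppose a friend claims he can call a coin before it lands. One correct call proves nothing: anyone guesses right half the time. But ten correct calls in a row? Luck alone manages that only once in 1,024 tries, so you start to believe him. Fisher's tea test is the same idea made exact: there are seventy ways to divide eight cups into two groups of four, so luck alone gets every cup right only once in seventy tries, rare enough to take seriously.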
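The two figures in this passage can be checked with a few lines of exact arithmetic. The following is only a minimal sketch; the variable names are illustrative and do not come from Fisher's text.

```python
from fractions import Fraction
from math import comb

# Chance of calling ten fair coin flips correctly by guessing alone.
p_ten_calls = Fraction(1, 2) ** 10

# Number of ways to pick which four of the eight cups had milk poured first.
n_divisions = comb(8, 4)

# Exactly one of those divisions is entirely correct.
p_all_cups = Fraction(1, n_divisions)

print(p_ten_calls)   # 1/1024
print(n_divisions)   # 70
print(p_all_cups)    # 1/70
```

Using exact fractions rather than floating point keeps the "once in 1,024" and "once in seventy" readings visible in the output.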

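Before and after

Fisher stood on older probability mathematics, including Thomas Bayes's essay of 1763, but turned it in a new, practical direction: not updating beliefs but designing experiments. His significance tests were soon challenged by Jerzy Neyman and Egon Pearson, whose rival framework of error rates still competes alongside his, and by the Bayesians as well. Yet the randomised experiment he championed became the gold standard of evidence from medicine to economics. When you read that a finding is "statistically significant", you are reading Fisher.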

The original document

Original source text

R. A. Fisher · The Design of Experiments · Oliver and Boyd, Edinburgh · 1935

Chapter II · The Principles of Experimentation, Illustrated by a Psycho-Physical Experiment

A lady declares that by tasting a cup of tea made with milk she can discriminate whether the milk or the tea infusion was first added to the cup.

Fisher sets out the test. Eight cups are prepared — four with the milk poured first, four with the tea poured first — and presented to the lady in random order; she is told that there are four of each kind, and asked to divide them into the two groups of four. The whole logic of the experiment then turns on a piece of counting.

There are C(8,4) = 70 ways of choosing four cups out of eight. If she cannot in fact discriminate, every one of those 70 divisions is equally likely, so the chance that she names exactly the right four is just 1 in 70 — about 0.014. Anything short of a perfect score is something pure guessing would readily produce.
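The claim about imperfect scores can be made concrete by tabulating the whole guessing distribution, not just the perfect score. The sketch below assumes the standard hypergeometric count: a division with k milk-first cups named correctly can be formed in C(4,k) × C(4,4−k) ways. The function name `null_distribution` is a hypothetical helper, not something from the book.

```python
from fractions import Fraction
from math import comb

def null_distribution(n_cups=8, n_milk_first=4):
    """Chance of each count of correctly named milk-first cups under pure guessing."""
    total = comb(n_cups, n_milk_first)
    n_tea_first = n_cups - n_milk_first
    dist = {}
    for k in range(n_milk_first + 1):
        # Choose k of the true milk-first cups and fill the rest of the
        # guess with tea-first cups.
        ways = comb(n_milk_first, k) * comb(n_tea_first, n_milk_first - k)
        dist[k] = Fraction(ways, total)
    return dist

dist = null_distribution()
for k, p in dist.items():
    print(k, p, round(float(p), 3))

# Chance of three or more correct by guessing alone.
p_three_or_more = dist[3] + dist[4]
print(p_three_or_more)  # 17/70
```

Under these assumptions, three or more correct happens by luck 17 times in 70, roughly one time in four, which is why only the perfect score counts as evidence in an eight-cup test.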

The null hypothesis (p. 18)

the null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation. Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis.

The cups are presented in random order not as a courtesy but as the foundation of the test: it is the experimenter's own act of randomisation that makes the 70 arrangements equally probable, and so manufactures the very yardstick against which the result is judged. The same act scatters uncontrolled nuisances — a cup gone cold, the order of tasting — at random across the comparison instead of letting them line up with it.
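A small simulation illustrates how the experimenter's own shuffle produces the 1-in-70 yardstick. This is a sketch under stated assumptions: the taster is modelled as someone who always names the same four positions (that is, someone who cannot discriminate), and the names `randomised_layout` and `score`, the seed, and the number of trials are all illustrative choices.

```python
import random

def randomised_layout(rng):
    """Return a random serving order: True marks a milk-first cup."""
    cups = [True] * 4 + [False] * 4
    rng.shuffle(cups)
    return cups

def score(layout, guess):
    """Count the milk-first cups among the positions named in the guess."""
    return sum(1 for i in guess if layout[i])

rng = random.Random(1935)   # seed chosen only to make the sketch repeatable
fixed_guess = (0, 1, 2, 3)  # a taster who always names the first four cups
trials = 70_000

# Re-randomise the serving order each time and count the perfect scores.
hits = sum(score(randomised_layout(rng), fixed_guess) == 4 for _ in range(trials))
print(hits / trials)  # should land close to 1/70, about 0.014
```

Because the shuffle, not the taster, supplies the randomness, the same rate holds for any fixed guessing habit the taster might have.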

[ … ]

What the book builds from the cup

From this miniature Fisher develops the toolkit of experimental design: replication (to measure the experiment's own error), local control and the Latin square (to sweep out known gradients), and the factorial experiment (varying several factors at once to read off their interactions), with the analysis of variance as the arithmetic that partitions the result. The complete book runs to some 250 pages and is available in full at the source below.
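As an illustration of one of these tools, a Latin square lays out n treatments so that each appears exactly once in every row and every column of a field. The sketch below is an assumption-laden simplification, not Fisher's own selection procedure: it starts from a cyclic square and shuffles rows, columns, and treatment labels, which does not sample uniformly from all possible Latin squares. The function names are hypothetical.

```python
import random

def random_latin_square(treatments, rng):
    """Build a Latin square by shuffling a cyclic base square (a simplification)."""
    n = len(treatments)
    # Cyclic base square: row r is 0..n-1 rotated by r places.
    square = [[(r + c) % n for c in range(n)] for r in range(n)]
    rng.shuffle(square)                  # permute rows
    col_order = list(range(n))
    rng.shuffle(col_order)               # permute columns
    square = [[row[c] for c in col_order] for row in square]
    labels = list(treatments)
    rng.shuffle(labels)                  # assign treatments to symbols at random
    return [[labels[x] for x in row] for row in square]

def is_latin(square):
    """Check that every row and every column contains each symbol exactly once."""
    n = len(square)
    symbols = set(square[0])
    rows_ok = all(len(row) == n and set(row) == symbols for row in square)
    cols_ok = all({square[r][c] for r in range(n)} == symbols for c in range(n))
    return len(symbols) == n and rows_ok and cols_ok

rng = random.Random(1935)  # seed chosen only to make the sketch repeatable
square = random_latin_square(["A", "B", "C", "D"], rng)
for row in square:
    print(" ".join(row))
print(is_latin(square))  # True
```

Row and column shuffles preserve the Latin property, so any gradient running along rows or along columns affects every treatment equally.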

Galton Laboratory, University College London · 1935