Statistics 1935

The Design of Experiments

Ronald Fisher

Randomise the experiment, and give the data a chance to prove you wrong.


In depth · the introduction

A statistician, a summer afternoon and a cup of tea: together they became the moment science learned how to run a fair experiment.

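Unpacking the idea

Before Fisher, experiments were often a tangle: change something, see what happens, then argue endlessly over whether the result was real. Fisher sharpened the question. He asked not "Is my idea right?" but "Could pure luck have produced a result this striking?" You first state a deliberately dull claim, the null hypothesis, which says there is no real effect and everything is coincidence. Then you build an experiment that gives the facts a fair chance to overturn it.

His second move was randomisation. Let chance decide the order and layout of the experiment (toss a coin, shuffle a deck) and you turn luck from an enemy into a yardstick. You can now calculate exactly how often luck alone would fool you, and you believe in an effect only when the result is rarer than luck alone would readily produce.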

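Where it came from

The scene is Rothamsted Experimental Station in England in the 1920s. A colleague, the algae researcher Muriel Bristol, insisted that a cup of tea tastes different depending on whether the milk or the tea is poured first. Fisher thought it nonsense; another scientist present, William Roach, said simply: then let's test her.

Fisher spent years at Rothamsted beating messy crop-trial data into shape, inventing along the way the analysis of variance and a set of rules for good experimental design. In that tea party he saw a whole philosophy of experimentation in miniature, and in 1935 he used it as the opening example of The Design of Experiments.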

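Why it matters

It gave every science a shared, honest procedure for telling a real signal from a lucky coincidence, and a shared vocabulary (null hypothesis, significance, p-value) in which to argue. Most important of all, it put randomisation at the centre of credible evidence. The randomised controlled trial, the reason we can trust that a drug really works rather than merely seems to, is Fisher's cup of tea fully grown up.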

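A sharpened coin

Suppose a friend claims he can call heads or tails before the coin lands. One correct call proves nothing; anyone guesses right half the time. But ten correct calls in a row? By luck alone that happens about once in a thousand tries (1 in 1,024), so you start to believe him. Fisher's tea test is the same thing made exact: there are 70 ways to split eight cups into two groups of four, so sorting every cup correctly by luck happens only once in 70 tries, rare enough to take seriously.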
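The two figures in this comparison can be checked with a few lines of arithmetic. The sketch below is only an illustration; the variable names are invented for it, and it assumes a fair coin and a taster who is purely guessing.

```python
from fractions import Fraction
from math import comb

# Chance of calling a fair coin correctly ten times in a row by luck alone.
coin_streak = Fraction(1, 2) ** 10

# Number of ways to choose which four of the eight cups are "milk first".
tea_divisions = comb(8, 4)

# Chance of naming exactly the right four cups by pure guessing.
tea_perfect = Fraction(1, tea_divisions)

print(f"ten correct coin calls: {coin_streak} (about {float(coin_streak):.5f})")
print(f"ways to divide the cups: {tea_divisions}")
print(f"perfect tea score: {tea_perfect} (about {float(tea_perfect):.5f})")
```

Seen this way, a perfect score on eight cups is roughly as surprising as six correct coin calls in a row (1 in 64), and far less demanding than ten.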

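Before and after

Fisher stood on older probability mathematics, including the work of Thomas Bayes (1763), but bent it in a new, practical direction: not updating beliefs but designing experiments. His significance tests were soon challenged by Jerzy Neyman and Egon Pearson, whose rival "error rate" framework still competes alongside them today, and also by Bayesians. Yet the randomised experiment he championed became the gold standard of evidence from medicine to economics. When you read that a finding is "statistically significant", you are reading Fisher.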

The original document

Original source text

R. A. Fisher · The Design of Experiments · Oliver and Boyd, Edinburgh · 1935

Chapter II · The Principles of Experimentation, Illustrated by a Psycho-Physical Experiment

A lady declares that by tasting a cup of tea made with milk she can discriminate whether the milk or the tea infusion was first added to the cup.

Fisher sets out the test. Eight cups are prepared — four with the milk poured first, four with the tea poured first — and presented to the lady in random order; she is told that there are four of each kind, and asked to divide them into the two groups of four. The whole logic of the experiment then turns on a piece of counting.

There are C(8,4) = 70 ways of choosing four cups out of eight. If she cannot in fact discriminate, every one of those 70 divisions is equally likely, so the chance that she names exactly the right four is just 1 in 70, about 0.014. Anything short of a perfect score is something pure guessing would readily produce: picking at least three of the four milk-first cups happens in 17 of the 70 divisions, about 0.24.
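The counting argument extends to every possible score, not just the perfect one. The following sketch tabulates the chance of each outcome under pure guessing; the function names are hypothetical, and "k right" here means that k of the four cups she labels "milk first" really were milk first.

```python
from fractions import Fraction
from math import comb

CUPS_PER_KIND = 4  # four milk-first and four tea-first cups, as in the text


def ways_with_k_right(k, n=CUPS_PER_KIND):
    """Count divisions in which exactly k of her n 'milk first' picks are correct."""
    # Choose k of the n true milk-first cups, and the other n - k picks
    # from the n tea-first cups.
    return comb(n, k) * comb(n, n - k)


def p_at_least(k_observed, n=CUPS_PER_KIND):
    """Chance of doing at least this well by guessing (a one-sided tail probability)."""
    total = comb(2 * n, n)
    favourable = sum(ways_with_k_right(k, n) for k in range(k_observed, n + 1))
    return Fraction(favourable, total)


for k in range(CUPS_PER_KIND + 1):
    print(f"{k} right: {ways_with_k_right(k)} of 70 divisions")

print("P(all four right)    =", p_at_least(4))
print("P(at least three)    =", p_at_least(3))
```

Only the perfect score is rare under guessing; three out of four leaves a tail probability of 17 in 70, which is why the passage treats anything less than perfect as unconvincing.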

The null hypothesis (p. 18)

the null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation. Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis.

The cups are presented in random order not as a courtesy but as the foundation of the test: it is the experimenter's own act of randomisation that makes the 70 arrangements equally probable, and so manufactures the very yardstick against which the result is judged. The same act scatters uncontrolled nuisances — a cup gone cold, the order of tasting — at random across the comparison instead of letting them line up with it.
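One way to see why the experimenter's shuffle supplies the yardstick is to simulate a taster with no ability at all and a fixed habit. In the sketch below, which is an illustration under stated assumptions rather than anything from the book, the hypothetical taster always labels the first four cups served as "milk first". Because the serving order is shuffled, her habit still succeeds in only about 1 run in 70. The seed, trial count and function names are arbitrary choices.

```python
import random
from itertools import combinations

FIXED_GUESS = frozenset({0, 1, 2, 3})  # assumed habit: "the first four are milk first"


def perfect_score(rng):
    """Run one experiment: shuffle the serving order, then apply the fixed guess."""
    cups = ["milk"] * 4 + ["tea"] * 4
    rng.shuffle(cups)  # the experimenter's act of randomisation
    truth = frozenset(i for i, cup in enumerate(cups) if cup == "milk")
    return truth == FIXED_GUESS


def estimate_perfect_rate(trials=70_000, seed=1935):
    """Monte Carlo estimate of how often the fixed guess is entirely correct."""
    rng = random.Random(seed)
    hits = sum(perfect_score(rng) for _ in range(trials))
    return hits / trials


# Exact check: of all possible milk-first layouts, exactly one matches the guess.
layouts = [frozenset(c) for c in combinations(range(8), 4)]
matching = sum(layout == FIXED_GUESS for layout in layouts)

print(f"layouts: {len(layouts)}, matching the fixed guess: {matching}")
print(f"simulated rate: {estimate_perfect_rate():.4f}  (1/70 is about {1 / 70:.4f})")
```

Without the shuffle, an experimenter who habitually served the milk-first cups first would hand this taster a perfect score every time, and the 1-in-70 figure would mean nothing.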

[ … ]

What the book builds from the cup

From this miniature Fisher develops the toolkit of experimental design: replication (to measure the experiment's own error), local control and the Latin square (to sweep out known gradients), and the factorial experiment (varying several factors at once to read off their interactions), with the analysis of variance as the arithmetic that partitions the result. The complete book runs to some 250 pages and is available in full at the source below.

Galton Laboratory, University College London · 1935