JOVANA
Library Glossary Getting Started Three Levels Fields How it works Mission
Join the mission
Back to the library
資訊理論 / 電腦科學 1952

一種構造最小冗餘編碼的方法

大衛·A·霍夫曼

讓常見的符號用短碼、罕見的用長碼——便得到最短的編碼。

Choose your version
In depth · the introduction

如果你能免費地把一條訊息壓得更小,只需給最常見的字母最短的碼——就像摩斯電碼給「E」配上一個單點那樣,會怎樣?

核心想法

電腦把一切都存成位元——0 和 1。偷懶的辦法,是給每個字母同樣數目的位元,比如每個八位。可字母並非同樣常見:在英文裡,「e」和空格不停出現,而「q」和「z」很罕見。那為什麼要給每個字母留同樣大的地方呢?

霍夫曼的方法,把短碼發給常見的符號,把長碼發給罕見的符號,且比例恰到好處,使訊息的平均長度盡可能地短。而且還有一個美妙的保證:沒有任何能被無歧義讀回的編碼,能做得比它更好。這不是一個「通常管用」的聰明把戲——它可證明是最好的。

它是如何誕生的

1951 年,大衛·霍夫曼是麻省理工的一名研究生,正在上羅伯特·法諾開的資訊論課——法諾是克勞德·香農的同事,而香農三年前剛剛開創了整個這門學問。法諾給學生一個選擇:要麼參加期末考試,要麼寫一篇解決某個未決問題的學期論文。那個問題,恰恰就是這一個:找出最高效的這類編碼。法諾與香農各自找到過不錯的方法,可沒人證明哪一個是最好的。

霍夫曼苦苦琢磨了幾週,幾乎已經放棄——正要把筆記扔掉時——關鍵的想法到來了:從底部往上構造編碼,先把兩個最罕見的符號合併起來,而不是像他的老師那樣自頂向下地拆分。這一倒轉,便是整個解答。這名學生,解出了他那位著名教授所未能解出的難題。

它為何重要

霍夫曼編碼,成了有史以來被使用得最廣的演算法之一。每當你打開一個 ZIP 檔案、看一張 JPEG 照片、或播一首 MP3,都有一棵霍夫曼樹在悄然運轉,把符號裝進盡可能少的位元裡。因為它簡單、快速,且在同類編碼中可證明最優,它成了壓縮的預設基石——七十年過去,至今仍是。

一個可以想像的畫面

想想收拾行李箱。你每天都用的東西——手機、牙刷——你想放在最上層,觸手可得;難得用一回的,可以塞到箱底,不那麼好拿。霍夫曼對符號也這麼做:你最常「伸手去拿」的那些,得到最短的碼;而罕見之物,沉到最深、最長的路徑上去。如此一擺放,整隻箱子便裝得盡可能地緊。

一棵覆蓋八個符號 A–H 的可互動霍夫曼樹:拖動滑桿,讓某些符號比別的常見得多,看樹如何重建——常見符號得到短碼、罕見符號得到長碼,平均碼長降到定長碼所用的 3 位元以下。

它的位置

霍夫曼站在克勞德·香農的肩上——香農 1948 年的理論(本館亦有收錄)定義了熵:一條訊息究竟最少需要多少位元的真正下界,並證明那個下界可以被逼近。香農與法諾找到了接近它的編碼;霍夫曼則找到了「逐符號編碼」所能達到的最好那一個。在他之後,來了算術編碼,以及更晚近的 ANS,把界限壓得更貼近香農的極限。從電報、到 ZIP、再到串流影片,這正是一條「把更多意義塞進更少位元」的脈絡。

The original document
Original source text

摘要與問題

David A. Huffman · Proceedings of the IRE, vol. 40, no. 9 · September 1952 · pp. 1098–1101 (manuscript received December 6, 1951)
An optimum method of coding an ensemble of messages consisting of a finite number of members is developed. A minimum-redundancy code is one constructed in such a way that the average number of coding digits per message is minimized.
One important method of transmitting messages is to transmit in their place sequences of symbols. If there are more messages which might be sent than there are kinds of symbols available, then some of the messages must use more than one symbol.
Probably the most familiar ensemble code was stated in the phrase “one if by land and two if by sea.” In this case, the message ensemble consisted of the two individual messages “by land” and “by sea,” and the message codes were “one” and “two.”

兩條基本限制

The following basic restrictions will be imposed on an ensemble code: (a) No two messages will consist of identical arrangements of coding digits. (b) The message codes will be constructed in such a way that no additional indication is necessary to specify where a message code begins and ends once the starting point of a sequence of messages is known.
Restriction (b) necessitates that no message be coded in such a way that its code appears, digit for digit, as the first part of any message code of greater length.
C. E. Shannon and R. M. Fano have developed ensemble coding procedures … Their coding procedures are not optimum, but approach the optimum behavior when N approaches infinity. … However, up to the present time, no definite procedure has been suggested for the construction of such a code to the knowledge of the author. It is the purpose of this paper to derive such a procedure.

最優二進制步驟

Derived coding requirements
For an optimum code, the length of a given message code can never be less than the length of a more probable message code. If this requirement were not met, then a reduction in average message length could be obtained by interchanging the codes for the two messages in question.
Optimum binary code
Restriction (c) makes it necessary that the two least probable messages have codes of equal length. … It will be necessary to assign these two message codes to the Nth and the (N–1)st messages. … Once this has been done, these two messages are equivalent to a single composite message. … Its probability will be the sum of the probabilities of the two messages from which it was created. The ensemble containing this composite message in the place of its two component messages will be called the first auxiliary message ensemble.
The procedure is applied again and again until the number of members in the most recently formed auxiliary message ensemble is reduced to two. One of each of the binary digits is assigned to each of these two composite messages. These messages are then combined to form a single composite message with probability unity, and the coding is complete.
[ … ]
In the worked example (Table II), an ensemble of N = 13 messages is coded; the resulting average message length is Lav = 3.42 binary digits per message — below the four bits a fixed-length code would need, and close to the source entropy.

推廣與致謝

Optimum coding of an ensemble of messages using three or more types of digits is similar to the binary coding procedure. … in order to satisfy restriction (e), it will be required that all these brackets, with the possible exception of one combining the least probable messages of the original ensemble, always combine a number of messages equal to D.
The author is indebted to Dr. W. K. Linvill and Dr. R. M. Fano, both of the Massachusetts Institute of Technology, for their helpful criticism of this paper.