JOVANA
Library Glossary Getting Started Three Levels Fields How it works Mission
Join the mission
Back to the library
信息论 / 计算机科学 1952

一种构造最小冗余编码的方法

大卫·A·霍夫曼

让常见的符号用短码、罕见的用长码——便得到最短的编码。

Choose your version
In depth · the introduction

如果你能免费地把一条消息压得更小,只需给最常见的字母最短的码——就像摩尔斯电码给「E」配上一个单点那样,会怎样?

核心想法

计算机把一切都存成比特——0 和 1。偷懒的办法,是给每个字母同样数目的比特,比如每个八位。可字母并非同样常见:在英文里,「e」和空格不停出现,而「q」和「z」很罕见。那为什么要给每个字母留同样大的地方呢?

霍夫曼的方法,把短码发给常见的符号,把长码发给罕见的符号,且比例恰到好处,使消息的平均长度尽可能地短。而且还有一个美妙的保证:没有任何能被无歧义读回的编码,能做得比它更好。这不是一个「通常管用」的聪明把戏——它可证明是最好的。

它是如何诞生的

1951 年,大卫·霍夫曼是麻省理工的一名研究生,正在上罗伯特·法诺开的信息论课——法诺是克劳德·香农的同事,而香农三年前刚刚开创了整个这门学问。法诺给学生一个选择:要么参加期末考试,要么写一篇解决某个未决问题的学期论文。那个问题,恰恰就是这一个:找出最高效的这类编码。法诺与香农各自找到过不错的方法,可没人证明哪一个是最好的。

霍夫曼苦苦琢磨了几周,几乎已经放弃——正要把笔记扔掉时——关键的想法到来了:从底部往上构造编码,先把两个最罕见的符号合并起来,而不是像他的老师那样自顶向下地拆分。这一倒转,便是整个解答。这名学生,解出了他那位著名教授所未能解出的难题。

它为何重要

霍夫曼编码,成了有史以来被使用得最广的算法之一。每当你打开一个 ZIP 文件、看一张 JPEG 照片、或播一首 MP3,都有一棵霍夫曼树在悄然运转,把符号装进尽可能少的比特里。因为它简单、快速,且在同类编码中可证明最优,它成了压缩的默认基石——七十年过去,至今仍是。

一个可以想象的画面

想想收拾行李箱。你每天都用的东西——手机、牙刷——你想放在最上层,触手可得;难得用一回的,可以塞到箱底,不那么好拿。霍夫曼对符号也这么做:你最常「伸手去拿」的那些,得到最短的码;而罕见之物,沉到最深、最长的路径上去。如此一摆放,整只箱子便装得尽可能地紧。

一棵覆盖八个符号 A–H 的可交互霍夫曼树:拖动滑块,让某些符号比别的常见得多,看树如何重建——常见符号得到短码、罕见符号得到长码,平均码长降到定长码所用的 3 比特以下。

它的位置

霍夫曼站在克劳德·香农的肩上——香农 1948 年的理论(本馆亦有收录)定义了熵:一条消息究竟最少需要多少比特的真正下界,并证明那个下界可以被逼近。香农与法诺找到了接近它的编码;霍夫曼则找到了「逐符号编码」所能达到的最好那一个。在他之后,来了算术编码,以及更晚近的 ANS,把界限压得更贴近香农的极限。从电报、到 ZIP、再到流媒体视频,这正是一条「把更多意义塞进更少比特」的脉络。

The original document
Original source text

摘要与问题

David A. Huffman · Proceedings of the IRE, vol. 40, no. 9 · September 1952 · pp. 1098–1101 (manuscript received December 6, 1951)
An optimum method of coding an ensemble of messages consisting of a finite number of members is developed. A minimum-redundancy code is one constructed in such a way that the average number of coding digits per message is minimized.
One important method of transmitting messages is to transmit in their place sequences of symbols. If there are more messages which might be sent than there are kinds of symbols available, then some of the messages must use more than one symbol.
Probably the most familiar ensemble code was stated in the phrase “one if by land and two if by sea.” In this case, the message ensemble consisted of the two individual messages “by land” and “by sea,” and the message codes were “one” and “two.”

两条基本限制

The following basic restrictions will be imposed on an ensemble code: (a) No two messages will consist of identical arrangements of coding digits. (b) The message codes will be constructed in such a way that no additional indication is necessary to specify where a message code begins and ends once the starting point of a sequence of messages is known.
Restriction (b) necessitates that no message be coded in such a way that its code appears, digit for digit, as the first part of any message code of greater length.
C. E. Shannon and R. M. Fano have developed ensemble coding procedures … Their coding procedures are not optimum, but approach the optimum behavior when N approaches infinity. … However, up to the present time, no definite procedure has been suggested for the construction of such a code to the knowledge of the author. It is the purpose of this paper to derive such a procedure.

最优二进制步骤

Derived coding requirements
For an optimum code, the length of a given message code can never be less than the length of a more probable message code. If this requirement were not met, then a reduction in average message length could be obtained by interchanging the codes for the two messages in question.
Optimum binary code
Restriction (c) makes it necessary that the two least probable messages have codes of equal length. … It will be necessary to assign these two message codes to the Nth and the (N–1)st messages. … Once this has been done, these two messages are equivalent to a single composite message. … Its probability will be the sum of the probabilities of the two messages from which it was created. The ensemble containing this composite message in the place of its two component messages will be called the first auxiliary message ensemble.
The procedure is applied again and again until the number of members in the most recently formed auxiliary message ensemble is reduced to two. One of each of the binary digits is assigned to each of these two composite messages. These messages are then combined to form a single composite message with probability unity, and the coding is complete.
[ … ]
In the worked example (Table II), an ensemble of N = 13 messages is coded; the resulting average message length is Lav = 3.42 binary digits per message — below the four bits a fixed-length code would need, and close to the source entropy.

推广与致谢

Optimum coding of an ensemble of messages using three or more types of digits is similar to the binary coding procedure. … in order to satisfy restriction (e), it will be required that all these brackets, with the possible exception of one combining the least probable messages of the original ensemble, always combine a number of messages equal to D.
The author is indebted to Dr. W. K. Linvill and Dr. R. M. Fano, both of the Massachusetts Institute of Technology, for their helpful criticism of this paper.