by Neal R. Wagner
Copyright © 2001 by Neal R. Wagner. All rights reserved.
NOTE: This site is obsolete. See book draft (in PDF):
Starting with a a file (that is, a message), one wants a compressed file that is smaller than the original and from with one can recover the original file exactly. The word "lossless" means that not one bit of information is lost during the compression/decompression process.
Claude Shannon proved the following significant result:
To achieve this result, it may be necessary to lump together a large number of messages. In contrast to the proof of Shannon's Noisy Coding Theorem (discussed in the chapter on coding theory), Shannon's Noiseless Coding Theorem has a constructive proof given below as a reasonable method for data compression, though the method is not used any more for actual compression.
Intuitively, a random message has the largest entropy and allows no compression at all. A message that is not random will have some "regularity" to it, some predictability, some "patterns" in the sequence of its bits. Such patterns could be described in a more succinct way, leading to compression.
These concepts provide a new way to describe random sequences: A finite sequence is random if it has no succinct representation, that is, any program or algorithm that will generate the sequence is at least as long as the sequence itself. This is the concept of algorithmic information theory, invented by Chaitin and Kolmogoroff.
Still speaking intuitively, the result of a good compression algorithm is a file that appears random. (If it did not look random, it would be possible to compress it further.) Also, a good compression algorithm will end up relatively close to the entropy, so one knows that no further compression is possible.
Just as with inventors of perpetual motion machines, crackpots fixated on compression regularly announce fantastic compression schemes without knowledge of the limitations imposed by information theory. As I write this (2002), a small company is claiming to compress any random file and recover every bit of the original during decompression, assuming the original file is ``large enough''. Simply counting the number of possible compressed files and the number of their compressed forms easily shows that these claims are impossible, and the argument requires nothing subtle from information theory. People making claims about perpetual motion or faster than light travel are just suggesting a violation of the accepted laws of physics, something that might be true here or in another universe, but the compression crazies are suggesting a violation of the laws of logic -- impossible for reasonable people to imagine.
For one very simple example of compression, consider a file with many sequences of blank characters. One needs a special character (or sequence of characters) to represent a sequence of blanks. Suppose we use a Ascii blank itself for this purpose. This special character must always be followed by an 8-bit number representing the number of blanks. Thus a single blank is represented by two bytes: a normal blank, followed by a 1. If the file contains many long sequences of blanks, each such sequence shorter than 256 blanks could be represented by two bytes. This might provide a large compression. On the other hand, if the file mostly consists of isolated blanks, the above technique will replace single blanks by two-byte sequences, so the "compression" algorithm will output a larger file than the input.
A later section below presents Huffman's compression algorithm, which in a sense is optimal. Huffman's code will provide the explicit solution to Shannon's Noiseless Coding Theorem, but Huffman's algorithm has significant disadvantages. It usually needs an initial statistical analysis of the file itself, and it usually requires transmitting a large decoding table along with the file. For these and other reasons, Huffman's code has been replaced with a large number of other clever compression codes. The complete description if far beyond the scope of this discussion, but the .gif images or the bit stream processed by a modem use very sophisticated algorithms that adapt to the nature of the source file. These methods allow the receiving station to construct a decode table "on the fly" as it carries out decompression.
Images with a .gif suffix use the LZW compression algorithm, which has an interesting history. Two researches named Lempel and Ziv came up with an remarkable compression algorithm which they published in scientific literature, though their companies also patented it: the (surprise) Lempel-Zip method. Later an employee of Unisys named Welch made minor modifications to Lempel-Ziv to produce the LZW algorithm used in .gif images. Unisys patented the algorithm and after its use became widespread started demanding payments for it. I personally resent this situation because Unisys and even Welch had relatively little to do with the breakthrough -- Lempel and Ziv were the ones with the insight -- yet Unisys wants money for its minor modification of a standard algorithm.
Here "lossy" means that the process of compression followed by decompression does not have to yield exactly the original file. One accepts a loss of quality. (Be careful not to spell the word as "lousy" or "loosey", etc.) When the file is executable computer code, this would be unacceptable -- the lower-quality file would not likely be executable any more. If the original file is a picture, however, the recovered file (after compression/decompression) may look nearly as good as the original.
Lossy compression is a huge area with many applications for our society. These include the Jpeg standard for individual pictures and the Mpeg standard for motion pictures. Both these standards are marvels of compression technology. Jpeg is used in modern digital cameras, allowing the image to be saved in far less memory than the original representation of the color of each pixel in the image. Mpeg is the basis for modern DVD movie players and for satellite transmission of motion pictures.
Variable Length and Prefix Codes: Elsewhere in this book codewords only occur that are all the same length (for a given code). However the Huffman code in this section is a variable length code. Another well-known but old-fashioned variable length code is the Morse code widely used for radio and telegraph transmissions. The table below gives the code. The idea was to use short codewords for the most commonly occurring letters and longer codewords for less frequent letters.
| International Morse Code | |||||||
| Letter | Morse | Letter | Morse | Digit | Morse | ||
| A | .- | N | -. | 0 | ----- | ||
| B | -... | O | --- | 1 | .---- | ||
| C | -.-. | P | .--. | 2 | ..--- | ||
| D | -.. | Q | --.- | 3 | ...-- | ||
| E | . | R | .-. | 4 | ....- | ||
| F | ..-. | S | ... | 5 | ..... | ||
| G | --. | T | - | 6 | -.... | ||
| H | .... | U | ..- | 7 | --... | ||
| I | .. | V | ...- | 8 | ---.. | ||
| J | .--- | W | .-- | 9 | ----. | ||
| K | -.- | X | -..- | ||||
| L | .-.. | Y | -.-- | ||||
| M | -- | Z | --.. | ||||
Morse code presents an immediate decoding problem, since for example, an N is "-.", but the codewords for B, C, D, K, X, and Y, also start with ".-". In fact, the code for C is the same as the code for N repeated twice. Just given a sequence of dots and dashes, it is not possible to uniquely break the sequence into letters. For this reason, Morse code requires an extra "symbol": a short pause between letters. (There is a longer pause between words.)
The Huffman code that is the subject here does not have this problem. It is a prefix code, meaning that no codeword is a prefix of another codeword. Not only is it possible to separate any string of 0s and 1s uniquely into codewords, but the decoding is very easy, since a unique entry always matches the next part of the input. (There are other codes that do not have the prefix property, but that are nevertheless uniquely decodable. Such codes are not desirable because of the difficulty of decoding.)
The Huffman code starts with a sequence of symbols (a "file") and computes the percent frequency of each symbol.
Example 1. For example, if the sequence (or file) is aaaabbcd, then the frequency table is:
Symbol: a, Weight: 0.5
Symbol: b, Weight: 0.25
Symbol: c, Weight: 0.125
Symbol: d, Weight: 0.125
The Huffman algorithm first
regards the sequence of symbols as a sequence of single-element trees
with the symbol and the frequency at the root node of each
tree (the only node in this case).
Then the algorithm repeatedly combines the two least frequent root nodes
as left and right subtrees of a new root node
with symbol @ and frequency the sum of the two
previous frequencies. (I just chose @ at random.)Thus, referring to the picture below, the first step combines the single-node trees for letters c and d, each with frequencies 0.125, into a single tree, with subtrees the trees above, and a root node with frequency 0.125 + 0.125 = 0.25. Now this process is repeated with the new node and the old node for the letter b with frequency 0.25. The final combination yields a single tree with frequency 1.0.
If there are multiple choices for the "least frequent" root nodes, then make any choice. The resulting Huffman trees and the resulting codes may not be the same, but the average code lengths must always work out the same.
In the specific example above, one gets
+---d: 0.1250 (step 1)
|
+---+---@: 0.2500 (step 2)
| |
| +---c: 0.1250 (step 1)
|
+---+---@: 0.5000 (step 3)
| |
| +---b: 0.2500 (step 2)
|
---+---@: 1.0000
|
+---a: 0.5000 (step 3)
So step 1 combines c and d,
yielding a new root node of frequency 0.25.
Then step 2 combines
the result of step 1 with b. That result is combined
in step 3 with a, to give a single root node with
frequency 1.0.In the final part of the algorithm, one heads down the tree from the root as shown above, building a code string as one goes, adding a 0 as one goes up and a 1 as one goes down. The resulting binary codes for symbols are shown in bold below (note the codes for intermediate nodes also):
+---d: 0.1250, 000 (step 1)
|
+---+---@: 0.2500, 00 (step 2)
| |
| +---c: 0.1250, 001 (step 1)
|
+---+---@: 0.5000, 0 (step 3)
| |
| +---b: 0.2500, 01 (step 2)
|
---+---@: 1.0000,
|
+---a: 0.5000, 1 (step 3)
To encode using the result of the Huffman algorithm, one makes up
a code table consisting of each symbol followed by the
corresponding codeword. To encode, look up each symbol in the
table, and fetch the corresponding codeword.
(Encoding can be efficient if the table is arranged so that
binary search or hashing is possible.)
Symbol: a, Codeword: 1
Symbol: b, Codeword: 01
Symbol: c, Codeword: 001
Symbol: d, Codeword: 000
The same table could be used for decoding, by looking up
successive sequences of code symbols, but this would not be
efficient.
The process of decoding can be made simple and efficient by using
the above Huffman coding tree itself. Start at the root (left side) of
the tree and process the code symbols
0 and 1 one at a time.
For a 0 head upward and for a 1
head downward. When a leaf node (one with no subnodes)
is reached, the symbol at that
node is the one being decoded. For example, in decoding the
string 001, start at the root, head up because
of the first (leftmost) 0. Then head up again
because of the second 0. Finally head down
because of the final 1. This is now a leaf node
holding c, so that is the symbol decoded from
001.
The diagram below shows the path through the tree in red boldface:
+---d: 0.1250, 000 (step 1)
|
+---+---@: 0.2500, 00 (step 2)
| |
| +---c: 0.1250, 001 (step 1)
|
+---+---@: 0.5000, 0 (step 3)
| |
| +---b: 0.2500, 01 (step 2)
|
---+---@: 1.0000,
|
+---a: 0.5000, 1 (step 3)
The entropy of the four symbols above with the given probabilities
is 1.75, and this is exactly the same as the
average code length, given by
0.5 * 1 + 0.25 * 2 + 0.125 * 3 + 0.125 * 3 = 1.75.Huffman codes are always optimal (the best possible), but this particular code has average code length equal to the entropy, and it is never possible to create a code with shorter average length. Most Huffman codes have average code length greater than the entropy (unless all frequencies are a fraction with numerator and denominator a power of 2). The next simple example shows this.
Example 2. Start with the file aaaaaabc. Here is the frequency table and the tree, along with the the code strings:
Symbol: a, Weight: 0.75
Symbol: b, Weight: 0.125
Symbol: c, Weight: 0.125
+---c: 0.1250, 00 (step 1)
|
+---+---@: 0.2500, 0 (step 2)
| |
| +---b: 0.1250, 01 (step 1)
|
---+---@: 1.0000,
|
+---a: 0.7500, 1 (step 2)
In this case, the entropy is 1.061278 while the
average code length is 1.25.Product Codes: Now suppose one forms the "product" code of the code in Example 2 by considering all possible pairs of symbols and their respective probabilities, which are the products of the respective probabilities:
Symbol: A (for aa), Weight: 0.5625 = 0.75 * 0.75 Symbol: B (for ab), Weight: 0.09375 = 0.75 * 0.125 Symbol: C (for ba), Weight: 0.09375 = 0.125 * 0.75 Symbol: D (for ac), Weight: 0.09375 = 0.75 * 0.125 Symbol: E (for ca), Weight: 0.09375 = 0.125 * 0.75 Symbol: F (for bb), Weight: 0.015625 = 0.125 * 0.125 Symbol: G (for bc), Weight: 0.015625 = 0.125 * 0.125 Symbol: H (for cb), Weight: 0.015625 = 0.125 * 0.125 Symbol: I (for cc), Weight: 0.015625 = 0.125 * 0.125Here is the corresponding Huffman tree, along with the code words:
+---D: 0.0938, 000 (step 5)
|
+---+---@: 0.1875, 00 (step 7)
| |
| +---C: 0.0938, 001 (step 5)
|
+---+---@: 0.4375, 0 (step 8)
| |
| | +---B: 0.0938, 010 (step 6)
| | |
| +---+---@: 0.2500, 01 (step 7)
| |
| | +---G: 0.0156, 011000 (step 2)
| | |
| | +---+---@: 0.0312, 01100 (step 3)
| | | |
| | | +---F: 0.0156, 011001 (step 2)
| | |
| | +---+---@: 0.0625, 0110 (step 4)
| | | |
| | | | +---I: 0.0156, 011010 (step 1)
| | | | |
| | | +---+---@: 0.0312, 01101 (step 3)
| | | |
| | | +---H: 0.0156, 011011 (step 1)
| | |
| +---+---@: 0.1562, 011 (step 6)
| |
| +---E: 0.0938, 0111 (step 4)
|
---+---@: 1.0000,
|
+---A: 0.5625, 1 (step 8)
With this product code, the entropy and average code length work out to be
2.122556 and 2.15625,
but each new symbol (upper-case letter) represents two of
the original symbols. Dividing by 2
gives the original value for the entropy, but the average code length
(per original symbol, not per new symbol) is
1.078125, which is much closer to the
entropy value of 1.061278
than the previous 1.25.If one takes three symbols at a time, the average code length goes down to 1.0703125. Continuing with four at a time, five at a time, and so forth, it can be proved that the resulting average code lengths get arbitrarily close to the entropy: 1.061278. This, in turn, proves Shannon's Noiseless Coding Theorem, stated earlier.
Even though Huffman's code is optimal (it yields the best possible code for a collection of symbols and frequencies), the other adaptive algorithms (LZ or LZW) usually do a much better job of compressing a file. How can this be, since Huffman is optimal? Suppose one gathers statistics of frequencies of letters in English and creates an optimal Huffman code. Such an analysis does not consider strings of letters in English, but only individual letters. The Huffman product code for two letters at a time just assumes the frequencies are the products of individual frequencies, but this is not true for actual English. For example, for pairs of letters with "q" as the first letter, most pairs have probability 0, while the pair "qu" is about as common as "q" by itself. It is possible to gather statistics about pairs of letters in English and create a Huffman code for the pairs. One could proceed to statistics about longer strings of letters and to corresponding Huffman codes. The resulting Huffman codes would eventually perform better than the adaptive algorithms, but they would require an unacceptable amount of statistical processing, and an unacceptably large code table. In contrast, the adaptive algorithms do not need to transmit a code table, and they eventually adapt to long strings that occur over and over.
Here is a computer implementation of the Huffman algorithm: Huffman Java source. This software generates printouts of the Huffman trees and other data reproduced above, but it does not read or write actual binary data, nor does it transmit the dictionary (or the frequency table) along with the data, so that decoding is possible at the other end. See Huffman case study for a more complete implementation.
Exercises. Devise a simple example where there are different choices for the least trees and where the Huffman algorithm yields different answers. Get an example where there are even two different distributions for the lengths of the codewords. Verify that the average code lengths are the same for the two examples. [Ans: See here for an answer.]