Huffman Coding
A Case Study in Lossless Compression Using Variable Length Coding
Overview
File compression, particularly for multimedia data, is widely used to
reduce Internet traffic and transfer times. Two common compression formats
for images are GIF and JPEG. JPEG encoding throws away information about
the image, so the original image cannot be reconstructed exactly from the
compressed image. GIF compresses its pixel data losslessly, but it is limited
to a palette of at most 256 colors, so converting a full-color image to GIF
can also discard information.
Compression that discards information is called lossy compression. Lossy
compression can be very effective for multimedia data. JPEG encoding, for
example, can reduce the size of an image by a factor of 20 or more without a
noticeable loss of image quality.
Lossy compression cannot be used for text and data files, because you
need to get the original file back exactly when you uncompress it. These
files require lossless compression. Many lossless compression schemes use
variable length encoding. One of the earliest and most commonly used is
Huffman coding.
Every file can be thought of as a sequence of bytes (values from 0 to
255). Uncompressed files use 8 bits for each possible byte value. The idea
of Huffman coding is simply to devise an encoding scheme so that the most
frequently occurring byte values are represented by a short code (fewer
bits) and the less frequently occurring byte values are represented by
a longer code (more bits). Huffman encoding significantly reduces the size
of files, but it requires that you be able to read or write a bit at a
time.
The algorithm for Huffman coding generates a binary tree whose left
and right branches are labeled by 0 and 1 respectively
as shown in the diagram below. The leaves of the tree are unique bytes
that appear in the file (the alphabet). The path from the root to
the leaf gives the encoding of the byte represented by that node. Leaves
that are close to the root have short encodings and leaves that are farther
away have longer encodings. The trick is to generate a tree where the most
frequently occurring byte values are placed in leaves close to the root.
Example 1: The string bcbbbbbbaacaabbcade has the Huffman tree shown in Fig. 1.
Figure 1: Image of a Huffman Tree representing bcbbbbbbaacaabbcade.
The number in each node is its index in the Huffman encoding table.
a: 11
b: 0
c: 101
d: 1001
e: 1000

This case study implements the Huffman encoding and decoding algorithms. You should download and unzip
Huffmanoriginal.zip as the starting point for this project. A
Huffman encoding is not unique: the exact encoding depends on how the priority queue treats
ties. Be sure to set the required library of the project to the queues/classes directory in this project so that
your runs agree with those in the handout.
Huffman Decoding
To decode a Huffman-encoded bit string, start at the root of the Huffman
tree and use the input bits to determine the path to the leaf. This
is done in the method writeUnencodedFile in HuffmanDecoder.
- Start at the root of the tree.
- For each bit in the input stream:
  - If the bit is a 0, take the left branch.
  - If the bit is a 1, take the right branch.
  - If at a leaf, output the leaf's byte value and reset position to the root.
Exercise 1: Use the Huffman tree of Figure 1 to decode the
bit string 0001101011000.
Ans: bbbabce.
The implementation details for Huffman decoding are covered later in the case study.
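Ahead of the full implementation, the decoding walk can be sketched in a few lines of Java. This is only an illustration: the Node class, the method names, and the use of a String of '0' and '1' characters in place of a real bit stream are assumptions made for the sketch, not part of the project code. The tree is rebuilt from the code table listed under Figure 1, and the input is assumed to be a valid encoding.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class PrefixDecodeSketch {
    /** Hypothetical tree node used only for this illustration. */
    static final class Node {
        Node left, right;
        char symbol; // meaningful only at a leaf
        boolean isLeaf() { return left == null && right == null; }
    }

    /** Builds a tree in which 0 means "go left" and 1 means "go right". */
    static Node buildTree(Map<Character, String> codes) {
        Node root = new Node();
        for (Map.Entry<Character, String> e : codes.entrySet()) {
            Node cur = root;
            for (char bit : e.getValue().toCharArray()) {
                if (bit == '0') {
                    if (cur.left == null) cur.left = new Node();
                    cur = cur.left;
                } else {
                    if (cur.right == null) cur.right = new Node();
                    cur = cur.right;
                }
            }
            cur.symbol = e.getKey();
        }
        return root;
    }

    /** Walks the tree one bit at a time, emitting a symbol at each leaf. */
    static String decode(Node root, String bits) {
        StringBuilder out = new StringBuilder();
        Node cur = root;
        for (char bit : bits.toCharArray()) {
            cur = (bit == '0') ? cur.left : cur.right;
            if (cur.isLeaf()) {
                out.append(cur.symbol);
                cur = root; // reset to the root for the next symbol
            }
        }
        return out.toString();
    }

    /** Decodes with the code table listed under Figure 1. */
    static String decodeWithFigure1(String bits) {
        Map<Character, String> codes = new LinkedHashMap<>();
        codes.put('a', "11");
        codes.put('b', "0");
        codes.put('c', "101");
        codes.put('d', "1001");
        codes.put('e', "1000");
        return decode(buildTree(codes), bits);
    }

    public static void main(String[] args) {
        // The bit string of Exercise 1.
        System.out.println(decodeWithFigure1("0001101011000"));
    }
}
```

Running main on the bit string of Exercise 1 should reproduce the answer given there.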
Huffman Encoding
The Huffman tree of Figure 1 is generated by making an initial pass through the
data file to determine the number of times each character in the alphabet
appears in the file. The tree can be easily generated from a table of these frequencies.
The Huffman tree is constructed so that the leaves representing the
most frequently appearing bytes are closest to the root of the tree.
Each character in the input file is encoded by a bit pattern.
The encoding works from the leaves of the Huffman tree upward. Start
at the leaf node representing the next character (byte) in the original file. Follow
the path up to the root to get the encoding.
In order to implement the strategy efficiently, you will
need a parent pointer representation of the tree.
The bits come off in reverse order, so you will
need to do the traversal recursively, save the bits on a stack, or prepend rather than append the
bits to a string.
Exercise 2: Encode the file containing bcbbbbbbaacaabbcade
using the Huffman tree above.
Ans: The answer is 010100000011111011111001011110011000.
In fact, the Huffman tree was built from this file.
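The leaf-to-root traversal can be sketched as follows. The three arrays are copied from the parent and childType columns of the encoding table that appears later as Figure 2. The class name, the method names, and the linear search for the leaf are assumptions made for this sketch (the case study replaces the search with an inverted table). Bits are prepended to a StringBuilder, which is one of the three options mentioned above.

```java
final class ParentPointerEncodeSketch {
    // Columns copied from the encoding table of Figure 2 (table positions 0..8).
    static final int[] VALUE = {98, 99, 97, 100, 101, 128, 129, 130, 131};
    static final int[] PARENT = {8, 6, 7, 5, 5, 6, 7, 8, -1};
    static final boolean[] RIGHT_CHILD =
        {false, true, true, true, false, false, false, true, false};
    static final int LEAF_COUNT = 5; // the first five entries are the leaves

    /** Finds the table position of a leaf by linear search (fine for a sketch). */
    static int leafIndex(int byteValue) {
        for (int i = 0; i < LEAF_COUNT; i++) {
            if (VALUE[i] == byteValue) return i;
        }
        throw new IllegalArgumentException("byte not in alphabet: " + byteValue);
    }

    /** Climbs from the leaf to the root, prepending one bit per step. */
    static String codeFor(int byteValue) {
        StringBuilder bits = new StringBuilder();
        int node = leafIndex(byteValue);
        while (PARENT[node] != -1) {
            bits.insert(0, RIGHT_CHILD[node] ? '1' : '0');
            node = PARENT[node];
        }
        return bits.toString();
    }

    /** Concatenates the codes of all characters in the text. */
    static String encode(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            out.append(codeFor(c));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("bcbbbbbbaacaabbcade"));
    }
}
```

Prepending to a StringBuilder costs time proportional to the code length on every step, which is acceptable here because codes are short; the recursive approach avoids building a string at all.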
Algorithm for Huffman Encoding
Step I: Generate the table of frequencies
Make one pass through the original file to determine:
- The number of bytes in the file (original file size).
- The number of unique bytes (the alphabet size).
- The number of times each unique byte appears in the file (the weights).
Step II: Build the encoding tree
- Put all of the leaf nodes in a priority queue based on their weights.
- Do the following until only one node (the root) remains in the priority queue:
  - Delete the smallest two nodes.
  - Create a new node with these deleted nodes as children. Its weight is the sum of their weights.
  - Set the parent pointers of the deleted nodes in the table.
  - Set the child type of the deleted nodes in the table.
  - Add the new node to the priority queue.
Step III: Encoding the characters
For each character in the input file, find the leaf node corresponding to that
character and then traverse the tree to its root, prepending a bit at each step:
- If the current node has no parent, return the string for this character.
- If the current node is a left child, prepend a 0.
- If the current node is a right child, prepend a 1.
- Then move to the parent of the current node and repeat.
The tree-building algorithm of Step II is called a greedy algorithm because it always
picks the two nodes with the smallest weights. It can be proved that this greedy
algorithm always results in an optimal tree, but this proof is a topic
for Analysis of Algorithms.
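The greedy merging step can be tried out in isolation with a short sketch. It uses java.util.PriorityQueue from the standard library rather than the project's own priority queue, and the class and method names are invented for this illustration. Instead of building the tree, it adds up the weights of the merged nodes. Each merge pushes every character below the new node one level deeper, adding one bit per occurrence, so that sum equals the total number of bits in the encoded data.

```java
import java.util.PriorityQueue;

final class GreedyMergeSketch {
    /** Returns the number of bits needed to encode data with the given character weights. */
    static long encodedBits(int[] weights) {
        PriorityQueue<Long> queue = new PriorityQueue<>();
        for (int w : weights) {
            queue.add((long) w);
        }
        long totalBits = 0;
        while (queue.size() > 1) {
            // Greedy choice: always combine the two smallest weights.
            long merged = queue.poll() + queue.poll();
            totalBits += merged;
            queue.add(merged);
        }
        return totalBits;
    }

    public static void main(String[] args) {
        // Weights of b, c, a, d, e in bcbbbbbbaacaabbcade.
        System.out.println(encodedBits(new int[] {9, 3, 5, 1, 1}));
    }
}
```

For the weights of the running example the merges are 1 + 1 = 2, 2 + 3 = 5, 5 + 5 = 10 and 9 + 10 = 19, which sum to 36, the length of the answer to Exercise 2. Ties may be broken differently than in the project's queue, which can change individual codes but not this total.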
Implementation of the encoding process
To implement the Huffman encoding algorithm, we will need a tree representation with parent pointers.
One possibility is to modify the BinaryTreeNode to include a parent pointer. However, the
Huffman encoding tree has a special form. We will show later that if the string has n unique letters,
the Huffman encoding tree has 2*n - 1 nodes. For trees of a bounded size, an array representation of
the tree is very efficient. If we are only encoding documents that contain the standard ASCII character set, we
can have at most 128 unique letters (corresponding to byte values 0 through 127). Therefore the Huffman
encoding tree for ASCII documents always fits into a table of size 255.
Figure 2 represents the Huffman encoding tree of Figure 1 in an array. The parent column gives the
index of the parent in the encoding tree. The childType is true if this node is a right child
of its parent and false otherwise. The leftChild and rightChild are used later
in the decoding.
Figure 2: The Huffman encoding table for the tree of Figure 1.
| value | index | weight | parent | childType | leftChild | rightChild |
|---|---|---|---|---|---|---|
| 98 (b) | 0 | 9 | 8 | false | -1 | -1 |
| 99 (c) | 1 | 3 | 6 | true | -1 | -1 |
| 97 (a) | 2 | 5 | 7 | true | -1 | -1 |
| 100 (d) | 3 | 1 | 5 | true | -1 | -1 |
| 101 (e) | 4 | 1 | 5 | false | -1 | -1 |
| 128 (T1) | 5 | 2 | 6 | false | -1 | -1 |
| 129 (T2) | 6 | 5 | 7 | false | -1 | -1 |
| 130 (T3) | 7 | 10 | 8 | true | -1 | -1 |
| 131 (T4) | 8 | 19 | -1 | false | -1 | -1 |
One of the difficulties with the table representation is that, during encoding, we would have to do a linear search
for each letter in the input string in order to find its position in the Huffman encoding table.
We will use the inverted table as shown in Figure 3 to speed up the
lookup process. The index entry for each
ASCII character appears in the same index position as its ASCII code. For example, 'b' has
ASCII code 98, so it appears in position 98 of the inverted table. The entry in
position 98 is 0, corresponding to the index into the encoding table. We only need
entries for the possible input characters, not for the temporary nodes in the Huffman tree, so
the inverted table only has 128 entries. Databases frequently use inverted tables to speed lookup.
Figure 3: The index or inverted table for the Huffman encoding table of Figure 2.
| Index | Entry (Index in Huffman encoding table) |
|---|---|
| 0 | -1 |
| 1 | -1 |
| 2 | -1 |
| ... | ... |
| 97 (a) | 2 |
| 98 (b) | 0 |
| 99 (c) | 1 |
| 100 (d) | 3 |
| 101 (e) | 4 |
| ... | ... |
| 127 | -1 |
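A minimal sketch of how such an inverted table could be filled in from the value column of the encoding table is shown below. The class name, the method name, and the constant are assumptions for illustration; in the project the inverted table is maintained incrementally by insertInTable. Entries for temporary nodes (values of 128 and above) are skipped, as described in the text.

```java
import java.util.Arrays;

final class InvertedTableSketch {
    static final int ALPHABET = 128; // standard ASCII, byte values 0 through 127

    /**
     * tableValues[i] is the byte value stored at position i of the encoding table.
     * The result maps a byte value to its table position, or -1 if absent.
     */
    static int[] buildInvertedTable(int[] tableValues) {
        int[] inverted = new int[ALPHABET];
        Arrays.fill(inverted, -1);
        for (int i = 0; i < tableValues.length; i++) {
            if (tableValues[i] < ALPHABET) { // skip temporary (internal) nodes
                inverted[tableValues[i]] = i;
            }
        }
        return inverted;
    }

    public static void main(String[] args) {
        int[] inverted =
            buildInvertedTable(new int[] {98, 99, 97, 100, 101, 128, 129, 130, 131});
        System.out.println("position of 'b': " + inverted['b']);
        System.out.println("position of 'z': " + inverted['z']);
    }
}
```

The lookup is a single array access, so finding a character's leaf takes constant time instead of a scan over the encoding table.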
Format for the Huffman Table
Encoding and decoding will be done using an encoding table implemented as an
array of HuffmanCodingEntry objects that have the following fields:
- value: the byte that the entry represents.
- index: the index of the byte in the Huffman table, or -1 if the entry is not in the table.
- weight: the number of times the byte appears in the file.
- parent: the position in the table of the node's parent after the encoding tree is built. (The root of the encoding tree has a parent of -1.)
- childType: an indication of whether the entry is a left child (false) or a right child (true) of its parent.
- rightChild: a reference to its right child if the entry is not a leaf, otherwise -1.
- leftChild: a reference to its left child if the entry is not a leaf, otherwise -1.
For encoding, only the value, weight, parent and childType entries are relevant.
Initial scan to compute character counts (Step I of the encoding algorithm)
The computeCharacterCounts method of Figure 4 makes the initial pass
through the document to compute the character counts
(Step I of the algorithm).
Before computeCharacterCounts is called, the childType of each entry is
initialized to false and the parent, leftChild and rightChild values of each
entry are initialized to -1. The currentTableSize gives the number of
unique characters encountered so far in the document.
Figure 4: Code to implement Step I of the Huffman encoding algorithm
private void computeCharacterCounts() throws IOException {
    FileInputStream f = null; // could also use BitInputStream here
    try { // we catch and rethrow the exception to be sure to close file
        f = new FileInputStream(unencodedFileName);
        int thisByte;
        while ((thisByte = f.read()) != -1) {
            insertInTable(thisByte, 1);
            unencodedFileSize++;
        }
    } catch (Exception e) {
        throw new IOException("Exception " + e.getMessage() +
                              " reading " + unencodedFileName);
    } finally { // always want to close the file no matter what
        if (f != null)
            f.close();
    }
    alphabetSize = currentTableSize; // now have the unique characters
}

private void insertInTable(int v, int weight) {
    if (invertedTable[v] == -1) { // if v is not in the table add it
        encodingTable[currentTableSize] =
            new HuffmanCodingEntry(v, currentTableSize, weight, -1, false);
        invertedTable[v] = currentTableSize;
        currentTableSize++;
    } else // add weight to the total weight of the character v
        encodingTable[invertedTable[v]].addWeight(weight);
}
Constructing the non-leaf portion of the encoding tree (Step II of the Huffman encoding algorithm)
After a table of character frequencies has been computed, we need to construct a binary encoding tree in the table by
setting the parent entries. The leaf nodes are just the entries in
the table above. Each nonleaf node has a weight that is the sum of the
weights of its children. The tree is constructed so that any node at a
particular level in the tree has weights that are at least as great as
nodes at lower levels in the tree.
Exercise 3: After the table of frequencies has been computed, the entries are
added to a priority queue. What will that queue contain?
Ans:
| e, 4, 1, -1, false, -1, -1 |
| d, 3, 1, -1, false, -1, -1 |
| c, 1, 3, -1, false, -1, -1 |
| a, 2, 5, -1, false, -1, -1 |
| b, 0, 9, -1, false, -1, -1 |
Exercise 4: After the first two elements of the priority queue have been removed and
a new node whose weight is the sum of their weights is added, what will the priority queue look like?
Ans: After removing the two smallest elements (e and
d) and creating a new element, T1, whose weight is 2, the priority queue contains:
| T1, 5, 2, -1, false, -1, -1 |
| c, 1, 3, -1, false, -1, -1 |
| a, 2, 5, -1, false, -1, -1 |
| b, 0, 9, -1, false, -1, -1 |
Exercise 5: What does the encoding table look like after the operations of
Exercise 4 have been performed?
Ans: A new node containing T1 has been added as shown in Figure 5.
Notice that the parent and childType fields of the objects
representing d and e have been modified to reflect the
fact that d and e are now children of T1.
Figure 5: The Huffman Table after the nodes containing d and e have been processed.
| value | index | weight | parent | childType | leftChild | rightChild |
|---|---|---|---|---|---|---|
| 98 (b) | 0 | 9 | -1 | false | -1 | -1 |
| 99 (c) | 1 | 3 | -1 | false | -1 | -1 |
| 97 (a) | 2 | 5 | -1 | false | -1 | -1 |
| 100 (d) | 3 | 1 | 5 | true | -1 | -1 |
| 101 (e) | 4 | 1 | 5 | false | -1 | -1 |
| 128 (T1) | 5 | 2 | -1 | false | -1 | -1 |
Exercise 6: Suppose the priority queue is now in the
state specified in the answer to Exercise 4. What does the queue
look like after the two smallest elements from the
priority queue (T1 and c) have been removed and combined?
Ans: The priority queue is now:
| T2, 6, 5, -1, false, -1, -1 |
| a, 2, 5, -1, false, -1, -1 |
| b, 0, 9, -1, false, -1, -1 |
Exercise 7: What is the status of the encoding table after the operations
of Exercise 6 have been completed?
Ans: The encoding table is shown in Figure 6.
Figure 6: The encoding table after the operations of Exercise 6 have been completed.
| value | index | weight | parent | childType | leftChild | rightChild |
|---|---|---|---|---|---|---|
| 98 (b) | 0 | 9 | -1 | false | -1 | -1 |
| 99 (c) | 1 | 3 | 6 | true | -1 | -1 |
| 97 (a) | 2 | 5 | -1 | false | -1 | -1 |
| 100 (d) | 3 | 1 | 5 | true | -1 | -1 |
| 101 (e) | 4 | 1 | 5 | false | -1 | -1 |
| 128 (T1) | 5 | 2 | 6 | false | -1 | -1 |
| 129 (T2) | 6 | 5 | -1 | false | -1 | -1 |
Exercise 8: Suppose the priority queue is now at the
state specified in the answer to Exercise 6. What does the queue
look like after the two smallest elements from the
priority queue (T2 and a) have been removed and combined?
Ans: The priority queue is now:
| b, 0, 9, -1, false, -1, -1 |
| T3, 7, 10, -1, false, -1, -1 |
Exercise 9: What is the status of the encoding table after the operations
of Exercise 8 have been completed?
Ans: The encoding table is shown in Figure 7.
Figure 7: The encoding table after the operations of Exercise 8 have been completed.
| value | index | weight | parent | childType | leftChild | rightChild |
|---|---|---|---|---|---|---|
| 98 (b) | 0 | 9 | -1 | false | -1 | -1 |
| 99 (c) | 1 | 3 | 6 | true | -1 | -1 |
| 97 (a) | 2 | 5 | 7 | true | -1 | -1 |
| 100 (d) | 3 | 1 | 5 | true | -1 | -1 |
| 101 (e) | 4 | 1 | 5 | false | -1 | -1 |
| 128 (T1) | 5 | 2 | 6 | false | -1 | -1 |
| 129 (T2) | 6 | 5 | 7 | false | -1 | -1 |
| 130 (T3) | 7 | 10 | -1 | false | -1 | -1 |
Exercise 10: Suppose the priority queue is now at the
state specified in the answer to Exercise 8. What does the queue
look like after the two smallest elements from the
priority queue (b and T3) have been removed and combined?
Ans: The priority queue is now:
| T4, 8, 19, -1, false, -1, -1 |
Exercise 11: What is the status of the Huffman coding table after the operations
of Exercise 10 have been completed?
Ans: The completed encoding table is shown in Figure 2.
Exercise 12: Suppose the priority queue is now at the
state specified in the answer to Exercise 10. What happens when we attempt
to remove two elements?
Ans: Since the queue contains only one element, the removeMin
operation for the second element throws a NoSuchElementException
and the algorithm stops. Note: the selection of which node is right
versus left child is arbitrary.
Exercise 13: What are the character encodings generated by
the encoding table of Figure 2?
Ans:
98(b): 0
99(c): 101
97(a): 11
100(d): 1001
101(e): 1000
Exercise 14: If there are n unique characters in the alphabet, how
many entries will there be in the encoding table?
Ans: The priority queue starts with n entries corresponding to
the unique characters. At each step, 2 entries are removed and a new temporary node
is created and enqueued. Thus, at each step, the priority queue has one fewer element.
This process is repeated n - 1 times, until a single entry remains, so n - 1 temporary nodes are created.
Thus the total number of nodes in the tree is n + (n - 1) or 2*n - 1.
Figure 8 shows the implementation of Step II of the Huffman encoding algorithm.
Figure 8: Implementation of Step II of the Huffman encoding algorithm
private void buildTree() {
    int nextInternalValue = MAXIMUM_UNIQUE_CHARACTERS; // 128
    PriorityQueueADT q = new LinkedPriorityQueue();
    for (int i = 0; i < alphabetSize; i++) // start with all of the leaves
        q.add(encodingTable[i]);
    try {
        while (true) {
            // remove the two entries with the smallest weights
            HuffmanCodingEntry T1 = (HuffmanCodingEntry) (q.removeMin());
            HuffmanCodingEntry T2 = (HuffmanCodingEntry) (q.removeMin());
            // the new parent will be stored at position currentTableSize
            T1.setParent(currentTableSize);
            T2.setParent(currentTableSize);
            T2.setChildType(true); // T1 stays a left child (false)
            insertInTable(nextInternalValue, T1.getWeight() + T2.getWeight());
            q.add(encodingTable[currentTableSize - 1]); // enqueue the new parent
            nextInternalValue++;
        }
    } catch (NoSuchElementException e) { // expected to happen at end
    }
}
Format for the encoded file
Each entry in the encoding/decoding table is represented by a
HuffmanCodingEntry object. The HuffmanEncoder computes the encoding table from the
original file. The HuffmanDecoder does not have access to the
original file, so it must read the encoding table from the encoded file.
The format for representing the encoding table at the beginning of the encoded file is:
- Total number of bytes in original file (an int).
- Number of characters in the alphabet (an int).
- The entries of the decoding table (2*n - 1 entries where n is the number of characters in the alphabet)
- Huffman-encoded bitstream for the file ....
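The header layout above can be sketched with the standard DataOutputStream. The bit layout used to pack an entry into one int is purely hypothetical: this section does not describe the layout that HuffmanCodingEntry.encode() actually produces, so the pack and unpack methods below are only one way to fit the value, parent, and childType into 32 bits. All class and method names are invented for this sketch.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class HeaderLayoutSketch {
    // Hypothetical packing: value in bits 16-23, parent + 1 in bits 1-9, childType in bit 0.
    static int pack(int value, int parent, boolean rightChild) {
        return (value << 16) | ((parent + 1) << 1) | (rightChild ? 1 : 0);
    }

    static int unpackValue(int packed) { return (packed >>> 16) & 0xFF; }

    static int unpackParent(int packed) { return ((packed >>> 1) & 0x1FF) - 1; }

    static boolean unpackRightChild(int packed) { return (packed & 1) == 1; }

    /** Writes file size, alphabet size, then one int per table entry. */
    static byte[] writeHeader(int fileSize, int alphabetSize, int[] packedEntries) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(fileSize);
            out.writeInt(alphabetSize);
            for (int entry : packedEntries) {
                out.writeInt(entry);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        int n = 5;                          // alphabet size of the running example
        int[] entries = new int[2 * n - 1]; // placeholder entries, all zero
        System.out.println(writeHeader(19, n, entries).length + " header bytes");
    }
}
```

With 4-byte ints the header occupies 4 + 4 + 4*(2*n - 1) bytes, which is 44 bytes for the five-character example.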
Figure 9 shows the implementation of Step III of the Huffman encoding algorithm.
Figure 9: Code to implement Step III of the encoding algorithm
private void writeEncodedFile() throws IOException {
    BitInputStream f = null;
    try {
        f = new BitInputStream(unencodedFileName);
        // Header: file size, alphabet size, then the table entries
        encodedFile.writeInt(unencodedFileSize);
        encodedFile.writeInt(alphabetSize);
        for (int i = 0; i < currentTableSize; i++)
            encodedFile.writeInt(encodingTable[i].encode());
        // Read one character at a time, encode it, and write to file
        int thisChar = 0;
        for (int j = 0; j < unencodedFileSize; j++) {
            if ((thisChar = f.readByte()) == -1)
                throw new IOException("Unexpected end of file reading character "
                                      + j + " in " + unencodedFileName);
            writeEncodedChar(thisChar, encodedFile);
        }
        encodedFile.flush();
    } catch (Exception e) {
        throw new IOException("Error encoding file: " + e.getMessage());
    } finally { // No matter what we want to release system resources
        encodedFile.close();
        if (f != null)
            f.close();
    }
}

private void writeEncodedChar(int thisChar, BitOutputStream bout) throws IOException {
    int theEntry = invertedTable[thisChar]; // Find the position in the table
    int theParent = encodingTable[theEntry].getParent();
    if (theParent == -1) // reached the root: nothing more to write
        return;
    // Write the bits for the ancestors first so they come out root to leaf
    writeEncodedChar(encodingTable[theParent].getValue(), bout);
    bout.writeBit(encodingTable[theEntry].isRightChild());
}
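Writing one bit at a time means collecting bits until a whole byte is available. The sketch below shows one common way to do this. It is a stand-in written for illustration, not the BitOutputStream class provided with the project, and its names and its choice to pad the last byte with zero bits are assumptions.

```java
import java.io.ByteArrayOutputStream;

final class BitWriterSketch {
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();
    private int current = 0; // bits collected so far for the next byte
    private int count = 0;   // number of bits held in current

    /** Appends one bit; a full byte is written as soon as 8 bits are collected. */
    void writeBit(boolean bit) {
        current = (current << 1) | (bit ? 1 : 0);
        count++;
        if (count == 8) {
            out.write(current);
            current = 0;
            count = 0;
        }
    }

    /** Pads any partial final byte with zero bits and returns all bytes written. */
    byte[] finish() {
        if (count > 0) {
            out.write(current << (8 - count));
            current = 0;
            count = 0;
        }
        return out.toByteArray();
    }

    /** Convenience helper: packs a string of '0' and '1' characters into bytes. */
    static byte[] pack(String bits) {
        BitWriterSketch writer = new BitWriterSketch();
        for (char c : bits.toCharArray()) {
            writer.writeBit(c == '1');
        }
        return writer.finish();
    }

    public static void main(String[] args) {
        byte[] packed = pack("010100000011111011111001011110011000");
        System.out.println(packed.length + " bytes");
    }
}
```

The 36 bits of the running example fit in 5 bytes, with 4 padding bits at the end. Because of that padding, a decoder cannot tell from the bit stream alone where the data stops, which is why the original file size is stored in the header.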
Implementing the Huffman Decoding Process
Laboratory 11
The HuffmanDecoder uses the same coding table as the HuffmanEncoder.
The encoded file contains an int that
represents the childType, value, and parent of each entry.
The decoding table that is read for the bcbbbbbbaacaabbcade example is shown in Figure 10.
Notice that all of the weight values are zero, since the HuffmanDecoder
doesn't need them.
Figure 10: The table read for decoding bcbbbbbbaacaabbcade.
| value | index | weight | parent | childType | leftChild | rightChild |
|---|---|---|---|---|---|---|
| 98 (b) | 0 | 0 | 8 | false | -1 | -1 |
| 99 (c) | 1 | 0 | 6 | true | -1 | -1 |
| 97 (a) | 2 | 0 | 7 | true | -1 | -1 |
| 100 (d) | 3 | 0 | 5 | true | -1 | -1 |
| 101 (e) | 4 | 0 | 5 | false | -1 | -1 |
| 128 (T1) | 5 | 0 | 6 | false | -1 | -1 |
| 129 (T2) | 6 | 0 | 7 | false | -1 | -1 |
| 130 (T3) | 7 | 0 | 8 | true | -1 | -1 |
| 131 (T4) | 8 | 0 | -1 | false | -1 | -1 |
The decoding process starts at the root (last entry) of the decoding tree. It reads
a bit from the input file. If the bit is 1, the decoder moves
to the right child, if 0 to the left child. The process continues
until the leaf is reached. The unencoded byte is then output. Since the
direction is down the tree from the root, the decoder must initially set
up the leftChild and rightChild entries of the table.
This can be done by stepping through the table and setting either the leftChild
or rightChild entry of the parent based on the childType
of the node.
Algorithm for makeDecodingTable (Part III of Laboratory 11)
- Read in the decoding table
  - Read file size
  - Read alphabet size
  - Calculate table size
  - For each entry in the table
    - Read the int that represents the childType, value, and parent.
    - Use this int to instantiate a HuffmanCodingEntry. (Use the second constructor.)
- Set the children pointers in the decoding table
  - For each entry in the table starting with index 0 (do not include the last one because it is the root):
    - Get the parent of this entry
    - If the current entry is a right child, set the right child of the parent to this entry
    - Otherwise set the left child of the parent to this entry
  - Set the parent of the last entry to -1.
The result of makeDecodingTable for file containing the encoded version of bcbbbbbbaacaabbcade is shown
in Figure 11.
Figure 11: The result of makeDecodingTable for the file containing the encoded version of
bcbbbbbbaacaabbcade
| value | index | weight | parent | childType | leftChild | rightChild |
|---|---|---|---|---|---|---|
| 98 (b) | 0 | 0 | 8 | false | -1 | -1 |
| 99 (c) | 1 | 0 | 6 | true | -1 | -1 |
| 97 (a) | 2 | 0 | 7 | true | -1 | -1 |
| 100 (d) | 3 | 0 | 5 | true | -1 | -1 |
| 101 (e) | 4 | 0 | 5 | false | -1 | -1 |
| 128 (T1) | 5 | 0 | 6 | false | 4 | 3 |
| 129 (T2) | 6 | 0 | 7 | false | 5 | 1 |
| 130 (T3) | 7 | 0 | 8 | true | 6 | 2 |
| 131 (T4) | 8 | 0 | -1 | false | 0 | 7 |
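The step that fills in the child columns can be sketched on plain arrays. This is an illustration under assumptions: the class and method names are invented, and parallel int and boolean arrays stand in for the HuffmanCodingEntry objects of the real table. The input arrays are the parent and childType columns of Figure 10.

```java
import java.util.Arrays;

final class ChildLinkSketch {
    /**
     * Derives the child columns from the parent and childType columns.
     * Returns a two-row array: row 0 is leftChild, row 1 is rightChild.
     */
    static int[][] linkChildren(int[] parent, boolean[] rightChild) {
        int n = parent.length;
        int[] left = new int[n];
        int[] right = new int[n];
        Arrays.fill(left, -1);
        Arrays.fill(right, -1);
        // The last entry is the root and has no parent, so it is skipped.
        for (int i = 0; i < n - 1; i++) {
            if (rightChild[i]) {
                right[parent[i]] = i;
            } else {
                left[parent[i]] = i;
            }
        }
        return new int[][] {left, right};
    }

    public static void main(String[] args) {
        int[] parent = {8, 6, 7, 5, 5, 6, 7, 8, -1};
        boolean[] rightChild = {false, true, true, true, false, false, false, true, false};
        int[][] children = linkChildren(parent, rightChild);
        System.out.println("leftChild:  " + Arrays.toString(children[0]));
        System.out.println("rightChild: " + Arrays.toString(children[1]));
    }
}
```

A single pass over the table is enough because every non-root entry records its own parent and which side it hangs on.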
To decode a Huffman-encoded bit string, start at the root of the Huffman
tree and use the input bits to determine the path to the leaf.
Algorithm for writeUnencodedFile:
For each character of the original file (the character count is stored at the start of the encoded file):
- Start at the root of the tree (entry currentTableSize - 1 in the table).
- Do
  - Read the bit.
  - If the bit is a 0 take the left branch.
  - If the bit is a 1 take the right branch.
  Until the child entries are -1 (leaf).
- Output the leaf's byte value to the file.
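The loop above can be sketched directly on the columns of Figure 11. As before, this is a simplified illustration: the class and method names are invented, a String of '0' and '1' characters stands in for the BitInputStream, and the tree is assumed to have at least two leaves so that the root is never a leaf itself.

```java
final class TableDecodeSketch {
    // Columns copied from Figure 11 (table positions 0..8).
    static final int[] VALUE = {98, 99, 97, 100, 101, 128, 129, 130, 131};
    static final int[] LEFT = {-1, -1, -1, -1, -1, 4, 5, 6, 0};
    static final int[] RIGHT = {-1, -1, -1, -1, -1, 3, 1, 2, 7};

    /** Decodes exactly originalSize characters, ignoring any padding bits after them. */
    static String decode(String bits, int originalSize) {
        StringBuilder out = new StringBuilder();
        int root = VALUE.length - 1; // the root is the last entry in the table
        int pos = 0;
        for (int i = 0; i < originalSize; i++) {
            int node = root;
            do {
                char bit = bits.charAt(pos++);
                node = (bit == '0') ? LEFT[node] : RIGHT[node];
            } while (LEFT[node] != -1); // a leaf has no children
            out.append((char) VALUE[node]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The 36 data bits of the running example followed by 4 padding bits.
        String bits = "010100000011111011111001011110011000" + "0000";
        System.out.println(decode(bits, 19));
    }
}
```

Counting characters rather than reading until the bits run out matters here: the four padding bits would otherwise be decoded as four extra copies of b.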
Classes Provided for the Implementation
- BitInputStream reads single bits, bytes and integers from a file. (These values can be interspersed, but always start on an even byte boundary.)
- BitOutputStream writes single bits, bytes and integers to a file. (These values can be interspersed, but always start on an even byte boundary.)
- HuffmanCodingEntry is used to hold the encoding for a single character in the Huffman table. This object is used in both encoding and decoding. This class implements Comparable so objects of type HuffmanCodingEntry can be put in a priority queue.
  - HuffmanCodingEntry has a constructor that takes a table position and an int and decodes the int into the object.
  - HuffmanCodingEntry also has an encode method that produces the int encoding of the entry for use by the HuffmanEncoder. The entry coding only keeps the value of the entry, the parent and the child type: the information that is needed by the HuffmanDecoder.
- HuffmanEncoder encodes a text file