Huffman Coding
A Case Study in Lossless Compression
Using Variable Length Coding

Overview

File compression, particularly for multimedia data, is widely used to reduce Internet traffic and transfer times. Two common compression formats for images are GIF and JPEG. JPEG encoding throws away information about the image, so the original image cannot be reconstructed exactly from the compressed image; it is a lossy compression technique. GIF compresses its pixel data losslessly, but it stores at most 256 colors per image, so converting a full-color image to GIF also discards information. Lossy compression can be very effective for multimedia data. JPEG encoding, for example, can reduce the size of an image by a factor of 20 or more without a noticeable loss of image quality.

Lossy compression cannot be used for text and data files, because you want to get the original file back exactly when you uncompress it. Such files require lossless compression. Many lossless compression schemes use variable length encoding. One of the earliest and most commonly used is Huffman coding.

Every file can be thought of as a sequence of bytes (values from 0 to 255). Uncompressed files use 8 bits for every byte, regardless of how often its value occurs. The idea of Huffman coding is simply to devise an encoding scheme so that the most frequently occurring byte values are represented by a short code (fewer bits) and the less frequently occurring byte values are represented by a longer code (more bits). Huffman encoding can significantly reduce the size of files in which some byte values occur much more often than others, but it requires that you be able to read and write a bit at a time.

The algorithm for Huffman coding generates a binary tree whose left and right branches are labeled by 0 and 1 respectively as shown in the diagram below. The leaves of the tree are unique bytes that appear in the file (the alphabet). The path from the root to the leaf gives the encoding of the byte represented by that node. Leaves that are close to the root have short encodings and leaves that are farther away have longer encodings. The trick is to generate a tree where the most frequently occurring byte values are placed in leaves close to the root.

Example 1: The string bcbbbbbbaacaabbcade has the Huffman tree shown in Fig. 1.


Figure 1: A Huffman tree representing bcbbbbbbaacaabbcade. The number in each node is the index of that node in the Huffman encoding table. The tree gives the following codes:
    a:  11
    b:  0
    c:  101
    d:  1001
    e:  1000

This case study implements the Huffman encoding and decoding algorithms. You should download and unzip Huffmanoriginal.zip as the starting point for this project. The Huffman encoding is not unique. The exact version of the encoding depends on how the priority queue treats ties. Be sure to set the required library of the project to the queues/classes directory in this project so that your runs agree with those in the handout.

Huffman Decoding
To decode a Huffman-encoded bit string, start at the root of the Huffman tree and use the input bits to determine the path to a leaf. When a leaf is reached, output its byte and start again at the root for the next code. This is done in the method writeUnencodedFile in HuffmanDecoder.

Exercise 1: Use the Huffman tree of Figure 1 to decode the bit string 0001101011000.
Ans: bbbabce.

The implementation details for Huffman decoding are covered later in the case study.

Huffman Encoding

The Huffman tree of Figure 1 is generated by making an initial pass through the data file to determine the number of times each character in the alphabet appears in the file. The tree can be easily generated from a table of these frequencies. The Huffman tree is constructed so that the leaves representing the most frequently appearing bytes are closest to the root of the tree.

Each character in the input file is encoded by a bit pattern. The encoding works from the leaves of the Huffman tree upward. Start at the leaf node representing the next character (byte) in the original file. Follow the path up to the root to get the encoding. In order to implement the strategy efficiently, you will need a parent pointer representation of the tree. The bits come off in reverse order, so you will need to do the traversal recursively, save the bits on a stack, or prepend rather than append the bits to a string.
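The leaf-to-root traversal can be sketched in a few lines. The following is only an illustration, not the code supplied with the project: the class name ParentPointerEncoding and the method codeFor are hypothetical, and the two arrays are hand-copied parent links and right-child flags for the tree of Figure 1, using the node numbers shown in that figure (0 = b, 1 = c, 2 = a, 3 = d, 4 = e, 8 = root).

```java
class ParentPointerEncoding {
    // Assumed data: parent index and right-child flag for nodes 0..8 of the
    // example tree; -1 marks the root, which has no parent.
    static final int[] PARENT = {8, 6, 7, 5, 5, 6, 7, 8, -1};
    static final boolean[] RIGHT_CHILD =
        {false, true, true, true, false, false, false, true, false};

    // Climb from the leaf to the root. Each edge contributes one bit, and the
    // bit is prepended because the edges are visited in reverse order.
    static String codeFor(int leafIndex) {
        StringBuilder bits = new StringBuilder();
        int node = leafIndex;
        while (PARENT[node] != -1) {
            bits.insert(0, RIGHT_CHILD[node] ? '1' : '0');
            node = PARENT[node];
        }
        return bits.toString();
    }
}
```

For example, codeFor(3) climbs from d through nodes 5, 6, and 7 to the root and yields 1001, the code listed for d in Figure 1.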

Exercise 2: Encode the file containing bcbbbbbbaacaabbcade using the Huffman tree above.
Ans: The answer is 010100000011111011111001011110011000. In fact, the Huffman tree was built from this file.

Algorithm for Huffman Encoding

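Step I: Generate the table of frequencies
Make one pass through the original file to determine the unique bytes that appear in the file (the alphabet) and the number of times each one occurs.

Step II: Build the encoding tree
Put a leaf node for each unique byte into a priority queue ordered by weight (frequency). Repeatedly remove the two nodes of smallest weight, make them the children of a new node whose weight is the sum of their weights, and add the new node to the queue. Stop when only one node, the root, remains.

Step III: Encoding the characters
For each character in the input file, find the leaf node corresponding to that character and then traverse the tree to its root, prepending the bits for that character.

The above algorithm is called a greedy algorithm because it always picks the easiest (smallest weight) nodes. It can be proved that this greedy algorithm always results in an optimal tree, but this proof is a topic for Analysis of Algorithms.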
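The greedy step can be tried out without building a tree at all. The sketch below is an illustration only and is not part of the project code: GreedyMergeCost and encodedBits are hypothetical names, and it uses java.util.PriorityQueue in place of the project's queue classes. It relies on the fact that every merge adds one bit to the code of each character beneath the new node, so the sum of the merged weights equals the number of bits in the encoded data.

```java
import java.util.PriorityQueue;

class GreedyMergeCost {
    // Returns the number of data bits a Huffman code needs for the given
    // character counts (the table written at the start of the file is not counted).
    static long encodedBits(int[] weights) {
        PriorityQueue<Long> queue = new PriorityQueue<>();
        for (int w : weights) {
            queue.add((long) w);
        }
        long total = 0;
        while (queue.size() > 1) {
            // Greedy choice: always combine the two smallest weights.
            long merged = queue.poll() + queue.poll();
            total += merged;
            queue.add(merged);
        }
        return total;
    }
}
```

For the counts of bcbbbbbbaacaabbcade (b = 9, c = 3, a = 5, d = 1, e = 1) the merged weights are 2, 5, 10, and 19, so encodedBits returns 36. The unencoded file uses 19 * 8 = 152 bits.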

Implementation of the encoding process

To implement the Huffman encoding algorithm, we will need a tree representation with parent pointers. One possibility is to modify the BinaryTreeNode to include a parent pointer. However, the Huffman encoding tree has a special form. We will show later that if the string has n unique letters, the Huffman encoding tree has 2*n - 1 nodes. For trees of bounded size, an array representation of the tree is very efficient. If we are only encoding documents that contain the standard ASCII character set, we can have at most 128 unique letters (corresponding to byte values 0 through 127). Therefore the Huffman encoding tree for ASCII documents has at most 2*128 - 1 = 255 nodes and always fits into a table of size 255.

Figure 2 represents the Huffman encoding tree of Figure 1 in an array. The parent column gives the index of the parent in the encoding tree. The childType is true if this node is a right child of its parent and false otherwise. The leftChild and rightChild are used later in the decoding.

Figure 2: The Huffman encoding table for the tree of Figure 1.
value     index  weight  parent  childType  leftChild  rightChild
98 (b)    0      9       8       false      -1         -1
99 (c)    1      3       6       true       -1         -1
97 (a)    2      5       7       true       -1         -1
100 (d)   3      1       5       true       -1         -1
101 (e)   4      1       5       false      -1         -1
128 (T1)  5      2       6       false      -1         -1
129 (T2)  6      5       7       false      -1         -1
130 (T3)  7      10      8       true       -1         -1
131 (T4)  8      19      -1      false      -1         -1

One of the difficulties with the table representation is that during encoding, we have to do a linear search for each letter in the input string in order to find its position in the Huffman encoding table. We will use the inverted table as shown in Figure 3 to speed up the lookup process. The index entry for each ASCII character appears in the same index position as its ASCII code. For example, 'b' has ASCII code 98, so it appears in position 98 of the inverted table. The entry in position 98 is 0, corresponding to the index into the encoding table. We only need entries for the possible input characters, not for the temporary nodes in the Huffman tree, so the inverted table only has 128 entries. Databases frequently use inverted tables to speed lookup.

Figure 3: The index or inverted table for the Huffman encoding table of Figure 2.
Index     Entry (index in Huffman encoding table)
0         -1
1         -1
2         -1
...       ...
97 (a)    2
98 (b)    0
99 (c)    1
100 (d)   3
101 (e)   4
...       ...
127       -1
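The relationship between the encoding table and the inverted table can be sketched as follows. This is a simplified illustration, not the project's code: the class InvertedTableSketch and its field names are hypothetical, it works on an in-memory String instead of a file, and it keeps the table columns in plain arrays instead of HuffmanCodingEntry objects. It assumes ASCII input; a character above 127 would cause an array index error.

```java
import java.util.Arrays;

class InvertedTableSketch {
    static final int MAX_CHARS = 128;            // ASCII only, as in the text
    final int[] inverted = new int[MAX_CHARS];   // character code -> table index, or -1
    final int[] value = new int[MAX_CHARS];      // table index -> character code
    final int[] weight = new int[MAX_CHARS];     // table index -> occurrence count
    int size = 0;                                // unique characters seen so far

    InvertedTableSketch(String text) {
        Arrays.fill(inverted, -1);
        for (char ch : text.toCharArray()) {
            if (inverted[ch] == -1) {            // first occurrence: add a table entry
                value[size] = ch;
                inverted[ch] = size++;
            }
            weight[inverted[ch]]++;              // constant-time lookup, no search
        }
    }
}
```

Built from bcbbbbbbaacaabbcade, the sketch assigns table indices in order of first appearance, so inverted[98] is 0, inverted[99] is 1, and inverted[97] is 2, as in Figure 3.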

Format for the Huffman Table

Encoding and decoding will be done using an encoding table implemented as an array of HuffmanCodingEntry objects that have the following fields: value, index, weight, parent, childType, leftChild, and rightChild. For encoding, only the value, weight, parent and childType entries are relevant.

Initial scan to compute character counts (Step I of the encoding algorithm)

The computeCharacterCounts method of Figure 4 makes the initial pass through the document to compute the character counts (Step I of the algorithm). Before computeCharacterCounts is called, the childType of each entry is initialized to false and the parent, leftChild and rightChild values of each entry are initialized to -1. The currentTableSize gives the number of unique characters encountered so far in the document.

Figure 4: Code to implement Step I of the Huffman encoding algorithm
   private void computeCharacterCounts() throws IOException {
      FileInputStream f = null; // could also use BitInputStream here
      try { // we catch and rethrow the exception to be sure to close file
         f = new FileInputStream(unencodedFileName);
         int thisByte;
         while ( (thisByte = f.read()) != -1) {
            insertInTable(thisByte, 1);
            unencodedFileSize++;
         }
      } catch (Exception e) {
         throw new IOException("Exception " + e.getMessage() +
                               " reading " + unencodedFileName);
      } finally { // always want to close the file no matter what
         if (f != null)
            f.close();
      }
      alphabetSize = currentTableSize; // now have the unique characters
   }

   private void insertInTable(int v, int weight) {
      if (invertedTable[v] == -1) { // if v is not in the table add it
         encodingTable[currentTableSize] =
               new HuffmanCodingEntry(v, currentTableSize, weight, -1, false);
         invertedTable[v] = currentTableSize;
         currentTableSize++;
      } else // add weight to the total weight of the character v
         encodingTable[invertedTable[v]].addWeight(weight);
   }

Constructing the non-leaf portion of the encoding tree (Step II of the Huffman encoding algorithm)

After a table of character frequencies has been computed, we need to construct a binary encoding tree in the table by setting the parent entries. The leaf nodes are just the entries in the table above. Each nonleaf node has a weight that is the sum of the weights of its children. The tree is constructed so that any node at a particular level in the tree has a weight at least as great as that of any node at a lower level in the tree.

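Exercise 3: After the table of frequencies has been computed, the entries are added to a priority queue. What will that queue contain?
Ans: Each line lists the value, index, weight, parent, childType, leftChild, and rightChild of one entry, smallest weight first.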
e, 4, 1, -1, false, -1, -1
d, 3, 1, -1, false, -1, -1
c, 1, 3, -1, false, -1, -1
a, 2, 5, -1, false, -1, -1
b, 0, 9, -1, false, -1, -1

Exercise 4: After the first two elements of the priority queue have been removed and a new node whose weight is the sum of their weights is added, what will the priority queue look like?
Ans: After removing the two smallest elements (e and d) and creating a new element, T1, whose weight is 2, the priority queue contains:
T1, 5, 2, -1, false, -1, -1
c, 1, 3, -1, false, -1, -1
a, 2, 5, -1, false, -1, -1
b, 0, 9, -1, false, -1, -1

Exercise 5: What does the encoding table look like after the operations of Exercise 4 have been performed?
Ans: A new node containing T1 has been added as shown in Figure 5. Notice that the parent and childType fields of the objects representing d and e have been modified to reflect the fact that d and e are now children of T1.

Figure 5: The Huffman Table after the nodes containing d and e have been processed.
value     index  weight  parent  childType  leftChild  rightChild
98 (b)    0      9       -1      false      -1         -1
99 (c)    1      3       -1      false      -1         -1
97 (a)    2      5       -1      false      -1         -1
100 (d)   3      1       5       true       -1         -1
101 (e)   4      1       5       false      -1         -1
128 (T1)  5      2       -1      false      -1         -1

Exercise 6: Suppose the priority queue is now in the state specified in the answer to Exercise 4. What does the queue look like after the two smallest elements from the priority queue (T1 and c) have been removed and combined?
Ans: The priority queue is now:
T2, 6, 5, -1, false, -1, -1
a, 2, 5, -1, false, -1, -1
b, 0, 9, -1, false, -1, -1

Exercise 7: What is the status of the encoding table after the operations of Exercise 6 have been completed?
Ans: The encoding table is shown in Figure 6.

Figure 6: The encoding table after the operations of Exercise 6 have been completed.
value     index  weight  parent  childType  leftChild  rightChild
98 (b)    0      9       -1      false      -1         -1
99 (c)    1      3       6       true       -1         -1
97 (a)    2      5       -1      false      -1         -1
100 (d)   3      1       5       true       -1         -1
101 (e)   4      1       5       false      -1         -1
128 (T1)  5      2       6       false      -1         -1
129 (T2)  6      5       -1      false      -1         -1

Exercise 8: Suppose the priority queue is now at the state specified in the answer to Exercise 6. What does the queue look like after the two smallest elements from the priority queue (T2 and a) have been removed and combined?
Ans: The priority queue is now:
b, 0, 9, -1, false, -1, -1
T3, 7, 10, -1, false, -1, -1

Exercise 9: What is the status of the encoding table after the operations of Exercise 8 have been completed?
Ans: The encoding table is shown in Figure 7.

Figure 7: The encoding table after the operations of Exercise 8 have been completed.
value     index  weight  parent  childType  leftChild  rightChild
98 (b)    0      9       -1      false      -1         -1
99 (c)    1      3       6       true       -1         -1
97 (a)    2      5       7       true       -1         -1
100 (d)   3      1       5       true       -1         -1
101 (e)   4      1       5       false      -1         -1
128 (T1)  5      2       6       false      -1         -1
129 (T2)  6      5       7       false      -1         -1
130 (T3)  7      10      -1      false      -1         -1

Exercise 10: Suppose the priority queue is now at the state specified in the answer to Exercise 8. What does the queue look like after the two smallest elements from the priority queue (b and T3) have been removed and combined?
Ans: The priority queue is now:
T4, 8, 19, -1, false, -1, -1

Exercise 11: What is the status of the Huffman coding table after the operations of Exercise 10 have been completed?
Ans: The completed encoding table is shown in Figure 2.

Exercise 12: Suppose the priority queue is now in the state specified in the answer to Exercise 10. What happens when we attempt to remove two elements?
Ans: Since the queue contains only one element, the removeMin operation for the second element throws a NoSuchElementException and the algorithm stops. Note: the selection of which node is the right child and which is the left child is arbitrary.

Exercise 13: What are the character encodings generated by the encoding table of Figure 2?
Ans:
98 (b):    0
99 (c):    101
97 (a):    11
100 (d):   1001
101 (e):   1000

Exercise 14: If there are n unique characters in the alphabet, how many entries will there be in the encoding table?
Ans: The priority queue starts with n entries corresponding to the unique characters. At each step, 2 entries are removed and a new temporary node is created and enqueued, so the priority queue has one fewer element after each step. The process stops when one entry remains, so it is repeated n - 1 times and creates n - 1 temporary nodes. The total number of nodes in the tree is therefore n + (n - 1), or 2*n - 1.

Figure 8 shows the implementation of Step II of the Huffman encoding algorithm.

Figure 8: Implementation of Step II of the Huffman encoding algorithm
   private void buildTree() {
      int nextInternalValue = MAXIMUM_UNIQUE_CHARACTERS; // 128 
      PriorityQueueADT q = new LinkedPriorityQueue();
      for (int i = 0; i < alphabetSize; i++)
         q.add(encodingTable[i]);
      try {
         while (true) {
            HuffmanCodingEntry T1 = (HuffmanCodingEntry) (q.removeMin());
            HuffmanCodingEntry T2 = (HuffmanCodingEntry) (q.removeMin());
            T1.setParent(currentTableSize);
            T2.setParent(currentTableSize);
            T2.setChildType(true);
            insertInTable(nextInternalValue, T1.getWeight() + T2.getWeight());
            q.add(encodingTable[currentTableSize - 1]);
            nextInternalValue++;
         }
      } catch (NoSuchElementException e) { // expected to happen at end
      }
   }
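The same construction can be sketched in a self-contained form for experimenting outside the project. This is not the handout's implementation: BuildTreeSketch and build are hypothetical names, the table columns are plain arrays, and java.util.PriorityQueue stands in for LinkedPriorityQueue. Ties are broken by taking the larger table index first; that rule is an assumption chosen only because it reproduces the tables shown in the exercises above. The loop tests the queue size instead of waiting for an exception.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

class BuildTreeSketch {
    // The first n entries of weight[] are the leaf weights. The arrays must have
    // room for 2n - 1 entries. Fills parent[] and rightChild[], stores the weights
    // of the internal nodes, and returns the number of table entries in use.
    static int build(int n, int[] weight, int[] parent, boolean[] rightChild) {
        // Smallest weight first; on equal weights, the larger index first (assumed rule).
        PriorityQueue<Integer> q = new PriorityQueue<>((x, y) ->
            weight[x] != weight[y] ? Integer.compare(weight[x], weight[y])
                                   : Integer.compare(y, x));
        Arrays.fill(parent, -1);
        for (int i = 0; i < n; i++) {
            q.add(i);
        }
        int size = n;
        while (q.size() > 1) {
            int left = q.poll();     // first removed becomes the left child
            int right = q.poll();    // second removed becomes the right child
            parent[left] = size;
            parent[right] = size;
            rightChild[right] = true;
            weight[size] = weight[left] + weight[right];
            q.add(size++);           // the new internal node goes back in the queue
        }
        return size;
    }
}
```

Called with n = 5 and leaf weights 9, 3, 5, 1, 1 in arrays of length 9, the sketch returns 9 and leaves parent as 8, 6, 7, 5, 5, 6, 7, 8, -1, which matches the parent column of Figure 2.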

Format for the encoded file

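Each entry in the encoding/decoding table is represented by a HuffmanCodingEntry object. The HuffmanEncoder computes the encoding table from the original file. The HuffmanDecoder does not have access to the original file, so it must read the encoding table from the encoded file. The format for representing the encoding table at the beginning of the encoded file, as written by the code of Figure 9, is: an int giving the size of the unencoded file in bytes, an int giving the alphabet size, and then one int for each entry of the encoding table (including the temporary nodes), as produced by that entry's encode method. The encoded bits of the file follow the table.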

Figure 9 shows the implementation of the code of Step III of the Huffman encoding algorithm.

Figure 9: Code to implement Step III of the encoding algorithm

   private void writeEncodedFile() throws IOException {
      BitInputStream f = null;
      try {
         f = new BitInputStream(unencodedFileName);
         encodedFile.writeInt(unencodedFileSize);
         encodedFile.writeInt(alphabetSize);
         for (int i = 0; i < currentTableSize; i++)
            encodedFile.writeInt(encodingTable[i].encode());
         // Read one character at a time, encode it, and write to file
         int thisChar = 0;
         for (int j = 0; j < unencodedFileSize; j++) {
            if ( (thisChar = f.readByte()) == -1)
               throw new IOException("Unexpected end of file reading character "
                                     + j + " in " + unencodedFileName);
            writeEncodedChar(thisChar, encodedFile);
         }
         encodedFile.flush();
      } catch (Exception e) {
         throw new IOException("Error encoding file: " + e.getMessage());
      } finally { // No matter what we want to release system resources
         encodedFile.close();
         if (f != null)
            f.close();
      }
   }

   private void writeEncodedChar(int thisChar, BitOutputStream bout) throws IOException {
      int theEntry = invertedTable[thisChar]; // Find the position in the table
      int theParent = encodingTable[theEntry].getParent();
      if (theParent == -1)
         return;
      writeEncodedChar(encodingTable[theParent].getValue(), bout);
      bout.writeBit(encodingTable[theEntry].isRightChild());
   }
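Figure 9 relies on a BitOutputStream to write one bit at a time. That class is supplied with the project and is not shown here, so the following is only a minimal sketch of how such bit packing might work. BitPackerSketch, writeBit, and finish are hypothetical names, and both the most-significant-bit-first order and the zero padding of the last byte are assumptions, not a description of the project's class.

```java
import java.io.ByteArrayOutputStream;

class BitPackerSketch {
    private final ByteArrayOutputStream out = new ByteArrayOutputStream();
    private int current = 0;   // bits collected for the byte in progress
    private int count = 0;     // number of bits collected so far (0..7)

    // Appends one bit; a full byte is emitted after every eighth bit.
    void writeBit(boolean bit) {
        current = (current << 1) | (bit ? 1 : 0);
        if (++count == 8) {
            out.write(current);
            current = 0;
            count = 0;
        }
    }

    // Pads the last partial byte with zero bits and returns all bytes written.
    byte[] finish() {
        while (count != 0) {
            writeBit(false);
        }
        return out.toByteArray();
    }
}
```

Under these assumptions, the 36 bits of Exercise 2 occupy 5 bytes, the last of which carries 4 padding bits. Padding bits look like ordinary code bits, which is one reason for storing the unencoded file size at the start of the encoded file: the decoder can stop after producing that many characters.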


Implementing the Huffman Decoding Process
Laboratory 11

The HuffmanDecoder uses the same coding table as the HuffmanEncoder. The encoded file contains an int that represents the childType, value, and parent of each entry. The table that is read for the bcbbbbbbaacaabbcade example is shown in Figure 10. Notice that all of the weight values are zero, since the HuffmanDecoder doesn't need them.

Figure 10: The table read for decoding bcbbbbbbaacaabbcade.
value     index  weight  parent  childType  leftChild  rightChild
98 (b)    0      0       8       false      -1         -1
99 (c)    1      0       6       true       -1         -1
97 (a)    2      0       7       true       -1         -1
100 (d)   3      0       5       true       -1         -1
101 (e)   4      0       5       false      -1         -1
128 (T1)  5      0       6       false      -1         -1
129 (T2)  6      0       7       false      -1         -1
130 (T3)  7      0       8       true       -1         -1
131 (T4)  8      0       -1      false      -1         -1

The decoding process starts at the root (last entry) of the decoding tree. It reads a bit from the input file. If the bit is 1, the decoder moves to the right child; if it is 0, the decoder moves to the left child. The process continues until a leaf is reached. The unencoded byte is then output. Since the direction is down the tree from the root, the decoder must initially set up the leftChild and rightChild entries of the table. This can be done by stepping through the table and setting either the leftChild or rightChild entry of the parent based on the childType of the node.

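Algorithm for makeDecodingTable (Part III of Laboratory 11)

For each entry in the table that has a parent (a parent value other than -1): if the childType of the entry is true, set the rightChild of its parent to the index of the entry; otherwise, set the leftChild of its parent to the index of the entry.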

The result of makeDecodingTable for the file containing the encoded version of bcbbbbbbaacaabbcade is shown in Figure 11.

Figure 11: The result of makeDecodingTable for the file containing the encoded version of bcbbbbbbaacaabbcade
value     index  weight  parent  childType  leftChild  rightChild
98 (b)    0      0       8       false      -1         -1
99 (c)    1      0       6       true       -1         -1
97 (a)    2      0       7       true       -1         -1
100 (d)   3      0       5       true       -1         -1
101 (e)   4      0       5       false      -1         -1
128 (T1)  5      0       6       false      4          3
129 (T2)  6      0       7       false      5          1
130 (T3)  7      0       8       true       6          2
131 (T4)  8      0       -1      false      0          7
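The step from Figure 10 to Figure 11 can be sketched on plain arrays. This is one possible way to derive the child links and is not the laboratory solution: DecodingTableSketch and makeChildLinks are hypothetical names, and the real method works on HuffmanCodingEntry objects instead of arrays.

```java
import java.util.Arrays;

class DecodingTableSketch {
    // Derives the leftChild and rightChild columns from the parent and childType
    // columns. Returns a two-row array: row 0 is leftChild, row 1 is rightChild.
    static int[][] makeChildLinks(int[] parent, boolean[] isRightChild) {
        int[] left = new int[parent.length];
        int[] right = new int[parent.length];
        Arrays.fill(left, -1);                   // -1 means "no child" (a leaf)
        Arrays.fill(right, -1);
        for (int i = 0; i < parent.length; i++) {
            if (parent[i] == -1) {
                continue;                        // the root has no parent to update
            }
            if (isRightChild[i]) {
                right[parent[i]] = i;
            } else {
                left[parent[i]] = i;
            }
        }
        return new int[][] {left, right};
    }
}
```

Given the parent and childType columns of Figure 10, the sketch produces the leftChild and rightChild columns of Figure 11. A single pass over the table is enough because every entry records its own parent.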

To decode a Huffman-encoded bit string, start at the root of the Huffman tree and use the input bits to determine the path to the leaf.

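Algorithm for writeUnencodedFile:

For each character in the encoded file: start at the root (the last entry of the table); read one bit at a time, moving to the right child on a 1 and to the left child on a 0, until a leaf is reached; then write the value stored in that leaf to the unencoded file.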
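The walk from the root to a leaf can be sketched as follows. This is an illustration under simplifying assumptions, not the project's writeUnencodedFile: DecodeWalkSketch and decode are hypothetical names, the bits arrive as a String of '0' and '1' characters instead of a bit stream, and the value, leftChild, and rightChild arrays are hand-copied for the bcbbbbbbaacaabbcade example tree.

```java
class DecodeWalkSketch {
    // Assumed data for the example tree: entries 0..4 are the leaves b, c, a, d, e;
    // entries 5..8 are the temporary nodes, and entry 8 is the root.
    static final int[] VALUE = {'b', 'c', 'a', 'd', 'e', 128, 129, 130, 131};
    static final int[] LEFT = {-1, -1, -1, -1, -1, 4, 5, 6, 0};
    static final int[] RIGHT = {-1, -1, -1, -1, -1, 3, 1, 2, 7};
    static final int ROOT = 8;

    // Decodes exactly symbolCount characters and ignores any bits that follow.
    static String decode(String bits, int symbolCount) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        for (int k = 0; k < symbolCount; k++) {
            int node = ROOT;
            while (LEFT[node] != -1) {           // temporary nodes have two children
                node = bits.charAt(pos++) == '1' ? RIGHT[node] : LEFT[node];
            }
            out.append((char) VALUE[node]);      // reached a leaf: output its value
        }
        return out.toString();
    }
}
```

For example, decode("0001101011000", 7) returns bbbabce, the answer to Exercise 1. Passing the character count explicitly mirrors the unencoded file size stored at the start of the encoded file.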

Classes Provided for the Implementation