© Copyright 1998 The Perl Journal. Reprinted with permission.
You are probably familiar with the Unix compress, gzip, or bzip2 utilities, or the DOS pkzip utility. These programs all make files smaller; we say that such files are compressed. Compressed files take less disk space and less network bandwidth. The downside of compressed files is that they are full of unreadable gibberish; you usually have to run another program to uncompress them before you can use them again. In this article we'll see how file compression works, and I'll show a simple module that includes functions for compressing and uncompressing files.
The idea behind data compression is very simple. In a typical file, say a text file, every character takes up the same amount of space: 8 bits. The letter e is represented by the 8 bits 01100101; the letter Z is represented by the 8 bits 01011010. But in a text file, e occurs much more frequently than Z---maybe about 75 times as frequently. If you could give the common symbols short codes and the uncommon symbols long codes, you'd have a net gain.
This isn't a new idea. It was exploited by Samuel Morse in the Morse Code, a very early digital data transmission protocol. Morse Code was designed to send text files over telegraph wires. A telegraph is very simple; it has a switch at one end, and when you close the switch, an electric current travels through a wire to the other end, where there is a relay that makes a click. By tapping the switch at one end, you make the relay at the other end click. Letters and digits are encoded as sequences of short and long clicks. A short click is called a dot, and a long click is called a dash.
The two most common letters in English text are E and T; in Morse code these are represented by a single dot and a single dash, respectively. The codes for I, A, N, and M, all common letters, are **, *-, -*, and --. In contrast, the codes for the uncommon letters Q and Z are --*- and --**.
In computer file compression, we do a similar thing. We analyze the contents of the data, and figure out which symbols are frequent and which are infrequent. Then we assign short codes to the frequent symbols and long codes to the infrequent symbols. We write out the coded version of the file, and that usually makes it smaller.
There's a problem with Morse Code: You need a third symbol, typically a long pause, to separate the dots and dashes that make up one letter from the dots and dashes that make up the next. Otherwise, if you get *-, you don't know whether it's the single letter A or the two letters ET---or it might be the first bit of the letter R or L. In a long message, all the dots and dashes run together and you get a big mess that can't be turned back into text. In Morse code, it can be hard to tell `Eugenia' from `Sofia': Without the interletter pauses, they're both:

***---**-****-
Those interletter spaces take up a lot of transmission time, and it would be nice if you didn't need them. It turns out that if you arrange the code properly, you don't. The ambiguity problem with Morse Code occurs because some codes are prefixes of others: there are pairs of letters where the complete code for one letter is the same as the code for the other, but with something extra tacked on. When you see the shorter code, you don't know if it's complete or if it should be combined with the following symbols. It turns out that if no code is a prefix of any other, then the code is unambiguous.
Suppose for simplicity that we only needed to send the letters A, C, E, and S over the telegraph. Instead of Morse code, we could use the following code table:
A   -
C   **
E   *-*
S   *--
Suppose we receive the message -*****-**--*--*-**--. What was the message? Well, the first symbol is -, so the first letter in the message must be A, because that's the only letter with a code that starts with a -. Then the next two symbols are **, so the second letter must be a C, because all the other codes that start with * have a - after the * instead of another *. Similar reasoning shows that the third letter is also C. After that, the code is *-*; it must be an E. We continue through the message, reading off one letter at a time, and eventually we get the whole thing this way.
It's so simple that a computer can decode it, if the computer is equipped with a decision tree like this one (the unlabeled nodes are decision points; the letters are outputs):

            Start
           *     -
          /       \
        ( )        A
       *   -
      /     \
     C      ( )
           *   -
          /     \
         E       S
Start at Start, and look at the symbols in the message one by one. At each stage, follow the appropriate labelled branch to the next node. If there's a letter at that node, output the letter and go back to the start node. If there's no letter at the node, look at the next symbol in the input and continue down the tree towards the leaves.
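Here's a minimal sketch of that procedure in Perl, only an illustration (the demonstration module described later in the article does the same job with slightly different details). The decision tree is stored as nested arrays, in the same form the module uses: each node is either a plain string, meaning a letter to output, or a two-element array whose 0th element is the * branch and whose 1st element is the - branch.

my $tree = [ [ 'C', [ 'E', 'S' ] ], 'A' ];       # the tree for the A, C, E, S code

sub decode {
    my ($tree, $message) = @_;                   # $message is a string of *'s and -'s
    my $node   = $tree;                          # start at the Start node
    my $output = '';
    for my $signal (split //, $message) {
        $node = $node->[ $signal eq '*' ? 0 : 1 ];   # follow the labelled branch
        unless (ref $node) {                     # a plain string: we found a letter
            $output .= $node;
            $node = $tree;                       # go back to the start node
        }
    }
    return $output;
}

print decode($tree, '-*****-**--*--*-**--'), "\n";   # prints "ACCESSES"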
Obviously, it's important to choose the right code. If Morse had made * the code for Z and **-* the code for E, he wouldn't be famous.
Choosing the right code can be tricky. Consider the example of the previous section, where we only had to code messages that contain A, C, E, and S. The code I showed is good when we expect our messages to contain more A's than E's or S's. If S were very common, we clearly could have done better; less clearly, if all four letters were about equally common, then we could still have done better, by assigning each letter a code of the same length: two symbols apiece.
Suppose, for example, our message happened to contain 200 of each of the four letters. Then the first code would use 1800 symbols (200×1 for the A's, 200×2 for the C's, and 200×3 each for the E's and S's), and the second code would use only 1600 (800 letters at 2 symbols each).
In 1952, David Huffman discovered a method for producing the optimal unambiguous code. For a given set of symbols, if you know the probability with which each symbol appears in the input, you can use Huffman's method to construct an unambiguous code that encodes the typical message with fewer *'s and -'s than any other code.
The method is very simple and ingenious. For concreteness, let's suppose that the (rather silly) message is
THE_THIRSTIEST_SISTERS_TEETH_RESIST_THIS_STRESS
(I used _ instead of space so that it'll be easier to see.)
Start with the table of relative probabilities; you can get this by counting the number of occurrences of every symbol in the message. This is called histogramming. (A histogram is a bar chart; histos is Greek for a beam or a mast.) Here's the histogram for the symbols in our sample message:
S   11
T   10
E    7
_    6
I    5
H    4
R    4
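Computing a histogram like this takes only a few lines of Perl. Here is a minimal sketch; the demonstration module described later in the article has a histogram function that does essentially the same thing, but these particular lines are just an illustration:

sub histogram {
    my ($symbols) = @_;            # reference to an array of symbols
    my %count;
    $count{$_}++ for @$symbols;    # tally each symbol
    return \%count;
}

my @symbols = split //, 'THE_THIRSTIEST_SISTERS_TEETH_RESIST_THIS_STRESS';
my $hist    = histogram(\@symbols);
print "$_ $hist->{$_}\n" for sort { $hist->{$b} <=> $hist->{$a} } keys %$hist;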
Now take the two least common entries in the table; that's H and R. They'll get the longest codes, because they're least common. We'll simplify this by pretending that H and R are the same, and lumping them together into one category, which we'll call HR. Then we'll assign codes to all the other letters and to HR. When we're done, we still have to distinguish between H and R. Now, HR has some code. We don't know what it is yet, so let's symbolize it with <HR>. We don't really need to use <HR> in our message, because there is no such thing as the letter HR, so we'll split it in two, and let the code for H be <HR>* and the code for R be <HR>-. As a result of this, the codes for H and R will be longer than the codes for the other letters, but if that has to happen, it's better for it to happen for H and R, because they are the least common letters in the message.

So we will lump H and R together and pretend temporarily that they are only one letter. Our table then looks like this:
S    11        R = <HR>-
T    10        H = <HR>*
HR    8
E     7
_     6
I     5
Now we repeat the process. The two least common symbols are I and _. We'll lump them together into a new `symbol' called I_, and we'll finish assigning the codes to S, T, HR, E, and I_. When we're done, I will get the code <I_>* and _ will get the code <I_>-.
S    11        R = <HR>-
I_   11        H = <HR>*
T    10        _ = <I_>-
HR    8        I = <I_>*
E     7
Then we lump together HR and E:
HRE  15        R = <HR>-
S    11        H = <HR>*
I_   11        _ = <I_>-
T    10        I = <I_>*
               HR = <HRE>-
               E = <HRE>*
Then we lump together T and I_:
I_T  21        R = <HR>-
HRE  15        H = <HR>*
S    11        _ = <I_>-
               I = <I_>*
               HR = <HRE>-
               E = <HRE>*
               I_ = <I_T>-
               T = <I_T>*
Then we lump together S and HRE:
SHRE 25        R = <HR>-
I_T  21        H = <HR>*
               _ = <I_>-
               I = <I_>*
               HR = <HRE>-
               E = <HRE>*
               I_ = <I_T>-
               T = <I_T>*
               S = <SHRE>-
               HRE = <SHRE>*
Now we only have two `symbols' left. There's only one way to assign a code to two symbols; one of them gets * and the other gets -. It doesn't matter which gets which, so let's say that SHRE gets * and I_T gets -.
Now the codes fall out of the table we've built up in the right-hand column:
SHRE = *            I_T = -
S    = *-           I_  = --
HRE  = **           _   = ---
HR   = **-          I   = --*
R    = **--         T   = -*
H    = **-*
E    = ***
We throw away the codes for the fictitious compound symbols, and we're left with the real code:
S = *-
T = -*
E = ***
_ = ---
I = --*
H = **-*
R = **--
As promised, the code is unambiguous, because no code is a prefix of any other code. Our original message encodes like this:
-***-****----***-*--***--*--*--*****--*---
*---**--******--*-----*******-***-*---
**--****---**--*----***-*--**----
*--***--****-*-
For a total of 128 *'s and -'s, an average of 2.72 symbols per character, and a 9.3% improvement over the 141 symbols we would have had to use if we had given every letter a three-symbol code.
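Before we look at the module, here is a minimal Perl sketch of the lumping procedure we just walked through. It is only an illustration, not the module's code, and because it's arbitrary which of two lumped symbols gets * and which gets -, the codes it produces may differ from the ones in the tables above; the lengths, and therefore the amount of compression, come out the same.

sub huffman {
    my ($hist) = @_;                           # hash: symbol => count
    # Start with one lump per symbol, and an empty code for every symbol.
    my @lumps = map { { syms => [$_], count => $hist->{$_} } } keys %$hist;
    my %code  = map { $_ => '' } keys %$hist;

    while (@lumps > 1) {
        # Take the two least common entries...
        @lumps = sort { $a->{count} <=> $b->{count} } @lumps;
        my ($first, $second) = splice @lumps, 0, 2;

        # ...and lump them together.  Every symbol in the first lump gets *
        # tacked onto the front of its code; every symbol in the second gets -.
        $code{$_} = "*$code{$_}" for @{ $first->{syms} };
        $code{$_} = "-$code{$_}" for @{ $second->{syms} };

        push @lumps, { syms  => [ @{ $first->{syms} }, @{ $second->{syms} } ],
                       count => $first->{count} + $second->{count} };
    }
    return \%code;                             # hash: symbol => code string
}

my %hist = (S => 11, T => 10, E => 7, '_' => 6, I => 5, H => 4, R => 4);
my $code = huffman(\%hist);
print "$_ = $code->{$_}\n" for sort keys %$code;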
For this article, I implemented a demonstration module that compresses files. You can retrieve it from The Perl Journal web site at http://www.tpj.com/ or from my Perl Paraphernalia web site at http://www.plover.com/~mjd/perl/Huffman/. The program htest.pl compresses an input and saves it to the file /tmp/htest.out; then it opens this file, reads in the compressed data, decompresses it, and prints the result to standard output.
Most of the real work is in the Huffman module that htest uses. Here are the important functions that htest calls:
my $hist = Huffman::histogram(\@symbols);
This generates the histogram of the input text. The histogram is the tally of the number of occurrences of each symbol. The argument to histogram is an array of symbols, passed by reference for efficiency, because it is likely to be very large. It might have been simpler to pass the input as a single string, but this way we can assign codes per-word instead of per-character if we want to, just by splitting the input into an array in a different way.
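For example, here are two hypothetical ways you might build the @symbols array before handing it to histogram; these lines are my illustration, not part of htest:

my $text = do { local $/; <STDIN> };                   # slurp the whole input

my @by_char = split //, $text;                         # one symbol per character
my @by_word = grep { length } split /(\s+)/, $text;    # words and whitespace runs

my $hist = Huffman::histogram(\@by_char);              # or \@by_word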
my $codes = Huffman::tabulate($hist);
The tabulate function generates the code table from the histogram using Huffman's method. (This has the side effect of destroying the histogram.) The code table is just a hash that maps symbols to codes; the codes themselves are strings like 0010011.
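For example, if we write the code we derived for the sample message with * as 0 and - as 1 (the same convention the decoder uses below), the code table would look something like this; the exact strings depend on how tabulate happens to break ties in the histogram, so take this only as an illustration:

my $codes = {
    S => '01',     T => '10',      E => '000',    '_' => '111',
    I => '110',    H => '0010',    R => '0011',
};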
Huffman::save_code_table(*FILE, $codes);
This writes the code table to the file.
Huffman::encode(*FILE, $codes, \@symbols);
This encodes the input text and writes the result to the file.
my $codes = Huffman::load_code_table(*FILE);
my $coded_data = <FILE>;
my $text = Huffman::decode($codes, $coded_data);
This is the decompression process. The return value from load_code_table is a code table in a rather interesting form. It's a decision tree, just like the ones we saw earlier in the article. Each node of the tree is either a single string (which is the symbol that the decoder should output) or it's an array with two elements; the 0th element is the part of the tree on the branch labeled 0, and the 1st element is the part of the tree on the branch labeled 1. For example, taking * as equivalent to 0 and - as equivalent to 1, the decision tree from earlier in the article would be represented like this:

[ [ C, [ E, S ] ], A ]
We can compress the file, but unless we include the code table in the compressed file, we won't be able to uncompress it again. Files that can't be uncompressed are not very useful. (Sometimes you don't need to be able to decompress the file; in these cases the Perl unlink function implements a much simpler and more efficient form of compression.) But sometimes the code table is bigger than the savings that we got from compressing the file, especially if the original file is small. On a sample file of comp.lang.perl.misc articles, the compressor reduced the file size by 32%, from 42733 bytes to 29114. But when I ran it on its own source code, the file size increased from 987 bytes to 1321. The compressed data was only 631 bytes long, but the code table and other overhead took up 690 bytes.
Huffman coding is optimal in the sense of compressing a given set of symbols into the smallest space possible. But if you readjust your idea of a `symbol', you can get much better compression.
Suppose you're compressing English text. If you histogram the characters, and use Huffman's method, you'll get the optimal way to encode the characters. Because each character has its own code, your code would also serve to code arbitrary random gibberish. It should be clear that this expressiveness has a price. For English text, we don't need so much expressiveness; it's clearly more efficient to assign a code to each word instead of to each character. In doing so, we lose the ability to encode random gibberish, but our compressed data becomes much smaller. Suppose there are about 2**17 words in English; if we assign each one a different 17-bit code, we can then encode our original seven-word message:
THE THIRSTIEST SISTERS TEETH RESIST THIS STRESS
into 7*17 = 119 bits, already an improvement on the 128 we used before. And if we use Huffman's method to assign long codes to rare words and short codes to common words, we can do even better.
Using this method, I compressed the comp.lang.perl.misc articles by 63%. Unfortunately, the code table blew up as a result, and took up 43739 bytes, which was larger than the original uncompressed data.
Most modern data compression doesn't use Huffman coding directly. A better method was proposed in 1977 and 1978 by Jacob Ziv and Abraham Lempel. The Lempel-Ziv methods scan the input data, and when they find a substring that occurs twice, they replace the second occurrence with a reference back to the first. The references can then be Huffman-compressed.
These methods have several advantages over the basic scheme I implemented. One important benefit of Lempel-Ziv compression methods is that they don't have to read and analyze the entire input in advance; they can construct a code table as they go along, basing it on only the portion of the input seen so far. This is important for practical applications. The data compression in your modem wouldn't be very useful if the modem had to gather and analyze a week's worth of traffic before it could send or receive anything.
Another benefit of this method is that because the code table is implicit in the input seen so far, the code table doesn't need to be recorded explicitly. The decoding process can infer the code table from the coded data itself. This avoids the embarrassing inflations that we saw where the code table took up more space than was actually saved by the compression process.
One extra payoff of building the code table on the fly is that if the algorithm notices that the code table it's using isn't performing well, it can throw it away and start over in the middle. If the data is in several pieces that have very different characteristics, say a graphic image embedded in a text file, this method will be a win over straight Huffman coding because it'll be able to apply an appropriate code table to each part of the data. LZW, the compression algorithm used by the DOS lha program, doesn't do this, but the variation of it employed by the Unix compress program does. The algorithm used by the GNU project's gzip and zlib is different but also periodically throws away the code tables and starts over.
The best thing about Lempel-Ziv and related methods is that they don't need to decide in advance how to break up the input into symbols. LZW, for example, puts any string that it hasn't seen before into the code table, under the assumption that if it appeared once, it'll probably appear again. As the input comes in, longer and longer substrings of the input go into the code table; long substrings go in only when their shorter prefixes have appeared multiple times. This means that if the file is best broken up into words, the algorithm will eventually detect that; if it is best broken up into single characters, the algorithm will detect that too.
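Here is a minimal Perl sketch of that idea: a toy LZW-style compressor that starts with the single characters in its table and adds every previously unseen string as it goes. It only illustrates how the table grows; it is not the code used by compress or gzip, and a real implementation would also pack the codes into bits.

sub lzw_compress {
    my ($input) = @_;
    my %table = map { chr($_) => $_ } 0 .. 255;     # start with single characters
    my $next_code = 256;
    my @output;
    my $seen = '';
    for my $char (split //, $input) {
        if (exists $table{ $seen . $char }) {
            $seen .= $char;                         # keep extending a known string
        }
        else {
            push @output, $table{$seen};            # emit the longest known prefix
            $table{ $seen . $char } = $next_code++; # remember the new, longer string
            $seen = $char;
        }
    }
    push @output, $table{$seen} if length $seen;
    return @output;
}

my @codes = lzw_compress('TOBEORNOTTOBEORTOBEORNOT');
print "@codes\n";    # repeated substrings come out as single codes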
If you want an easy but possibly amusing project, try altering the code table functions in the module so that the code tables are smaller. At present, the module does not assume that the coding is done per-character, so you shouldn't break that.
Actually, though, assigning the codes per-word seems to blow up the code table more than it saves. I haven't found a case yet where it's worthwhile. I'd be interested in seeing more analysis of this.
Finally, the decision-tree data structure in the decoder is probably over-clever. It would be simpler to represent the same information with a state table, this way:
        0   1
      ---------
  0     1   A
  1     C   2
  2     E   S
Each internal node in the decision tree is assigned a state, which is a number; the root node gets the number 0. To decode, the decoder keeps track of the state number, which says what internal node it's currently at. Each time it reads a bit, it looks in the table to decide where to go next. If it sees a symbol, that means it reached a leaf, and it should output that symbol and start over at state 0; if it sees a number, that means it's now at a different internal node and should read more bits. I'm really not sure whether this would be more efficient than the way I did it, but it would probably be easier to understand.
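Here is a minimal sketch of such a state-table decoder, with * written as 0 and - as 1; it's an illustration of the idea rather than the module's code:

# Each row is an internal node; a number means `go to that state',
# a letter means `output this letter and start over at state 0'.
my @table = (
    [ 1,   'A' ],    # state 0: on 0 go to state 1, on 1 output A
    [ 'C', 2   ],    # state 1: on 0 output C,      on 1 go to state 2
    [ 'E', 'S' ],    # state 2: on 0 output E,      on 1 output S
);

sub decode_with_table {
    my ($table, $bits) = @_;           # $bits is a string of 0s and 1s
    my ($state, $output) = (0, '');
    for my $bit (split //, $bits) {
        my $next = $table->[$state][$bit];
        if ($next =~ /^\d+$/) {        # another internal node: keep reading
            $state = $next;
        }
        else {                         # a leaf: output the symbol, start over
            $output .= $next;
            $state = 0;
        }
    }
    return $output;
}

# The same message we decoded by hand earlier, written in 0's and 1's:
print decode_with_table(\@table, '10000010011011010011'), "\n";   # "ACCESSES"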
Symbols, Signals, and Noise, J.R. Pierce, pp. 94--97. Harper, New York, 1961.
The comp.compression FAQ is an excellent starting source of information, especially part 2. It is available from ftp://rtfm.mit.edu/pub/usenet/news.answers/compression-faq/.
My column finally has a name! Bricolage is a hot word these days in the computer learning business; it's French for tinkering or pottering. Thanks to Deven Ullman, who suggested it, and to the other people who suggested other names.