Quite some time ago I did a short post about a compression technique called run-length encoding (or RLE), that was a commonly used compression method for graphic and level data. At the time I wrote and implemented it for the graphics and level data for my home-brew game Bobs Fury with quite the success, significantly reducing the disk space required for the base game.
I do however have some files left, which are largely just ASCII text, that don’t compress well or at all using that technique. I’d like to make them smaller, and of course harder to read as they contain story text, not that my game has a brilliant story, but you get the idea. Step in an encoding technique called Huffman coding.
Essentially the encoding algorithm encodes symbols as a variable length stream of bits. The most frequently used symbols are represented by shorter bit streams, whilst the least used have longer ones. Normally in a computer each symbol would be encoded as a fixed number of bits, such as say 8-bit or 16, so this results (hopefully) in shorter encodings for most symbols, and longer ones only for the rarely used ones.
The tricky part is creating the Huffman tree, which is basically a code-book representing how each symbol is encoded or decoded. Here is a quick tutorial on how they are created, which will also give you a feel for how the encoding works. It’s also commonly known as a dictionary, or code book.
A fixed tree can be used for everything, but would not do the best job for every set of data being compressed. So typically a tree is created and stored along with the encoded data to achieve the best compression possible. This of course does add some overhead, which could be a problem if the resulting encoding isn’t much shorter than the original.
Huffman coding typically works poorly when the symbols all appear in the text at roughly the same frequency. The worst case being everything with exactly the same frequency. Notably it won’t produce an encoding that is longer than the original data, although with the overhead of storing the tree, you could end up with a larger data file. In practise this rarely happens.
Other data such as English text stored in ASCII actually compresses quite well. As usually not all the 255 characters are used, most encodings will be shorter than 8 bits per symbol. Also because natural language uses some letters more than others the average encoding length for all the symbols will be shorter.
Huffman coding was actually invented quite some time ago (1951) by David Huffman, well before it came into common use. Check out the Wikipedia page for more information. It’s a part of many commonly used compression programs and formats such a Zip and Bzip. Older 8-bit machines typically weren’t powerful enough, so it wasn’t commonly used until more powerful machines with more memory became available.
It took me much longer than usual to write this post primarily because I began the process of writing an encoder and decoder, but because of the complexity, it’s taken up much more time than I expected. Currently I have just finished the encoder, but have yet to test it. I had hoped to get the code running first, but that will have to wait.
Recent Comments