Huffman Coding

Quite some time ago I did a short post about a compression technique called run-length encoding (or RLE), that was a commonly used compression method for graphic and level data. At the time I wrote and implemented it for the graphics and level data for my home-brew game Bobs Fury with quite the success, significantly reducing the disk space required for the base game.

I do however have some files left, which are largely just ASCII text, that don’t compress well or at all using that technique. I’d like to make them smaller, and of course harder to read as they contain story text, not that my game has a brilliant story, but you get the idea. Step in an encoding technique called Huffman coding.

Essentially the encoding algorithm encodes symbols as a variable length stream of bits. The most frequently used symbols are represented by shorter bit streams, whilst the least used have longer ones. Normally in a computer each symbol would be encoded as a fixed number of bits, such as say 8-bit or 16, so this results (hopefully) in shorter encodings for most symbols, and longer ones only for the rarely used ones.

The tricky part is creating the Huffman tree, which is basically a code-book representing how each symbol is encoded or decoded. Here is a quick tutorial on how they are created, which will also give you a feel for how the encoding works. It’s also commonly known as a dictionary, or code book.

A fixed tree can be used for everything, but would not do the best job for every set of data being compressed. So typically a tree is created and stored along with the encoded data to achieve the best compression possible. This of course does add some overhead, which could be a problem if the resulting encoding isn’t much shorter than the original.

Huffman coding typically works poorly when the symbols all appear in the text at roughly the same frequency. The worst case being everything with exactly the same frequency. Notably it won’t produce an encoding that is longer than the original data, although with the overhead of storing the tree, you could end up with a larger data file. In practise this rarely happens.

Other data such as English text stored in ASCII actually compresses quite well. As usually not all the 255 characters are used, most encodings will be shorter than 8 bits per symbol. Also because natural language uses some letters more than others the average encoding length for all the symbols will be shorter.

Huffman coding was actually invented quite some time ago (1951) by David Huffman, well before it came into common use. Check out the Wikipedia page for more information. It’s a part of many commonly used compression programs and formats such a Zip and Bzip. Older 8-bit machines typically weren’t powerful enough, so it wasn’t commonly used until more powerful machines with more memory became available.

It took me much longer than usual to write this post primarily because I began the process of writing an encoder and decoder, but because of the complexity, it’s taken up much more time than I expected. Currently I have just finished the encoder, but have yet to test it. I had hoped to get the code running first, but that will have to wait.


0 Responses to “Huffman Coding”

  1. Leave a Comment

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Blogs I Follow

Enter your email address to follow this blog and receive notifications of new posts by email.

Mister G Kids

A daily comic about real stuff little kids say in school. By Matt Gajdoš

Random Battles: my life long level grind

completing every RPG, ever.

Gough's Tech Zone

Reversing the mindless enslavement of humans by technology.

Retrocosm's Vintage Computing, Tech & Scale RC Blog

Random mutterings on retro computing, old technology, some new, plus radio controlled scale modelling.


retro computing and gaming plus a little more

Retrocomputing with 90's SPARC

21st-Century computing, the hard way


MS-DOS game reviews, retro ramblings and more...

%d bloggers like this: