Posts Tagged ‘Pascal’

29 Jan 16

Huffman Coding – Compression results

Before the Christmas break I started writing an encoding/compression library based on a technique called Huffman coding. I’ve since completed and tested the encoder and decoder, so today we’ll discuss the results of several compression tests in comparison to the RLE encoding technique. Here are links back to the posts for Huffman Coding and Run-Length Encoding.

So I’ve collected a few different types of data files with different characteristics. First up is an MS-DOS executable; these binary files typically don’t contain runs of data and will tend to contain most if not all symbols. Next is an ASCII text file, which will typically only use alphabetic, numeric and punctuation characters. Lastly there are a small number of graphics files in raw bitmap form; these particular images are fairly basic (only a handful of colours).

Here’s a table showing the results.

Compression ratio table

You’ll note that the Huffman encoder achieves better results in every example! Its worst result was the MS-DOS executable, which is of course expected. I was pleasantly surprised that the small graphic files compressed quite well, likely because of how few symbols are in these files. If there had been a large variety of symbols, the dictionary and the encoded data would have been much larger.

Run-length encoding was obviously much worse, and in the case of the ASCII text and executable data it actually increased the size of the data! Knowing how it works, this is hardly surprising. I did expect RLE to perform better than Huffman coding for the graphic data, but I suspect the low colour count in these files is an influencing factor.

So I thought I’d create a graphic data file with a higher colour count that would favour RLE, to see if it could do better under specific conditions. The graphic I created is a 32×32 pixel sprite with horizontal coloured stripes, each a unique colour. The size of the raw file came out at 1026 bytes.

The Huffman encoder produced a file of 766 bytes and the RLE encoder one of just 68 bytes! The RLE result makes sense: each of the 32 single-colour rows collapses to a count/value pair, roughly 64 bytes plus a small header. Whilst the Huffman encoder still managed to compress the data, it couldn’t match the best case scenario for RLE.

There are good reasons to use both compression techniques. Run-length encoding doesn’t require much CPU power to encode or decode, so it can be done on very weak machines (even old 8-bit machines), but it only compresses data well when there are many runs in it. This turns out to be useful for graphics and level data. Huffman coding will pretty much always reduce the size of the data encoded, sometimes significantly, but it is much more complex. It requires more RAM and CPU power to achieve this result, so it can’t be used on every machine, although most modern processors would have no trouble at all.
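To make the trade-off concrete, here’s a minimal run-length encoder sketch of my own; an illustration of the technique, not the actual format my library uses. It assumes dst is an untyped file opened with a record size of 1. You can also see the worst case in it: data with no runs at all comes out at double the size, which is exactly what happened to the text and executable above.

type
     TBuffer = array[0..65519] of byte;

procedure rleEncode(var src: TBuffer; len: word; var dst: file);
var
   i: word;
   count: byte;
begin
     i := 0;
     while i < len do
     begin
          count := 1;
          {extend the run while the next byte matches, up to 255}
          while (count < 255) and (longint(i) + count < len) and
                (src[i + count] = src[i]) do
               inc(count);
          blockwrite(dst, count, 1);    {run length}
          blockwrite(dst, src[i], 1);   {run value}
          inc(i, count);
     end;
end;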

I’ve made the code for the Huffman encoder/decoder available in Pascal for download. Be aware the encoder still has some debug code in it that will write to the screen.

30 Nov 15

Huffman Coding

Quite some time ago I did a short post about a compression technique called run-length encoding (or RLE), which was a commonly used compression method for graphic and level data. At the time I wrote and implemented it for the graphics and level data of my home-brew game Bob’s Fury, with quite a lot of success, significantly reducing the disk space required for the base game.

I do however have some files left over, largely just ASCII text, that don’t compress well (or at all) using that technique. I’d like to make them smaller, and of course harder to read, as they contain story text. Not that my game has a brilliant story, but you get the idea. Enter an encoding technique called Huffman coding.

Essentially the encoding algorithm encodes symbols as variable-length streams of bits. The most frequently used symbols are represented by shorter bit streams, whilst the least used have longer ones. Normally a computer encodes each symbol as a fixed number of bits, such as 8 or 16, so this results (hopefully) in shorter encodings for most symbols, and longer ones only for the rarely used.

The tricky part is creating the Huffman tree, which is basically a code book representing how each symbol is encoded or decoded; it’s also commonly known as a dictionary. Here is a quick tutorial on how they are created, which will also give you a feel for how the encoding works.
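As a quick worked example (the exact codes depend on how ties are broken): take the string “abracadabra”, with symbol counts a:5, b:2, r:2, c:1, d:1. Repeatedly merging the two least frequent nodes might yield the codes a=0, b=10, r=110, c=1110 and d=1111. The encoded text then needs 5×1 + 2×2 + 2×3 + 1×4 + 1×4 = 23 bits, versus 88 bits for the original at 8 bits per symbol.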

A fixed tree can be used for everything, but would not do the best job for every set of data being compressed. So typically a tree is created and stored along with the encoded data to achieve the best compression possible. This of course does add some overhead, which could be a problem if the resulting encoding isn’t much shorter than the original.

Huffman coding typically works poorly when the symbols all appear in the text at roughly the same frequency, the worst case being everything with exactly the same frequency: with 256 equally frequent symbols the tree is perfectly balanced and every code is 8 bits long, no shorter than the original. Notably it won’t produce an encoding that is longer than the original data, although with the overhead of storing the tree you could still end up with a larger data file. In practice this rarely happens.

Other data, such as English text stored in ASCII, actually compresses quite well. Since usually not all 256 possible byte values are used, most encodings will be shorter than 8 bits per symbol. And because natural language uses some letters much more than others, the average encoding length across all the symbols will be shorter.

Huffman coding was actually invented quite some time ago (1951) by David Huffman, well before it came into common use. Check out the Wikipedia page for more information. It’s a part of many commonly used compression programs and formats such as Zip and Bzip. Older 8-bit machines typically weren’t powerful enough, so it wasn’t commonly used until more powerful machines with more memory became available.

It took me much longer than usual to write this post, primarily because I began writing an encoder and decoder to go with it, and the complexity meant it has taken up much more time than I expected. Currently I have just finished the encoder, but have yet to test it. I had hoped to get the code running first, but that will have to wait.

06 Apr 15

Creating a benchmark: Part 4

Last time I used the in-line assembler to improve the speed of the sprite blitting functions, with lots of success. Other functions such as line drawing however still suffered speed issues on systems without a built-in FPU, such as the 386sx.

The reason for the slow line drawing was simple: I had used a slope-based algorithm that relied on floating point. On a system with an FPU this was quite fast, about as fast as anything else, but obviously it wasn’t going to cut the mustard on the old processors I’m targeting. So I decided to give fixed point arithmetic a go.

Fixed point numbers are a variation on basic integers, using some of the bits to represent the fractional part. You can then use integer instructions to do some basic math that involves fractions. This is much faster for processors lacking hardware floating point support. Check out more details of fixed point here or at Wikipedia. It’s a very common technique for embedded systems, but was also used in games such as Doom for speed.
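As a minimal sketch of the idea in Pascal (using an 8.8 split here purely for illustration):

const
     FRAC = 8;                      {8 fractional bits, so 1.0 = 256}
var
     slope, ypos, x: integer;
begin
     slope := (1 shl FRAC) div 3;   {roughly 1/3, stored as 85/256}
     ypos := 100 shl FRAC;          {start the line at y = 100}
     for x := 0 to 9 do
     begin
          {plot(x, ypos shr FRAC) - shifting recovers the integer part}
          ypos := ypos + slope;     {a plain integer add steps along the line}
     end;
end.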

I however ran into a problem. Because the integer type in Turbo Pascal is 16 bits, I didn’t have enough bits to address all the pixels and still have enough precision for the fractional part of the slope. For the integer part I needed a range of 0-319, requiring 9 bits unsigned. That left 7 bits for the fractional part, with the smallest representable fraction being 1/128th. The smallest conceivable slope in 320×200 is 1/320, which is obviously much smaller. I could have switched to using longints, or stored the fractional part separately, but either would have added extra overhead and made line drawing slower.

I set the line drawing problem aside for the moment to look for a quick circle drawing algorithm, something I hadn’t implemented yet. The SWAG archive came up with suitable options, but also had some algorithms for line drawing, in particular the Bresenham line drawing algorithm.

Not having coded my own low-level line drawing routines before, I hadn’t heard of it, but it uses all-integer math in a clever way to produce the correct slope. I modified the algorithm I found to reduce the amount of calculation for the screen address when plotting points in the line. The resulting code was slightly faster than my previous line drawing code, but not overwhelmingly so.
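For reference, the core of the algorithm looks something like the following; a generic sketch rather than my exact code, with putPixel standing in for whatever low-level plotting routine is available:

procedure drawLine(x1, y1, x2, y2: integer);
var
   dx, dy, sx, sy, err, e2: integer;
   done: boolean;
begin
     dx := abs(x2 - x1);
     dy := abs(y2 - y1);
     if x1 < x2 then sx := 1 else sx := -1;
     if y1 < y2 then sy := 1 else sy := -1;
     err := dx - dy;                {the running error term}
     done := false;
     repeat
          putPixel(x1, y1);
          if (x1 = x2) and (y1 = y2) then
               done := true
          else
          begin
               e2 := 2 * err;
               if e2 > -dy then     {error says step along x}
               begin
                    err := err - dy;
                    inc(x1, sx);
               end;
               if e2 < dx then      {error says step along y}
               begin
                    err := err + dx;
                    inc(y1, sy);
               end;
          end;
     until done;
end;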

Returning to implementing circle drawing, the algorithms I found unfortunately used floating point math. They all use the basic formula for a circle, x^2 + y^2 = r^2, which requires a square root function to calculate the co-ordinates, and that is unfortunately one of the more expensive floating point operations.

One algorithm I looked at used the square root function, but rounded it to an integer immediately. This got me thinking: if I could implement an approximation of square root using integer arithmetic, I could draw circles quickly. So after a bit of research I wrote exactly that. Code follows.

function intSqrt(num: word): word;
var
   xo,xn: word;
begin
     {we're using Newton's method for approximating the sqrt}
     if (num=0) then
     begin
         intSqrt:=0;
         exit;
     end;
     xo := 0;
     xn := (num shr 1) + 1; {starting at num itself overflows the maths for large inputs}
     {unsigned-safe test for |xo - xn| > 1, as plain abs() on words can wrap}
     while (xo > xn + 1) or (xn > xo + 1) do
     begin
          xo := xn;
          xn := (xo + (num div xo)) shr 1;
     end;
     intSqrt := xn;
end;

As stated in the comment, I’ve used Newton’s method for calculating the root, mostly because it’s a simple method I know. I did a little testing, and it turns out this returns pretty much the same result as rounding the built-in floating point sqrt function, only quicker. Once I finished the circle drawing code, it drew around 36% more circles than the BGI code in the same time. That’s not bad, but I think there may be more room for optimisation here.
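To show how it gets used, plotting a circle with it can look something like this; again a sketch, assuming a putPixel routine and a radius of at most 255 so that r squared fits in a word:

procedure drawCircle(cx, cy: integer; r: byte);
var
   x, y: integer;
   rsq: word;
begin
     if r = 0 then
     begin
          putPixel(cx, cy);
          exit;
     end;
     rsq := word(r) * r;
     x := 0;
     y := r;
     {walk the first octant, mirroring each point into the other seven}
     while x <= y do
     begin
          putPixel(cx + x, cy + y); putPixel(cx - x, cy + y);
          putPixel(cx + x, cy - y); putPixel(cx - x, cy - y);
          putPixel(cx + y, cy + x); putPixel(cx - y, cy + x);
          putPixel(cx + y, cy - x); putPixel(cx - y, cy - x);
          inc(x);
          y := intSqrt(rsq - word(x) * word(x));
     end;
end;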

Next time I hope to have optimised the circle routine, and I’ll test the code on some hardware to see how all the different libraries compare to each other on different platforms.

27 Feb 15

Creating a Benchmark: part 3

Last time I started the process of comparing the BGI to a hand-coded VGA library, coding up a fairly lazy, pure Pascal library. Today I’ve re-coded some parts of that code in x86 assembly using Pascal’s in-line assembler.

At work in the IDE

The first thing I wanted to tackle was the speed of the sprite blitting and filled boxes as they didn’t seem to live up to their potential. I decided to replace the Pascal move (copy memory) and fillchar (fill memory) functions as they are heavily used in the Pascal only version.

Luckily there are some neat instructions which make copying and filling memory faster even on old processors like the 8086 and 80286. These are MOVS and STOS, string instructions that have been part of the architecture since the original 8086 (the Z80 had similar block instructions, such as LDIR). Using them with the REP prefix makes them even better, as it eliminates some looping code.

MOVS is for copying a block of memory: you load ES:DI with the destination pointer, DS:SI with the source pointer, and CX with the loop count. Then execute…

shr cx,1        { halve the byte count, the odd byte flag goes into carry }
@again:
rep movsw       { copy CX words from DS:SI to ES:DI }
jcxz @next      { CX is zero once the whole copy has completed }
jmp @again      { an interrupt cut the REP short, so resume it }
@next:
jnc @done       { no carry means the byte count was even }
movsb           { copy the trailing odd byte }
@done:
...

This code will copy any count of bytes one word (16 bits) at a time, copying a single byte at the end if the count was odd. I’ve used JCXZ and a jump to continue the data copying, as some older processors have a bug where the REP MOVSW can end early if an interrupt occurs at the wrong time (a plain JMP rather than LOOP, since LOOP would decrement CX and lose a word). I know this isn’t strictly necessary, but it’s a safety measure.

STOS works in much the same way, except it doesn’t source the data from a memory pointer; it uses the accumulator register instead.
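As a sketch of how that might be put to use (an illustrative routine of my own, not the exact code from the library), a memory fill using STOSW could look like this:

procedure fillMem(segm, offs: word; value: byte; count: word); assembler;
asm
     mov es, segm
     mov di, offs
     cld                {fill upwards}
     mov al, value
     mov ah, al         {duplicate the byte so whole words can be stored}
     mov cx, count
     shr cx, 1          {byte count to words, the odd byte goes to carry}
     rep stosw          {store CX words of AX at ES:DI}
     jnc @done
     stosb              {write the final odd byte}
@done:
end;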

With these new memory copy and fill routines done, I tested the program to see if I had any improvement in performance. To my surprise there was none; the built-in functions for copying and filling memory must be about as good as what I wrote. But why were the blitting and box filling still slower than they should be?

It turned out that the loops in the filledBox and putImage functions were the culprit. The Pascal code for putImage’s main loop looked like this…

for i := 0 to sizey do
     copymem(bseg, 4 + (i * sizex), cardseg, ((y + i) * 320) + x, sizex);

It didn’t look problematic until I considered the instructions required for calculating the offsets into the image data and screen buffer. Multiplication is unfortunately a slow operation, and with some nifty assembly code I rewrote both the putImage and filledBox procedures mostly in assembly, avoiding multiplications in the main part of the loop altogether, as sketched below.
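To illustrate the idea (in Pascal for readability, though the real fix was done in assembly): the multiply can be hoisted out of the loop and replaced by adding the row strides, with srcOfs and dstOfs being illustrative names.

srcOfs := 4;                 {offset of the image data}
dstOfs := (y * 320) + x;     {one multiply, done once before the loop}
for i := 0 to sizey do
begin
     copymem(bseg, srcOfs, cardseg, dstOfs, sizex);
     inc(srcOfs, sizex);     {next row of the image}
     inc(dstOfs, 320);       {next scan line of the screen}
end;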

It took me about 2-3 days to re-write the two drawing functions in assembly, when the basic VGA graphics took about a day to write in the first place, but boy did it pay off. After re-writing most of putImage and filledBox in assembly I increased their performance by over 3 times for putImage and almost 2 times for filledBox. Both are also now significantly faster than the BGI implementation, being about twice as fast.

So the BGI is slow compared to raw x86 assembly after all, but it took significant effort to get that performance gain. For the myriad of one-man shareware programmers out there, I can still understand why they just went with the BGI: it was easy to use and good enough for what they were doing.

Making a VGA library with straight pascal was fairly easy to do, but had some disadvantages over BGI and wasn’t really quicker. I had to go to assembly before there was any significant performance gain. Coding assembly is daunting to many programmers, and for me is much more time consuming than writing in a higher level language. It will be quite some time before I really finish re-coding the library in assembly.

Next time I’ll have to tackle the line drawing functions, which are using some floating point numbers to accurately draw the lines. I’m planning on converting them to using fixed point numbers to improve speed on machines without an FPU, like my old 386sx. I’m also hoping assembly will help speed things up there to.

29 Jan 15

Creating my own Benchmark program

Looking at benchmarking programs last week got me thinking about creating my own testing software. I’m a Pascal programmer as far as DOS goes, and I have read in many programming forums about how slow the Borland Graphics Interface is. I decided to test this theory and find out just how slow or fast it really is, and what effect the BGI driver and graphics mode have on performance.

The Borland Graphics Interface is a library that Borland supplied with the Turbo C and Turbo Pascal products. It was used by many because it simplified drawing graphics and meant you didn’t need to write the code for drawing to the screen. This could save a lot of time for the individual programmer, and many shareware programmers used it in their simple games.

As I’ve said before, many have claimed that it is quite slow. So I have written a simple program called BGIbench to test the speed of any BGI driver. It tests some of the more important graphics functions that work on all the drivers; some features, like page flipping, only work with specific drivers.
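The approach each test takes is simple: count how many primitives can be drawn in a fixed time slice. A rough sketch of the idea, not the actual BGIbench source, assuming the Graph unit and a graphics mode already set up with InitGraph:

function circlesDrawn(ticks: word): longint;
var
   timer: longint absolute $0040:$006C;  {BIOS tick count, about 18.2 per second}
   start, count: longint;
begin
     count := 0;
     start := timer;
     while timer - start < ticks do
     begin
          circle(random(320), random(200), 10 + random(40));
          inc(count);
     end;
     circlesDrawn := count;
end;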

The drivers I’ve tested here are the standard Borland CGA and EGAVGA ones that came with Turbo Pascal 6 as well as VGA256 (a mode 13h driver), VESA (uses VESA compatible modes) and SVGA256M (another VESA driver) that I found when I started writing my platform game all that time ago.

BGIbench results for Dosbox @3000 cycles

Here are the results for testing performed under Dosbox. The result that matters the most for games is the sprites; notably the VGA256 driver is the best in this category. All types of lines and circles are about the same between the drivers, although I have noted that drawing circles is pretty slow, slower than even QBasic if I remember correctly.

Of interest are the filled boxes, which in theory should be about as fast as the sprites, yet the EGA, VESA and SVGA256M drivers seem more capable at drawing filled boxes than blitting bitmaps. This doesn’t make sense; there must be something causing slow-down with bitmaps on these drivers. They are even slower than the pattern filled boxes!

BGIbench on a real machine: Pentium MMX 200MHz

On actual hardware things are much different. Notably the VESA driver is nearly as good as the VGA256 one at drawing sprites in the same resolution, and is in fact significantly faster at drawing filled boxes, which makes me think there is still some lost potential in that driver.

The SVGA256M driver unfortunately performs poorly where it counts: blitting sprites. In fact it’s almost as slow as the EGA driver, which is the worst performer. Again there must be significant lost potential, as drawing filled boxes is significantly faster, yet it suffers worse performance in most line drawing.

Circles and text rendering again perform poorly across the board, so these types of primitives should be avoided where possible when writing BGI based software.

In summary, the CGA and EGA drivers offer modes on older hardware that aren’t offered by the others, so they remain useful even though there are faster drivers. The VGA256 driver is the best for 320x200x256 sprite based graphics, but if you want to do vector/line drawing the VESA driver performs better in those areas. The VESA driver also offers higher resolution modes, although at a performance hit as it needs to switch memory banks for drawing. The SVGA256M driver is probably the least useful, as others out-perform it in all areas and resolutions.

The question remains how fast is the BGI compared to writing a graphics driver yourself? This will be answered another day.

UPDATE: A download of the results and the program can be found here.

16 Dec 14

Improving Joystick support for Bob’s Fury

Some of the hardware support within Bob’s Fury has been far from ideal, the joystick/game pad being one such device. I had only added support for a simple 2 button joystick, as that’s all I had when I first wrote the code a long time ago. It proved to be inadequate, as there simply weren’t enough buttons to support all the functions in game, and you couldn’t choose what the buttons do.

Old Joystick Configuration

So with the aid of a real machine and a Gravis game pad, I worked out how the basic 4 button devices work. It turns out they aren’t much different. I had originally coded my interface to expect two joysticks, each with 2 axes and 2 buttons. The buttons on the second joystick correspond to the 3rd and 4th buttons on the Gravis game pad, which meant it was fairly simple to support such a device with minimal effort. I re-wrote the joystick hardware code to present a single joystick with 4 buttons and 4 axes, which should support most devices.
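For the curious, reading the button states from a standard PC game port is straightforward; a minimal sketch (readButtons is my own naming, the bit layout is the standard game port one):

function readButtons: byte;
begin
     {the status byte at I/O port $201 holds buttons 1-4 in bits 4-7,
      active low, so invert and mask to get 1 = pressed}
     readButtons := ((not Port[$201]) shr 4) and $0F;
end;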

New Configuration Screen

Once I got the joystick code re-worked I didn’t want to give fixed functions to the buttons, so I had to recode the configuration interface to allow changing what the buttons do, which funnily enough took longer to get right than the hardware side of things. After some testing in Dosbox I’m pretty happy with the result. I still need to test on some real hardware to make sure everything works, and I need to test a two button joystick to ensure that still works as well. I’ll update the download once I’ve tested it on real hardware.

27 Oct 14

A Guide to Memory/Object Pooling

I was recently listening to a vlog by a games developer who wanted a new programming language specifically for games, and one of his big gripes with C++ was the memory allocation process. He had a number of arguments, none of which resonated with me, but it did spark my interest in the memory allocation/de-allocation problem. I’ll briefly discuss the problem from the perspective of a language like C (in this case Pascal) and one that uses garbage collection (in this case Java).

Most languages dynamically allocate and free memory in very much the same way as even very old languages such as Pascal did. You have to explicitly ask the API to allocate memory for you, and then explicitly free it when you are finished with it. In an OO language like C++ this results in the need for destructors to free the memory the object uses when it is destroyed.

Either way memory is allocated and freed, and if this is done on a regular basis it can lead to something called heap fragmentation. Heap fragmentation is not quite the same as disk fragmentation, in that an allocated block will not be split up. Pascal for instance allocates memory one block after another, simply creating the new one at the end of the old one. However, when a block of memory is freed it leaves a hole; these holes are effectively the heap fragmentation.

Pascal deals with these holes by maintaining a free list, which is a linked list containing all the free blocks available. When an application allocates more memory, the list is traversed to find the first available block big enough for the new allocation, hopefully filling the hole, but possibly leaving a small amount still available. Clearly the holes will likely get smaller over time until they are unusable, and the more holes there are, the longer any memory allocation will take.

Other allocation schemes are available in other languages, but they have similar issues; take a look here to see what I mean. Whatever scheme is used, there is a significant cost the more you allocate and free memory.

Garbage collected languages such as Java have problems of their own. Whilst it is claimed that allocating memory in Java is practically free in terms of complexity, freeing memory is known to have performance issues as the garbage collector (GC for short) kicks in to clean up. The mechanics of memory allocation are largely hidden in a language like Java, so they shouldn’t concern you much, but the performance impact a garbage collector can have should.

So why is the GC a potential performance hazard? In order for the GC to know which memory to free, it must work out which memory is still being used. One way of doing that is maintaining a reference count for each pointer/reference in use. Obviously any reference with a count of zero can be freed, but this leads to another problem: determining which collections of objects can be freed.

Imagine a data structure like a tree in which every node holds a reference not only to its children but to its parent as well. If you were to use the tree, holding only a reference to the root node, then later released that reference, reference counting would not allow the tree’s memory to be freed, even though no external objects can reach it. A GC therefore also needs an algorithm that checks the reachability of memory.

That’s a rather simplistic view of garbage collection, but it’s enough to see that running the GC can be expensive.

So how can you deal with these issues? It seems it doesn’t matter which kind of system/language you use, memory management has real implications for performance. One answer is to use a technique commonly called object or memory pooling.

A simple way to think of a memory pool is that it’s a lot like a cupboard. When you need something it’s quite quick to go to the cupboard and get something that’s available there, and when you’re done with it you return it. If the cupboard turned out to be empty, you’d have to go down to the shops in order to buy more of whatever you need. An advantage being you can stock your cupboard with resources you plan on using later.

Memory and object pools are effectively exactly the same thing. You can preallocate memory, use it, return it to the pool, then use it again. This reduces the amount of stress on the memory allocation system, leading to fewer GCs and less heap fragmentation. You’d be amazed how much difference something like this can make on a system, especially something like Android.
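Although I’ll use Java for the example below, the idea is language agnostic. Here’s a minimal fixed-size pool sketch in Pascal, with TParticle as a stand-in payload:

const
     POOL_SIZE = 64;
type
     PParticle = ^TParticle;
     TParticle = record
          x, y, dx, dy: integer;
          next: PParticle;    {links free slots together}
     end;
var
     pool: array[1..POOL_SIZE] of TParticle;
     freeHead: PParticle;

procedure initPool;
var
     i: integer;
begin
     {chain every slot onto the free list up front}
     for i := 1 to POOL_SIZE - 1 do
          pool[i].next := @pool[i + 1];
     pool[POOL_SIZE].next := nil;
     freeHead := @pool[1];
end;

function acquire: PParticle;
begin
     {take a slot from the cupboard; nil means it's empty}
     acquire := freeHead;
     if freeHead <> nil then
          freeHead := freeHead^.next;
end;

procedure release(p: PParticle);
begin
     {put the slot back for reuse - no heap traffic at all}
     p^.next := freeHead;
     freeHead := p;
end;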

As an example, I implemented a path finding algorithm in Java for a maze game I am writing. I could have used a dynamically sized data stack that grows and shrinks as the algorithm functions. That would however trigger the GC much more frequently as it uses and releases memory on the data stack. It is much better to preallocate the maximum size of the stack once, and reuse the same memory every time the algorithm is used. This only allows one instance of the algorithm to run at once, but it can be adjusted to allow multi-threading by creating a shared pool that each instance can use.

So what kinds of objects or memory should you use pooling for? Ideally anything short-lived should have some kind of pooling scheme in place, as such objects put the most stress on the memory allocation system. These would be objects like particles and bullets in a game engine. Long-lived and infrequently used objects pose less of a problem and don’t really need to be pooled, as they cause much less stress.

Pooling can also be used for other resources that are expensive to create and destroy, such as database or network connections. This can help alleviate stress on other systems such as the network stack, database servers, and file systems. So pools are really an important tool in any programmer’s toolbox for improving performance.



