Archive for the 'Pascal' Category

31
Jan
17

Creating a benchmark: part 6 problem solved!

Quite some time ago, in fact more than a year ago now, I was working on a basic series of benchmarks to test the comparative performance of the Borland Graphics Interface (BGI) versus a hand coded graphics library. I ran into a problem with my hand coded library not working on some of my real hardware. I ran the tests on a Pentium MMX @ 200Mhz and a 386sx @ 20mhz, it ran fine on the newer machine whilst failing on the older one. I figured perhaps I was overloading the older graphics chip.

So coming back to the problem today with a renewed sense of determination, I did some reading. There happened to be a book in the university library about programming VGA graphics, and I noted that all of their code only copies single bytes to graphic memory at a time, I was copying 2 bytes (a 16 bit word) at a time so wondered if that might be the issue. I changed it and the program still crashed almost immediately.

After some tinkering and some basic math I worked out that sprites being drawn near to the bottom of the screen were the cause of the problem. The test algorithm actually allows sprites to be drawn partially obscured by the right or bottom edge of the screen. On most VGA cards this isn’t a problem, but the trident chip in the 386sx didn’t like any pixels being drawn outside of the visible frame buffer. After writing some basic clipping into the sprite routines the program worked all the way through.

I’ve tested 3 different programs on two different real machines (as opposed to emulation). BGIbench is the original program that uses the provided graphics libraries that came with Turbo Pascal, I’ve used it in conjunction with the VGA256 BGI driver. VGABench is the initial lazy implementation of a hand made graphics library, it’s implemented entirely in pascal and isn’t optimised at all. Lastly is VGABench2 which is an optimised version of VGABench using assembly where necessary, and in the case of line drawing a better algorithm. All three programs were coded and compiled with Turbo Pascal 6.0 and use the same basic test code.

Each program performs 7 tests based around primitive functions found in the BGI. Each test counts the number of primitives that can be drawn over a period of 30 seconds. VGABench doesn’t have a function for drawing circles so it has no results for that test. The tests in order: put-pixels simply draws individual pixels a random colour. Filled Boxes draws solid coloured rectangles in a pre-defined pattern. Circles draws a number of concentric circles in a per-determined way. Random lines does what you’d expect, drawing lines randomly. Horizontal and vertical lines are similarly obvious. Finally the sprite test draws a 10×10 bitmap at random locations on the screen.

At first glance it’s pretty obvious that optimised, hand written code is significantly faster on the Pentium, particularly for filled boxes and sprites where VGABench2 achieves roughly twice the output to the screen. I managed a five-fold increase in the rate of drawing circles, which is impressive as circles are the hardest to draw. The first VGA Bench however is not really much better than the BGI and in some tests actually performs worse. You’ll note that the put-pixel tests all come out fairly close in terms of results, this is because there is little to optimise there.

I suspected that the reason VGABench2 performed so well is because the code copies 16 bits at a time instead of 8bits. I tested this out by changing it to copy 8bits at a time and found it was still faster than the BGI, but by a much smaller margin. VGABench2 copying 16bits yields about 156k sprites, but modifying it to copy 8bits yielded about 80k compared to BGI which achieves around 73k sprites. The first VGABench demonstrates how important optimisation is. It copies data 16bits at a time, but doesn’t even achieve the performance that BGI does, managing around 68k sprites.

386sx-20

The picture looks quite different on the 386sx machine, with the performance looking much more even with a few exceptions. The BGI seems to perform comparatively well across most categories only lagging behind in drawing circles, filled boxes and sprites. My first lazy implementation, VGABench seems to lag behind in pretty much everything except drawing filled boxes, which barely outperforms the BGI.

The results for VGABench2 are good, but not as good as on the Pentium machine. Line drawing is basically the same speed as the BGI. Filled boxes achieves about twice the speed, and circles about 5 times the speed, but the sprites are comparatively slower at about 1.5 times the speed. The explanation for the performance of sprites and filled boxes is interesting and is related to how the test is implemented. The filled boxes are drawn in a deterministic way, a grid of 10×10 sized boxes, the sprites are distributed randomly. Filled boxes end up being drawn pretty much always on even addresses, and sprites will be drawn on even and odd addresses around the same amount. This affects speed because of something called word alignment.

The 386sx, 286 and 8086 processors have a 16 bit data bus, which means the processor can access 16bits at a time. The memory is organised as a bunch of 16bit words, so when accessing 16 bits on an even address only a single memory word is accessed, and when a 16 bit access needs an odd address, two memory words are required. This means doing 16bit reads/writes on odd addresses are half as fast as on even ones, in fact they are about the same speed as an 8 bit transfer.

In beginning to sum up the series, it’s important to remember why I started it at all. Basically I remembered hearing many people discouraging use of the BGI mostly because it is slow compared to hand crafted code. The tests I’ve done have confirmed this, but I feel that it’s also shown how a lazy (or poor) implementation can be even slower. My optimised code is faster, but took a lot of time and effort to create and is missing many features that the BGI provides, such as support for other graphics cards, clipping, and other graphics primitives that are more complicated. I can see why many people without the time and know-how would have found it easier to simply use the graphics library provided.

That being said, hand written optimised code certainly has an important place as well. It was pretty fun and challenging to try and make something faster, even though I almost certainly didn’t get my code anywhere near as fast as more proficient assembly programmers. Also it’s hand optimised code that made many PC action games possible at all. Gaming on the PC would be very different without the hardware guru’s that could squeeze amazing things out of basic hardware. Writing this has made me reconsider my stance on sticking with the BGI for my platform game, I probably won’t gain a lot of performance, but I may be able to get it to work on older hardware as a consequence.

Code and binaries are available from the pascal downloads.

Advertisements
29
Jan
16

Huffman Coding – Compression results

Before the Christmas break I started writing a encoding/compression library based on a technique called Huffman coding. I’ve since completed and tested the encoder and decoder, today we’ll discuss the results of several compression tests in comparison to the RLE encoding technique. Here are links back to the posts for Huffman Coding and Run-Length Encoding.

So I’ve collected a few different types of data files with different characteristics. First up is a MS-DOS executable, these binary data files typically don’t contain runs of data and will tend to contain most if not all symbols. Next is a ASCII text file, which will typically only use alphabetic, numeric and punctuation characters. Lastly a small number of graphics files, in raw bitmap form, these particular images are fairly basic (only a handful of colours).

Here’s a table showing the results.

Compression ratio table

You’ll note that the Huffman encoder achieves better results in every example! The worst result for it being the MS-DOS executable, which is of course expected. I was pleasantly surprised that the small graphic files compressed quite well, this is likely because of how few symbols are in these files. If there had been a large variety of symbols the dictionary and the encoded data would be much larger.

Run-length encoding was obviously much worse, and in the case of ASCII text and executable data it actually increased the size of the data! Knowing how it works this is hardly surprising. I did expect RLE to perform better than Huffman coding for the graphic data, but I suspect the low colour count in these files is an influencing factor.

So I thought I’d create a graphic data file that had a higher colour count that would favour RLE to see if it can do better under specific conditions. The graphic data I created is a sprite 32×32 pixels with horizontal coloured stripes, each a unique colour. The size of the raw file came out at 1026 bytes.

The Huffman encoder produced a file of 766 bytes and the RLE produced one of 68 bytes! Whilst the Huffman encoder still managed to compress the data, it couldn’t match the best case scenario for RLE.

There are good reasons to use both compression techniques. Run-length encoding doesn’t require much CPU power to encode or decode, so it can be done on very weak machines (even old 8-bit machines), but it only compresses data with many runs in it well. This turns out to be useful for graphics and level data. Huffman coding will pretty much always reduce the size of the data encoded, sometimes significantly, but it is much more complex. It requires more RAM and CPU power to achieve this result, so it can’t be used on every machine, although most modern processors would have no trouble at all.

I’ve made the code for the Huffman encoder/decoder available in pascal for download. be aware the encoder still has some debug code in it that will write to the screen.

30
Nov
15

Huffman Coding

Quite some time ago I did a short post about a compression technique called run-length encoding (or RLE), that was a commonly used compression method for graphic and level data. At the time I wrote and implemented it for the graphics and level data for my home-brew game Bobs Fury with quite the success, significantly reducing the disk space required for the base game.

I do however have some files left, which are largely just ASCII text, that don’t compress well or at all using that technique. I’d like to make them smaller, and of course harder to read as they contain story text, not that my game has a brilliant story, but you get the idea. Step in an encoding technique called Huffman coding.

Essentially the encoding algorithm encodes symbols as a variable length stream of bits. The most frequently used symbols are represented by shorter bit streams, whilst the least used have longer ones. Normally in a computer each symbol would be encoded as a fixed number of bits, such as say 8-bit or 16, so this results (hopefully) in shorter encodings for most symbols, and longer ones only for the rarely used ones.

The tricky part is creating the Huffman tree, which is basically a code-book representing how each symbol is encoded or decoded. Here is a quick tutorial on how they are created, which will also give you a feel for how the encoding works. It’s also commonly known as a dictionary, or code book.

A fixed tree can be used for everything, but would not do the best job for every set of data being compressed. So typically a tree is created and stored along with the encoded data to achieve the best compression possible. This of course does add some overhead, which could be a problem if the resulting encoding isn’t much shorter than the original.

Huffman coding typically works poorly when the symbols all appear in the text at roughly the same frequency. The worst case being everything with exactly the same frequency. Notably it won’t produce an encoding that is longer than the original data, although with the overhead of storing the tree, you could end up with a larger data file. In practise this rarely happens.

Other data such as English text stored in ASCII actually compresses quite well. As usually not all the 255 characters are used, most encodings will be shorter than 8 bits per symbol. Also because natural language uses some letters more than others the average encoding length for all the symbols will be shorter.

Huffman coding was actually invented quite some time ago (1951) by David Huffman, well before it came into common use. Check out the Wikipedia page for more information. It’s a part of many commonly used compression programs and formats such a Zip and Bzip. Older 8-bit machines typically weren’t powerful enough, so it wasn’t commonly used until more powerful machines with more memory became available.

It took me much longer than usual to write this post primarily because I began the process of writing an encoder and decoder, but because of the complexity, it’s taken up much more time than I expected. Currently I have just finished the encoder, but have yet to test it. I had hoped to get the code running first, but that will have to wait.

28
Aug
15

Creating a Benchmark: part 5 – a road-block in testing

Some months ago I set about working on creating a bench mark program to compare the Borland Graphics Interface (BGI for short) with a hand coded in assembly VGA library. I had completed work on the library and even tested it on my Pentium 200Mhz MMX. So I decided to do the final tests to get the results when I ran into a road-block that still has me stumped.

I had three different setups to test the code on. Dosbox (at a fixed cycle count), my Pentium 200Mhz MMX machine running MS-DOS 6.22 and this machine (the last in the post), a 386sx running at 20Mhz. The BGI code runs fine on all three machines, but the hand coded library fails on the 386sx locking up during the first test – blitting sprites.

This was quite confusing as the test appears to at least partially work as a number of sprites are drawn to the screen quite quickly, but sometime during the test it always freezes. I thought about what could cause this and decided to try disabling interrupts during the drawing process. I did this as the main instruction used for copying bytes to video RAM does have an interrupt bug on older processors (286). This unfortunately had no effect. See this previous post to see code I used for copying memory.

Whilst disabling interrupts didn’t fix the problem, it did confirm which instruction was causing the problem. The rep movsw instruction was the only one I put between disabling and enabling interrupts (using cli and sti). When the machine froze with this change it no longer responded to keyboard interrupts that it did before. This indicates a crash during either an interrupt itself (unlikely) or this repeating instruction.

Given this strange problem I decided to test the machine by running memory tests and another graphical benchmark to see if there could be a hardware issue. It seems that the memory is ok, and it ran a 3d VGA benchmark with no trouble over night. I’ve also played some games such as Hocus Pocus that have parallax scrolling and would put a bit of load on the relevant parts with no problems.

So the question arises, am I putting too much data/strain into the Trident graphics chip for it to cope? I’m unsure if i can answer this. The process of copying the memory has several overheads. The processor has to read the word, write the word, increment the memory indexes and then decide whether to stop copying for each iteration of the instruction. This would at best be one word copied per 2 bus clock cycles, but is likely to be quite a bit slower, from memory I think like 6 clocks per memory copy for a 386.

I think it unlikely that I’m over taxing the graphic chip, but this problem does highlight one of the major problems people had when creating their own graphics libraries. Small shareware developers couldn’t have tested enough hardware to ensure that what they wrote would work on a large array of systems. The BGI however, having more resources behind it, has a quite good compatibility record.

06
Apr
15

Creating a benchmark: Part 4

Last time I used the  in-line assembler to improve the speed of the sprite blitting functions with lots of success. Other functions such as line drawing however still suffered speed issues on systems with out a built-in FPU such as the 386sx.

The reason for the slow line drawing was simple, I had used a simple slope based algorithm that used floating point. On a system with a FPU this was quite fast, about as fast as anything else, but obviously it wasn’t going to cut the mustard when it came to the old processors I’m targeting. So I decided to give fixed point arithmetic a go.

Fixed point numbers are a variation on basic integers, using some of the bits to represent the fractional part. You can then use integer instructions to do some basic math that involves fractions. This is much faster for processors lacking hardware floating point support. Check out more details of fixed point here or at Wikipedia. It’s a very common technique for embedded systems, but was also used in games such as Doom for speed.

I however ran into a problem. Because the integer in turbo pascal is 16bits I didn’t have enough bits for addressing all the pixels and have enough precision to for the fractional part of the slope. For the integer part I needed to have a range of 0-319, requiring 9 bits unsigned. The fractional part was thus left with 7 bits, with the smallest representable fraction being 1/128th. The smallest conceivable slope in 320×200 is 1/320 which obviously is much smaller. I could have switched to using longints or storing the fractional part separately but this would have added extra overhead that would make line drawing slower.

I set the line drawing problem aside for the moment to look for a quick circle drawing algorithm. Something I hadn’t implemented yet. The SWAG archive came up with suitable options, but also had some algorithms for line drawing. In particular the Bresenham line drawing algorithm.

Not having coded my own low level line drawing routines before I hadn’t heard of it, but it uses all integer math in a clever way to produce the correct slope. I modified the algorithm I found to reduce the amount of calculations for the screen address when plotting points in the line. The resulting code was slightly faster than previous line drawing code, but not overwhelmingly so.

Returning to implementing circle drawing, the algorithms I found unfortunately used floating point math. They all use the basic formula for a circle, x^2 + y^2 = r^2, thus requiring the use of a square root function to calculate the co-ordinates. Unfortunately it is one of the more expensive floating point operations.

One algorithm I looked at used the square root function, but rounded it to an integer immediately. This got me thinking, if I could implement an approximation of square root using integer arithmetic I could draw circles quickly. So after a bit of research I wrote exactly that. Code follows.

function intSqrt(num: word):word;
var
   xo,xn:word;
begin
     {we're using Newtons method for approximating the sqrt}
     if (num=0) then
     begin
         intSqrt:=0;
         exit;
     end;
     xo := 0;
     xn := num;
     if xn=0 then xn:=1;
     while (abs(xo-xn) > 1) do
     begin
          xo := xn;
          xn := (xo + (num div xo)) shr 1;
     end;
     intSqrt := xn;
end;

As stated in the comment, I’ve used Newton’s method for calculating the root, mostly because it’s a simple method I know. I did a little testing, and it turns out this returns pretty much the same result as rounding the built-in sqrt function that uses floating point, only quicker. Once I finished the circle drawing code it drew around 36% more circles in the same time as the BGI code. That’s not bad, but I think there may be more room for optimisation here.

Next time I hope to have optimised the circle routine, and I’ll test the code on some hardware to see how all the different libraries compare to each other on different platforms.

27
Feb
15

Creating a Benchmark: part 3

Last time I started the process of comparing the BGI to a hand coded VGA library. I coded up a fairly lazy completely Pascal library. Today I’ve re-coded some parts of that code using x86 assembly with Pascals in-line assembler.

At work in the IDE

At work in the IDE

The first thing I wanted to tackle was the speed of the sprite blitting and filled boxes as they didn’t seem to live up to their potential. I decided to replace the Pascal move (copy memory) and fillchar (fill memory) functions as they are heavily used in the Pascal only version.

Luckily there are some neat instructions which make copying and filling memory faster even on old processors like the 8086 and 80286. These are MOVS and STOS, both string instructions which actually owe their existence to the Z80 and i8080 where they first appeared. Using them with the REP prefix makes them even better as it helps eliminate some looping code.

MOVS is for copying a block of memory, you load the ES:DI registers with the destination pointer and DS:SI with the source pointer and CX with the loop count. Then execute…

shr cx,1
@again:
rep movsw
jcxz @next
loop @again
@next:
jnc @done
movsb
@done:
...

This code will copy any count of bytes one word (16 bits) at a time, copying a single byte at the end if you specified an odd count. I’ve used JCXZ and LOOP to continue the data copying as some older processors have a bug where the REP MOVSW can end early if an interrupt occurs at the wrong time. I know this isn’t strictly necessary, but it’s a safety measure.

STOS works in much the same way, just it doesn’t source the data from a memory pointer, it uses the accumulator register instead.

With these new memory copy and fill routines done I tested the program to see if I had any improvement in performance. To my surprise there was none, the built-in functions for copying and filling memory must be about as good as what I wrote, but why is the blitting and box filling still slower than they should be?

It turned out that the loops in the filledBox and putImage functions were the culprit. The pascal code looked like this for putImages main loop…

for i:= 0 to sizey do
         copymem(bseg,4+(i*sizex),cardseg,((y+i)*320)+x, sizex);

It didn’t look problematic until I considered the instructions required for calculating the offset into the image data and screen buffer. Multiplication is an unfortunately slow operation, and with some nifty assembly code I rewrote both the putImage and filledBox procedures mostly in assembly, avoiding multiplications in the main part of the loop altogether.

It took me about 2-3 days to get through all the work re-writing two of the drawing functions in assembly, when it took about 1 to write the basic VGA graphics to begin with, but boy did it pay off. After re-writing most of putImage and filledBox in assembly I increased their performance by over 3 times for putImage and almost 2 times for filledBox. Both are also now significantly faster than the BGI implementation, being about twice as fast.

So the BGI is slow compared to raw x86 assembly after all, but it took significant effort to get that performance gain. For the myriad of one-man shareware programmers I still understand why they just went with the BGI, it was easy to use and good enough for what they were doing.

Making a VGA library with straight pascal was fairly easy to do, but had some disadvantages over BGI and wasn’t really quicker. I had to go to assembly before there was any significant performance gain. Coding assembly is daunting to many programmers, and for me is much more time consuming than writing in a higher level language. It will be quite some time before I really finish re-coding the library in assembly.

Next time I’ll have to tackle the line drawing functions, which are using some floating point numbers to accurately draw the lines. I’m planning on converting them to using fixed point numbers to improve speed on machines without an FPU, like my old 386sx. I’m also hoping assembly will help speed things up there to.

12
Feb
15

Creating a Benchmark: part2

A couple of weeks ago I created a basic benchmarking program for measuring the speed of the Borland Graphics Interface and its drivers. I’m primarily interested in how they compare not only to each other, but also to a hand-coded implementation. So this week I created a VGA graphics unit by hand and made a benchmark program around it.

I chose coding for VGA 320x200x256 as it is the easiest mode to code for and matches more of the BGI drivers. You simply initiate the graphics mode 13h (h for hex) with the video BIOS, this sets up a linear buffer for drawing at the memory location A000h:0000h. Each pixel is a single byte, so drawing a pixel doesn’t require bit masking unless you want it to. Drawing a pixel simply involves changing the byte at the offset following this simple formula. (y*320) + x.

Given this information it was no problem at all to code up a basic graphics unit. I didn’t use much in the way of assembly code to implement the unit, partly out of laziness, instead opting to implement it using Pascal code mostly. I haven’t implemented all the graphical functions in the BGI, simply because there are way too many.

Here are the results when tested under Dosbox with 3000 cycles. I used pretty much the same code to perform the measurement to ensure as much consistency as possible. It’s quite interesting to see that implementing your own graphics unit doesn’t really provide that much extra performance for most functions, and in this case blitting sprites is actually slower using my code! I suspect this is because I used a built-in Pascal function for copying memory that may not be super fast. I did note it is still faster than both the VESA and SVGA256M driver in the same mode.

So is it worth implementing your own graphics driver instead of using the BGI. The answer is a sorta, maybe. I haven’t optimised my graphics code in this case, so it surely could be a bit faster, but I did manage to use less memory for storing the sprites, and my code was much smaller in terms of size. However the Graph unit and VGA256 actually seem to have some decent performance comparatively, so if you need compatibility with other cards that are more difficult to code for, or simply don’t have time to code a graphics unit of your own, then the graph/BGI implementation isn’t too bad.

Code and DOS binary are available here.




Blogs I Follow

Enter your email address to follow this blog and receive notifications of new posts by email.


Mister G Kids

A daily comic about real stuff little kids say in school. By Matt Gajdoš

Random Battles: my life long level grind

completing every RPG, ever.

Gough's Tech Zone

Reversing the mindless enslavement of humans by technology.

Retrocosm's Vintage Computing, Tech & Scale RC Blog

Random mutterings on retro computing, old technology, some new, plus radio controlled scale modelling.

ancientelectronics

retro computing and gaming plus a little more

Retrocomputing with 90's SPARC

21st-Century computing, the hard way

lazygamereviews

MS-DOS game reviews, retro ramblings and more...