Last time I started the process of comparing the BGI to a hand coded VGA library. I coded up a fairly lazy completely Pascal library. Today I’ve re-coded some parts of that code using x86 assembly with Pascals in-line assembler.
The first thing I wanted to tackle was the speed of the sprite blitting and filled boxes as they didn’t seem to live up to their potential. I decided to replace the Pascal move (copy memory) and fillchar (fill memory) functions as they are heavily used in the Pascal only version.
Luckily there are some neat instructions which make copying and filling memory faster even on old processors like the 8086 and 80286. These are MOVS and STOS, both string instructions which actually owe their existence to the Z80 and i8080 where they first appeared. Using them with the REP prefix makes them even better as it helps eliminate some looping code.
MOVS is for copying a block of memory, you load the ES:DI registers with the destination pointer and DS:SI with the source pointer and CX with the loop count. Then execute…
shr cx,1 @again: rep movsw jcxz @next loop @again @next: jnc @done movsb @done: ...
This code will copy any count of bytes one word (16 bits) at a time, copying a single byte at the end if you specified an odd count. I’ve used JCXZ and LOOP to continue the data copying as some older processors have a bug where the REP MOVSW can end early if an interrupt occurs at the wrong time. I know this isn’t strictly necessary, but it’s a safety measure.
STOS works in much the same way, just it doesn’t source the data from a memory pointer, it uses the accumulator register instead.
With these new memory copy and fill routines done I tested the program to see if I had any improvement in performance. To my surprise there was none, the built-in functions for copying and filling memory must be about as good as what I wrote, but why is the blitting and box filling still slower than they should be?
It turned out that the loops in the filledBox and putImage functions were the culprit. The pascal code looked like this for putImages main loop…
for i:= 0 to sizey do copymem(bseg,4+(i*sizex),cardseg,((y+i)*320)+x, sizex);
It didn’t look problematic until I considered the instructions required for calculating the offset into the image data and screen buffer. Multiplication is an unfortunately slow operation, and with some nifty assembly code I rewrote both the putImage and filledBox procedures mostly in assembly, avoiding multiplications in the main part of the loop altogether.
It took me about 2-3 days to get through all the work re-writing two of the drawing functions in assembly, when it took about 1 to write the basic VGA graphics to begin with, but boy did it pay off. After re-writing most of putImage and filledBox in assembly I increased their performance by over 3 times for putImage and almost 2 times for filledBox. Both are also now significantly faster than the BGI implementation, being about twice as fast.
So the BGI is slow compared to raw x86 assembly after all, but it took significant effort to get that performance gain. For the myriad of one-man shareware programmers I still understand why they just went with the BGI, it was easy to use and good enough for what they were doing.
Making a VGA library with straight pascal was fairly easy to do, but had some disadvantages over BGI and wasn’t really quicker. I had to go to assembly before there was any significant performance gain. Coding assembly is daunting to many programmers, and for me is much more time consuming than writing in a higher level language. It will be quite some time before I really finish re-coding the library in assembly.
Next time I’ll have to tackle the line drawing functions, which are using some floating point numbers to accurately draw the lines. I’m planning on converting them to using fixed point numbers to improve speed on machines without an FPU, like my old 386sx. I’m also hoping assembly will help speed things up there to.