I mailed this to the ggi mailing list, and if it is not 100% correct, I hope I will get email about it. I will then fix this document! I will also paste my code when it is finished.
erikyyy at erikyyy dot de 2000-02-02 08:51:20 CET

This document describes an abstraction layer for fast graphics output. The interface is what YOU want (tm). The implementation uses the GGI library (www.ggi-project.org) and is of optimal speed, since I talked to all the ggi developers until all of them were satisfied with it. If you have a faster solution FOR THE GIVEN PROBLEM, please mail me!

INTERFACE:

void FAST_init();
  Called to initialize display output with a black 640x480x8 screen.

void FAST_deinit();
  Called to close the display.

void FAST_setcolor(uint8 colornum, uint16 r, uint16 g, uint16 b);
  With this you can set each of the 256 colors of the 8bit palette.

uint8 *FAST_mainmem;
  This is a buffer of 640*480 bytes. It is allocated when you call FAST_init(); FAST_deinit() frees it again. The buffer is raw linear data. If you work with it, the content of the display DOES NOT CHANGE (you must call FAST_updatescreen() when you want that). The implementation never changes the content of the FAST_mainmem buffer. Use this buffer to create your picture. You can use all of the host's capabilities to do so, e.g. cache, 64bit access, native byte ordering and the fastest possible access (nothing is faster than the computer's own memory).

void FAST_updatescreen();
  Calling FAST_updatescreen() puts FAST_mainmem onto the display. After the call, the new stuff is visible to the user.

WHAT IS THIS GOOD FOR:

Programs like DOOM and Quake, i.e. programs that do all their rendering in software (in FAST_mainmem) and then do one large put onto the display (FAST_updatescreen()). Such a program usually changes the whole screen every time. If it instead only updates some parts of the screen, the solution presented in this document is ABSOLUTELY NOT OF OPTIMAL SPEED! (but hey, it is easy to use ;)

If you do a 3D engine that draws solid color polygons, the solution is also NOT OPTIMAL, because you could use hardware accelerated horizontal lines for this.
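
To make the "raw linear data" layout of FAST_mainmem concrete, here is a minimal sketch of software rendering into such a buffer. It assumes the pixel at (x, y) is the byte at offset y*640 + x, which is what a linear 640x480x8 buffer implies. The helpers put_pixel and fill_rect are hypothetical names, not part of the interface above. uint8_t from <stdint.h> stands in for uint8, and the buf parameter stands in for FAST_mainmem so that the sketch does not need GGI.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { FAST_WIDTH = 640, FAST_HEIGHT = 480 };

/* Set one pixel. buf is a FAST_WIDTH*FAST_HEIGHT byte buffer, stored row
 * after row (assumed layout of FAST_mainmem). */
void put_pixel(uint8_t *buf, int x, int y, uint8_t colornum)
{
    assert(x >= 0 && x < FAST_WIDTH && y >= 0 && y < FAST_HEIGHT);
    buf[(size_t)y * FAST_WIDTH + x] = colornum;
}

/* Fill a rectangle, clipped to the screen, using one memset per row. */
void fill_rect(uint8_t *buf, int x, int y, int w, int h, uint8_t colornum)
{
    int x0 = x < 0 ? 0 : x;
    int y0 = y < 0 ? 0 : y;
    int x1 = x + w > FAST_WIDTH ? FAST_WIDTH : x + w;
    int y1 = y + h > FAST_HEIGHT ? FAST_HEIGHT : y + h;

    if (x1 <= x0)
        return; /* nothing visible after clipping */

    for (int row = y0; row < y1; row++)
        memset(buf + (size_t)row * FAST_WIDTH + x0, colornum, (size_t)(x1 - x0));
}
```

A program would call such helpers on FAST_mainmem as often as it likes and then call FAST_updatescreen() once per finished picture.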
Of course everything that could be hardware accelerated won't be hardware accelerated in this solution!

The program must use 640x480x8 mode. The interface cannot do other modes (but that might be easy to change).

IMPLEMENTATION:

It works on every GGI target, so it also works when your Xserver runs at 16bit instead of 8. I will first tell HOW to do it, and then WHY this is the fastest possible solution.

HOW TO DO IT:

void FAST_init();
  - do FAST_mainmem = (uint8*) malloc(640*480);
  - open a ggi visual 640x480x8. If this is not possible, open a ggi visual 640x480x8 with the palemu target and warn the user that the program would run MUCH faster on an 8bit capable display (e.g. an Xserver in 8bit mode).
  - put the visual in asynchronous mode (very important for speed).
  - try to open two frames. If this is not possible, open only one frame, which is always possible. See below.
  - do not do anything with directbuffers, it's useless (see below why).

void FAST_deinit();
  - free(FAST_mainmem);
  - close all the ggi stuff you opened.

void FAST_updatescreen();
  - use ggiPutBox to put FAST_mainmem into the invisible frame.
  - call ggiFlush on your visual.
  - then swap the frames: make the frame you just wrote the visible one and the other one the write frame for the next update (with ggiSetDisplayFrame and ggiSetWriteFrame).
  - if you do not have 2 frames, you just ggiPutBox into the visible frame and ggiFlush. (You will then get a little flicker like in the old vga320x200 days.)

WHY THIS IS THE FASTEST AND BEST SOLUTION

You could use directbuffers (if you have 2 frames!).

advantage:
- after painting you need not put the mainmem buffer onto the graphics card, since the picture is already there.

disadvantage:
- every access goes through the slow PCI bus. If you do for example texture mapping and put pixel after pixel, this will be slower than doing the texture mapping in the mainmem buffer and then putting the whole buffer in 32bit bursts onto the graphics hardware.
- access is slow. This is bad when you for example first paint the background and then stack several big sprites on top of it. Say writing the whole screen into the directbuffer takes 3 sec, writing it into the mainmem buffer takes 1 sec, and the ggiPutBox of the mainmem buffer onto the display takes 4 sec. If you only write once, the directbuffer takes only 3 seconds, but the mainmem buffer takes 1+4=5 seconds in total, so the directbuffer wins. But if you overwrite 10 times, the directbuffer takes 3*10=30 seconds, while the mainmem buffer takes 1*10+4=14 seconds, so the mainmem buffer wins.
- some hardware cannot do 32bit or 16bit access (you can check this in some ggi structures), so you will have to fall back to 8bit. In every one of your graphics routines that uses more than 8bit at a time, you must deal with this problem and also write some code for the smaller access size. On other architectures like the ALPHA, your program will die with a SIGBUS error if your 32bit or 16bit accesses are not aligned! So again you would need more complicated, hardware dependent specialization in your routines.
- ggiPutBox is always maximally optimized for the current kind of hardware. It will do burst accesses or whatever else is possible. This is really much more efficient than fiddling around directly in the directbuffer, believe me ;)

Why not use a mem-target instead of the mainmem buffer and do a ggiCrossBlit with it?

Because this is exactly the same thing. It is exactly the same speed, but it is more complicated to code, so why should you do it. The ggi folks told me that CrossBlit could do DMA bursts on some hardware. But if CrossBlit from a mem-target to the display does DMA bursts and ggiPutBox from malloced memory to the display doesn't, then I think that is a ggi implementation problem. In theory, the mem-target could be a malloced buffer somewhere in a special area of memory which makes especially fast DMA bursts possible. Then CrossBlit would be faster. If this ever happens, please tell me.
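
The timing argument in the directbuffer discussion above (3 sec per directbuffer pass, 1 sec per mainmem pass, 4 sec for the ggiPutBox) can be worked as a tiny cost model. This is only a sketch of that arithmetic. The function names are hypothetical, and the assumption that every pass costs the same is a simplification, not a measurement.

```c
#include <assert.h>

/* Time to paint `layers` full-screen passes straight into the directbuffer. */
double directbuffer_cost(int layers, double t_direct)
{
    assert(layers >= 0);
    return layers * t_direct;
}

/* Time to paint the same passes in main memory, plus one put to the display. */
double mainmem_cost(int layers, double t_mem, double t_put)
{
    assert(layers >= 0);
    return layers * t_mem + t_put;
}

/* Smallest number of passes from which the mainmem buffer is strictly
 * cheaper, or -1 if the directbuffer is never slower per pass. */
int break_even_layers(double t_direct, double t_mem, double t_put)
{
    if (t_direct <= t_mem)
        return -1;

    int n = 1;
    while (directbuffer_cost(n, t_direct) <= mainmem_cost(n, t_mem, t_put))
        n++;
    return n;
}
```

With the numbers from the text, one pass costs 3 against 5 seconds and ten passes cost 30 against 14 seconds. Both variants cost 6 seconds at two passes, so break_even_layers(3.0, 1.0, 4.0) gives 3: from three passes on, the mainmem buffer wins.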
Why use the palemu target if the visual is not 8 bit? Why not use the non-8bit visual directly and do a ggiCrossBlit from my 8bit mem-target visual to it?

Because palemu is faster. Eh, this depends ;) Some of the ggi folks say CrossBlit is faster than palemu. I think palemu is faster, because it can hold palette dependent structures, calculate them once and use them often. The CrossBlit must recalculate them each time you call it. So palemu should be faster. Once I receive additional information, I will post it here. But until then, use palemu.
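
The "calculate once, use often" idea behind the palemu argument can be illustrated with a lookup table. This is NOT how GGI's palemu target is actually implemented (I have not checked that), just a sketch of the principle. It assumes a 16bit RGB565 display format. The type rgb16 and all function names are hypothetical. The 16bit colour components match the r, g, b arguments of FAST_setcolor.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical palette entry, 16 bits per channel as in FAST_setcolor. */
typedef struct { uint16_t r, g, b; } rgb16;

/* Pack one colour into an RGB565 pixel (assumed display format):
 * keep the top 5 bits of red, 6 of green, 5 of blue. */
uint16_t pack_rgb565(rgb16 c)
{
    return (uint16_t)(((c.r >> 11) << 11) | ((c.g >> 10) << 5) | (c.b >> 11));
}

/* Done once per palette change: translate all 256 entries. */
void build_lut(const rgb16 palette[256], uint16_t lut[256])
{
    for (int i = 0; i < 256; i++)
        lut[i] = pack_rgb565(palette[i]);
}

/* Done once per frame: one table lookup per pixel, no colour math. */
void convert_frame(const uint8_t *src, uint16_t *dst, size_t npixels,
                   const uint16_t lut[256])
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < npixels; i++)
        dst[i] = lut[src[i]];
}
```

The point of the design is that build_lut runs only when the palette changes, while convert_frame runs for every picture. A blit that cannot keep the table between calls has to redo the build_lut work each time.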
(erikyyy at erikyyy dot de, Erik Thiele)