how to do fastest possible access with ggi

I use the following strategy to do fastest possible output on ggi (http://www.ggi-project.org/).
I mailed this to the ggi mailing list, and if it is not 100% correct, i hope i will get email about it. I will then fix this document! I will also paste my code when it is finished.
erikyyy at erikyyy dot de    2000-02-02 08:51:20 CET

This document describes an abstraction layer for fast
graphics output. The interface is what YOU want (tm).
The implementation uses the GGI library (www.ggi-project.org)
and is of optimal speed, since i discussed it with the ggi
developers until all of them were satisfied with it.
if you have a faster solution FOR THE GIVEN PROBLEM, please mail me!

INTERFACE:

void FAST_init();
  called to initialize display output with a black 640x480x8 screen.

void FAST_deinit();
  called to close the display.

void FAST_setcolor(uint8 colornum, uint16 r, uint16 g, uint16 b);
  with this you can set all the 256 colors of the 8bit palette.
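
  a small sketch of filling the palette through this interface. it
  assumes that the 16bit components use the full range 0..65535 (as
  in ggi's ggi_color). the helpers to16 and set_grey_ramp and the
  demo_palette array are made up for this example, FAST_setcolor is
  replaced by a stub that only records the values so the snippet runs
  without a display, and the types are the stdint.h ones instead of
  uint8/uint16:

```c
#include <stdint.h>

/* Stand-in for the real FAST_setcolor(): it only records the values.
   (Assumption for this sketch; the real one would talk to ggi.) */
static uint16_t demo_palette[256][3];

static void FAST_setcolor(uint8_t colornum, uint16_t r, uint16_t g, uint16_t b)
{
    demo_palette[colornum][0] = r;
    demo_palette[colornum][1] = g;
    demo_palette[colornum][2] = b;
}

/* Scale an 8bit component to 16bit: 0 -> 0, 255 -> 65535. */
static uint16_t to16(uint8_t v)
{
    return (uint16_t)(v * 257u);
}

/* Hypothetical helper: fill all 256 palette entries with a grey ramp. */
static void set_grey_ramp(void)
{
    int i;
    for (i = 0; i < 256; i++) {
        uint16_t v = to16((uint8_t)i);
        FAST_setcolor((uint8_t)i, v, v, v);
    }
}
```

  multiplying by 257 instead of 256 makes 255 map to exactly 65535,
  so full white stays full white.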

uint8 *FAST_mainmem;
  this is a 640*480 bytes sized buffer. it is allocated when you
  call FAST_init();
  FAST_deinit() frees the buffer again.
  this buffer is raw linear data. if you work with it, the
  content of the display DOES NOT CHANGE. (you must call
  FAST_updatescreen() when you want this)
  the implementation never changes the content of the FAST_mainmem
  buffer.
  use this buffer to create your picture. you can use
  all of the host's capabilities to do so,
  e.g. cache, 64bit access, native byte ordering and the fastest possible access.
  (nothing is faster than the computer's own memory)
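
  a minimal sketch of drawing into such a buffer. it assumes that
  "raw linear data" means rows of 640 bytes packed without padding,
  so pixel (x,y) lives at offset y*640 + x. put_pixel, hline and the
  two #defines are made up for this example and are not part of the
  interface:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define FAST_WIDTH  640
#define FAST_HEIGHT 480

static uint8_t *FAST_mainmem;   /* the real FAST_init() would allocate this */

/* Set one pixel; pixel (x,y) lives at offset y*640 + x. */
static void put_pixel(int x, int y, uint8_t color)
{
    if (x < 0 || x >= FAST_WIDTH || y < 0 || y >= FAST_HEIGHT)
        return;                          /* clip instead of corrupting memory */
    FAST_mainmem[y * FAST_WIDTH + x] = color;
}

/* Fill a horizontal run of pixels with one memset call. */
static void hline(int x, int y, int len, uint8_t color)
{
    if (y < 0 || y >= FAST_HEIGHT)
        return;
    if (x < 0) { len += x; x = 0; }      /* clip on the left  */
    if (x + len > FAST_WIDTH)
        len = FAST_WIDTH - x;            /* clip on the right */
    if (len > 0)
        memset(FAST_mainmem + y * FAST_WIDTH + x, color, (size_t)len);
}
```

  nothing here touches the display. the picture only becomes visible
  on the next FAST_updatescreen().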

void FAST_updatescreen();
  calling FAST_updatescreen puts FAST_mainmem onto the display.
  After the call, the new stuff is visible to the user.


WHAT IS THIS GOOD FOR:

Programs like DOOM and quake. I.e. programs that do all their rendering
in software (FAST_mainmem) and then do a large put onto the display
(FAST_updatescreen()). The program usually changes the whole screen
every time.

If it instead only updates some parts of the screen,
the solution presented in this document is ABSOLUTELY NOT OPTIMAL SPEED!
(but hey it is easy to use ;)

If you do a 3D engine that puts solid color polygons, the solution
is also NOT OPTIMAL, because you could use hardware accelerated
horizontal lines for this.
of course everything that could be hardware accelerated won't
be hardware accelerated in this solution!

the program must use 640x480x8 mode.
the interface cannot do other modes (but this might be easy to change)


IMPLEMENTATION:

it works on every GGI target. so it also works when your
Xserver has 16bit instead of 8.

i will first tell how to do it, and then WHY this is the
fastest possible solution.


HOW TO DO IT:


void FAST_init();

do FAST_mainmem = (uint8*) malloc(640*480);
open a ggi visual 640x480x8. if this is not possible,
open a ggi visual 640x480x8 with the palemu target
and warn the user that the program would run MUCH faster
on an 8bit capable display. (e.g. an Xserver running in 8bit mode)
put the visual in asynchronous mode (very important for speed)
try to open two frames. if this is not possible,
open only one frame, which is always possible. see below.
do not do anything with directbuffers, it's useless. (see below why)


void FAST_deinit();

free(FAST_mainmem);
close all the ggi stuff you opened.


void FAST_updatescreen();

use ggiPutBox to put the FAST_mainmem into the invisible frame.
call a ggiFlush on your visual.
then you switch the display to the frame you just wrote, so it
becomes the visible one, and make the other frame the write frame.
(with ggiSetDisplayFrame and ggiSetWriteFrame)
if you do not have 2 frames, you just ggiPutBox into
the visible frame.
(you will then get a little flicker like in old vga320x200 days)
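
the frame bookkeeping behind these steps can be sketched as follows.
this is only the logic of "which frame do i write, which one is
shown", with the ggi calls left as comments so that it runs without
a display. the struct fast_frames and both functions are made up
for this sketch, they are not part of ggi:

```c
/* Hypothetical bookkeeping, not part of ggi. */
typedef struct {
    int nframes;   /* 1 or 2, depending on what FAST_init() managed to open */
    int visible;   /* number of the frame the user currently sees */
} fast_frames;

/* Frame the next ggiPutBox should land in: the invisible one if there
   are two frames, otherwise the only (visible) one. */
static int fast_write_frame(const fast_frames *f)
{
    return (f->nframes == 2) ? 1 - f->visible : f->visible;
}

/* Bookkeeping for one FAST_updatescreen() call.
   Returns the number of the frame that is visible afterwards. */
static int fast_update(fast_frames *f)
{
    int target = fast_write_frame(f);

    /* Real code would do roughly this here (vis being the open visual):
         ggiSetWriteFrame(vis, target);
         ggiPutBox(vis, 0, 0, 640, 480, FAST_mainmem);
         ggiFlush(vis);
         ggiSetDisplayFrame(vis, target);                              */

    f->visible = target;
    return target;
}
```

with two frames the visible frame alternates 1, 0, 1, ... on every
call. with one frame it stays 0, which is the flicker case.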


WHY THIS IS THE FASTEST AND BEST SOLUTION

you could use directbuffers. (if you have 2 frames !)
  advantage:
    - after painting you need not put the mainmem buffer onto
      the graphics card, since it is already there.
  disadvantage:
    - every access goes through the slow pci bus.
      if you do for example texture mapping and put pixel
      after pixel, this will be slower than
      doing texture mapping in the mainmem buffer and
      putting the whole buffer in 32bit bursts onto
      the graphics hardware
    - access is slow. this is bad when you for example
      first paint the background and then stack several big sprites
      on top of it.
      e.g. assume that one pass of writing in the directbuffer takes 3 sec.,
      one pass of writing in the mainmem buffer takes 1 sec.
      and ggiPutBox the mainmem buffer onto the directbuffer takes 4 sec.
      then if you only write once, writing in the directbuffer
      will take only 3 seconds. but writing in the mainmem buffer
      will take 1+4=5 seconds in total. so directbuffer wins.
      but if you overwrite 10 times, writing in the directbuffer
      takes 3*10=30 seconds, but in the mainmem buffer it takes
      1*10+4=14 seconds. so mainmem buffer wins.
    - if you do 32bit or 16bit access, some hardware cannot
      do this. (you can check this in some ggi structures)
      so you will have to fall back to 8bit.
      so in each of your graphics routines that use more
      than 8bit at a time, you must deal with this problem
      and also write some code for smaller access.
      On other architectures like the ALPHA, your program
      will die with a SIGBUS error if your 32bit or 16bit accesses
      are not aligned! So again you would need more complicated
      hardware dependent specialization in your routines.
    - ggiPutBox is always maximally optimized for the
      current kind of hardware. it will do burst accesses
      or whatever is possible. this is really much more efficient
      than if you directly fiddle around in the directbuffer,
      believe me ;)
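
the overwrite arithmetic from the list above (3, 1 and 4 seconds)
can be written down as a tiny cost model. the numbers are the made-up
ones from the text, not measurements, and the names frame_costs,
cost_directbuffer, cost_mainmem and break_even_passes are invented
for this sketch:

```c
/* Costs in seconds for one full-screen pass; illustrative only. */
typedef struct {
    double direct_write;   /* one pass written into the directbuffer   */
    double mainmem_write;  /* the same pass written into mainmem       */
    double putbox;         /* one ggiPutBox of the finished mainmem    */
} frame_costs;

/* Every pass goes over the bus. */
static double cost_directbuffer(const frame_costs *c, int passes)
{
    return c->direct_write * passes;
}

/* Passes are cheap, but one putbox is added per frame. */
static double cost_mainmem(const frame_costs *c, int passes)
{
    return c->mainmem_write * passes + c->putbox;
}

/* Smallest number of passes (up to limit) at which the mainmem buffer
   is strictly cheaper, or -1 if that never happens within limit. */
static int break_even_passes(const frame_costs *c, int limit)
{
    int n;
    for (n = 1; n <= limit; n++)
        if (cost_mainmem(c, n) < cost_directbuffer(c, n))
            return n;
    return -1;
}
```

with the numbers from the text, both ways cost 6 seconds at 2 passes,
and from 3 passes on the mainmem buffer wins (7 against 9 seconds).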

why not use a mem-target instead of the mainmem buffer and
do a ggiCrossBlit with it ?
  because this is exactly the same thing. it is exactly the same speed.
  but it is more complicated to code, so why should you do it.
  ggi folks told me that CrossBlit could do DMA bursts on some hardware.
  but if CrossBlit from mem-target to display does DMA bursts
  and ggiPutBox from malloced memory to display doesn't,
  then i think it is a ggi implementation problem.
  in theory, the mem-target could be a malloced buffer somewhere
  in a special area of memory which makes specially fast DMA bursts
  possible. then CrossBlit would be faster. If this ever happens, please
  tell me.

why use palemu target if visual is not 8 bit, why not use
the not 8 bit visual and do a ggiCrossBlit from my 8bit mem-target visual
to the not 8 bit visual ?
  because palemu is faster.
  eh. this depends ;) some of the ggi folks say CrossBlit is faster than
  palemu. I think palemu is faster, because it can hold palette
  dependent structures and calculate them once but use them often.
  the crossblit must recalculate them each time you call a crossblit.
  so palemu should be faster.
  once i receive additional information, i will post it here.
  but until then, use palemu.
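
  the "calculate once but use often" argument can be sketched like
  this. this is NOT palemu's actual code, only an illustration of the
  idea, assuming a 16bit RGB565 target. all names (lut565,
  lut_setcolor, convert_8_to_565) are made up:

```c
#include <stddef.h>
#include <stdint.h>

static uint16_t lut565[256];   /* one ready-made 16bit pixel per palette entry */

/* Called only when a palette entry changes. r, g, b are 16bit components;
   the top 5/6/5 bits of each are packed into one RGB565 value. */
static void lut_setcolor(uint8_t colornum, uint16_t r, uint16_t g, uint16_t b)
{
    lut565[colornum] =
        (uint16_t)(((r >> 11) << 11) | ((g >> 10) << 5) | (b >> 11));
}

/* Called once per frame: one table lookup per pixel, no colour math. */
static void convert_8_to_565(const uint8_t *src, uint16_t *dst, size_t npixels)
{
    size_t i;
    for (i = 0; i < npixels; i++)
        dst[i] = lut565[src[i]];
}
```

  the colour math runs 256 times at most per palette change, while
  the per-frame loop is a plain lookup. a conversion that has to
  rebuild such a table on every call pays for it every frame.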
(erikyyy at erikyyy dot de, Erik Thiele)