nanogui: Thread: Call for action - source code needed!

Subject: Call for action - source code needed!
From: Greg Haerr ####@####.####
Date: 18 Oct 1999 17:31:08 -0000
Message-Id: <01BF195C.16214CF0.greg@censoft.com>

On Monday, October 18, 1999 10:57 AM, Michael Engel ####@####.#### wrote:
: mwin runs nicely so far, but a little slow ...

: I will produce some new mwin stuff today that you can try.

All,
	I'm glad to hear that more of the list is trying out microwindows
on their palm PCs and other hardware.  It's nice to see this stuff used,
and I welcome the comments.  As for speed, basically everything
comes down to two routines, drawhorzline and bitblt.  At their lowest
level, for 8bpp and 16bpp, these routines ultimately rely on memcpy,
or a wmemcpy.  Coding these routines as while(--cnt >=0) *dst++ = *src++;
greatly slows them down, so I call memcpy.

	What I am looking for are inline versions of byte, word, and
double word memcpy's, for the 8bpp, 16bpp and 32bpp.  We need inline
so that the procedure call overhead is minimized.  In addition, the memcpy
routines need to check for odd or unaligned data, move it, then move to
double-word moves for the main loop, then end with moving odd or unaligned data.
I could write all these routines myself, but I'm looking for some __fast__ routines
that are known to work...  Any pointers or submissions would be appreciated.
This will _definitely_ speed up microwindows.
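
As a rough illustration of the head / main loop / tail structure described
above, here is a minimal sketch in C.  It is not code from microwindows or
glibc: the name copy_words is hypothetical, it assumes a 4-byte double word,
non-overlapping buffers, and memory that may legitimately be accessed as
uint32_t (a mapped frame buffer, for instance).  If source and destination
disagree in alignment it simply falls back to byte moves.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from microwindows): copy n bytes from src to dst.
 * Assumes the buffers do not overlap and may be accessed as uint32_t. */
static inline void copy_words(void *dstv, const void *srcv, size_t n)
{
    unsigned char *dst = dstv;
    const unsigned char *src = srcv;

    /* Head: single bytes until the destination is 4-byte aligned. */
    while (n > 0 && ((uintptr_t)dst & 3u) != 0) {
        *dst++ = *src++;
        n--;
    }

    /* Main loop: 32-bit moves, only when the source is aligned as well. */
    if (((uintptr_t)src & 3u) == 0) {
        uint32_t *d32 = (uint32_t *)(void *)dst;
        const uint32_t *s32 = (const uint32_t *)(const void *)src;

        for (; n >= 4; n -= 4)
            *d32++ = *s32++;

        dst = (unsigned char *)d32;
        src = (const unsigned char *)s32;
    }

    /* Tail: leftover bytes, or everything if the alignments disagree. */
    while (n > 0) {
        *dst++ = *src++;
        n--;
    }
}
```

Whether something like this beats the library memcpy on a given target is
exactly the kind of thing that needs measuring rather than assuming.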

Greg

Subject: Re: Call for action - source code needed!
From: Alan Cox ####@####.####
Date: 18 Oct 1999 17:44:05 -0000
Message-Id: <E11dGfK-0000Cw-00@the-village.bc.nu>

> 	What I am looking for are inline versions of byte, word, and
> double word memcpy's, for the 8bpp, 16bpp and 32bpp.  We need inline

For most platforms you won't beat the glibc memcpy functions. On a single
issue CPU you may want to look at the X macros (but keep a bucket handy)
that write these operations as a Duff's device. In fact for a 640 pixel wide
4bit frame buffer you can quite sanely expand the Duff's device out so you
do one pass (80 32bit ops) of


		movel D0, (A0)+
		movel D0, (A0)+
		...
		ret

You load D0 with the colour pattern, fix the ends by hand, then jmp to the
right point in the unrolled copy loop. You can unroll copies as well as colour
sets this way.
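
For readers who have not met it, the same idea can be sketched in portable C
as a Duff's device: a switch jumps into the middle of an unrolled store loop,
so the odd count is absorbed by the entry point.  This is only an
illustration, not the X macros themselves; the name fill_words_duff and the
unroll factor of 8 are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: store `count` copies of `pattern` starting at dst.
 * The case labels fall through into each other on purpose. */
static inline void fill_words_duff(uint32_t *dst, uint32_t pattern, size_t count)
{
    size_t passes;

    if (count == 0)
        return;

    passes = (count + 7) / 8;      /* trips through the unrolled body */
    switch (count % 8) {           /* jump to the right entry point   */
    case 0: do { *dst++ = pattern;
    case 7:      *dst++ = pattern;
    case 6:      *dst++ = pattern;
    case 5:      *dst++ = pattern;
    case 4:      *dst++ = pattern;
    case 3:      *dst++ = pattern;
    case 2:      *dst++ = pattern;
    case 1:      *dst++ = pattern;
            } while (--passes > 0);
    }
}
```

For the 640 pixel wide 4bit case above, one scanline would be
fill_words_duff(row, pattern, 80), where row points at the start of the line.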

You may want to get people to profile the binaries with gprof before you get
deep into this - be sure it's not something more fundamentally dumb going on
in the clipping code or similar.

Alan

Subject: RE: Call for action - source code needed!
From: Greg Haerr ####@####.####
Date: 18 Oct 1999 17:52:55 -0000
Message-Id: <01BF195F.0E0FB080.greg@censoft.com>

: You may want to get people to profile the binaries with gprof before you get
: deep into this - be sure it's not something more fundamentally dumb going on
: in the clipping code or similar.

Alan - thanks for the quick response.  The unrolled copy loop with a jump
into the middle sounds very interesting; I'll look at glibc also and perhaps just
inline that stuff.

	Writing the bitblt turned out to be quite hard to do right, and it's
still not totally right, especially because of some clipping issues.  So,
I have a quick routine (actually close to the one you originally modified of Dave's)
that determines whether the entire bitblt area is completely unobscured, or not.
In the completely visible case, I can see major speed differences depending on
the implementation of memcpy.  In the partially obscured case, I have
to resort to bit blitting by reading and writing every pixel, which is _slow as hell_.
The answer to the latter is having the engine chop each portion of the bitblit
rectangle into completely visible regions, and recursively calling bitblit on
that rectangle, but that was a bit much for this last weekend!!
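
The chopping step might look roughly like the sketch below, which cuts a blit
rectangle around one obscuring rectangle and returns up to four fully visible
pieces.  None of this is engine code: the Rect type, the half-open coordinate
convention and the name rect_subtract are assumptions made for the example,
and with several obscuring windows the split would have to be repeated on
each piece.

```c
/* Hypothetical rectangle type, half-open: x1 <= x < x2, y1 <= y < y2. */
typedef struct { int x1, y1, x2, y2; } Rect;

/* Write the parts of r not covered by hole into out[0..3] and return how
 * many were written.  Assumes r itself is non-empty. */
static int rect_subtract(Rect r, Rect hole, Rect out[4])
{
    int n = 0;

    /* No overlap at all: r is completely visible. */
    if (hole.x1 >= r.x2 || hole.x2 <= r.x1 ||
        hole.y1 >= r.y2 || hole.y2 <= r.y1) {
        out[n++] = r;
        return n;
    }

    /* Clamp the hole to r so the pieces below never leave r. */
    if (hole.x1 < r.x1) hole.x1 = r.x1;
    if (hole.y1 < r.y1) hole.y1 = r.y1;
    if (hole.x2 > r.x2) hole.x2 = r.x2;
    if (hole.y2 > r.y2) hole.y2 = r.y2;

    /* Full-width bands above and below the hole. */
    if (hole.y1 > r.y1) out[n++] = (Rect){ r.x1, r.y1, r.x2, hole.y1 };
    if (hole.y2 < r.y2) out[n++] = (Rect){ r.x1, hole.y2, r.x2, r.y2 };

    /* Left and right pieces beside the hole. */
    if (hole.x1 > r.x1) out[n++] = (Rect){ r.x1, hole.y1, hole.x1, hole.y2 };
    if (hole.x2 < r.x2) out[n++] = (Rect){ hole.x2, hole.y1, r.x2, hole.y2 };

    return n;
}
```

Each returned piece could then go through the fast memcpy path, and a return
value of 0 means the rectangle is completely hidden and nothing needs to be
drawn.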

Greg

Subject: Re: Call for action - source code needed!
From: Alan Cox ####@####.####
Date: 18 Oct 1999 17:56:09 -0000
Message-Id: <E11dGrP-0000Eu-00@the-village.bc.nu>

> In the completely visible case, I can see major speed differences depending on
> the implementation of memcpy.  In the partially obscured case, I have
> to resort to bit blitting by reading and writing every pixel, which is _slow as hell_.
> The answer to the latter is having the engine chop each portion of the bitblit
> rectangle into completely visible regions, and recursively calling bitblit on
> that rectangle, but that was a bit much for this last weekend!!

You may want to look at X11 here. X has a nice algorithm that builds a 
rectangle list from a set of clipping data and tends to output lots of wide
rectangles.

Alan

Subject: Re: Call for action - source code needed!
From: "Frank W. Miller" ####@####.####
Date: 18 Oct 1999 18:03:51 -0000
Message-Id: <199910181748.NAA22284@macalpine.cornfed.com>

> For most platforms you won't beat the glibc memcpy functions. On a single

The *BSD kernel bcopy routines are quite fast as well, also assembly
and unencumbered.

Later,
FM

--
Frank W. Miller
Cornfed Systems Inc
www.cornfed.com