nanogui: Thread: Ways to speed up a simple application?

Subject: Ways to speed up a simple application?
From: "Ricardo P. Jasinski" ####@####.####
Date: 13 Mar 2009 02:42:08 -0000
Message-Id: <ee9633130903121941n2de7bfd9g515fc1960119f60e@mail.gmail.com>

Hi everyone,
we are currently developing an application using nano-X, FLTK2 and uClinux;
our hardware platform is a 32-bit Nios-II processor running at 100 MHz.

Our application presents a series of screens sequentially. Each screen
consists of some text, shown over a background image (jpg/png). Whenever the
user presses a button, the application should advance to the next screen.
This screen switching process should happen as fast as possible.

We have a working prototype, and it looks really great when the application
is running, but unfortunately there's a noticeable lag when switching
between the screens. I have profiled it (with my stopwatch) and determined
that it takes up to 8 seconds to flip from one screen to the next (this
figure is for 1,000 characters drawn on the screen).
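
A rough sanity check on these figures, assuming the CPU is busy for the
whole redraw (the thread does not confirm this): 8 seconds at 100 MHz
spread over 1,000 characters comes to roughly 800,000 clock cycles per
character. The helper name below is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Rough cycle budget per drawn character.
 * clock_hz: CPU clock in Hz; seconds: duration of the whole redraw;
 * chars: number of characters drawn.
 * Assumption: the CPU is fully occupied for the entire redraw. */
static uint64_t cycles_per_char(uint64_t clock_hz, uint64_t seconds,
                                uint64_t chars)
{
    assert(chars > 0);
    return (clock_hz * seconds) / chars;
}
```

A budget of that size per character suggests the time is going into
per-pixel or per-glyph overhead rather than into the raw memory writes.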

The screen changing takes place inside a callback function, which is
activated when the user presses any button (this button is a fltk::Button
widget). We have profiled this callback, and it takes about 100 milliseconds
altogether, including a call to fltk::redraw(). It is evident that most of
the time is spent after the callback is finished, when the process returns
to the fltk::run() loop.

Just to be clear, all windows are created when the application is
initialized, which takes about 30 seconds. During this period, all fonts and
images are created (with SharedImage::get and fltk::setfont). Afterwards,
the screen changing is done by updating the widgets (labels) and bringing to
front the appropriate window  (via GrRaiseWindow and GrSetFocus).

I'll give a little more detail further in this message, but the essential
question is: can you think of any way (or ways) to speed up this process? I
really feel that it shouldn't be taking that long just to draw a few strings
of text.

Here are some suggestions of my own:
   - create duplicate windows and update the widgets before the user presses
any button. This way, the screens would be drawn beforehand, and not take
any time after the button press. The drawback is that many duplicate windows
would have to be created and updated, since the text contents also depend on
the button that has been pressed
   - profile and optimize any drivers and low-level code where all this time
might be being spent
   - draw everything off screen and later just copy it to the visible
region, somehow
   - (your ideas here, please!)

Here's a little more detail:
  - screen resolution: 600x800 (portrait mode), framebuffer
(Packed-16bit-5/5/5)
  - amount of text per screen: usually between 50 and 250 characters (UTF8)
  - all text is drawn as the "text" attribute of fltk::Button and
fltk::Widget objects
  - we are using a slightly modified fbportrait_left.c file, provided by Uwe
Klatt in this mailing list (thanks, Uwe!)
  - the screen is not drawn any faster when there's not a background image
  - the application doesn't need to process mouse input; users interact via
special hardware buttons which, when pressed, notify the application via
GrInjectKeyboardEvent. The fltk::Button widget receives this event because
the Button::shortcut attribute is set accordingly.

Another thing I'd appreciate some confirmation on: we are running the
application in portrait mode (setportrait left). I have a slight
impression that things would be faster in "normal" (landscape) mode.
Does that make any sense?
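
One plausible reason portrait mode could cost time is memory layout.
If the physical framebuffer is 800x600 landscape and portrait
coordinates are rotated onto it, two pixels that are horizontal
neighbours in portrait end up a full scanline apart in memory, so
horizontal runs such as text rows lose sequential access. The sketch
below assumes one particular rotation convention; the actual mapping in
fbportrait_left.c may differ, and all names here are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Physical (landscape) framebuffer geometry, 16 bits per pixel. */
enum { PHYS_W = 800, PHYS_H = 600, BYTES_PER_PIXEL = 2 };

/* Assumed rotation: portrait (x, y) -> physical (y, PHYS_H - 1 - x).
 * Portrait x runs 0..599, portrait y runs 0..799. */
static size_t portrait_to_offset(int x, int y)
{
    assert(x >= 0 && x < PHYS_H && y >= 0 && y < PHYS_W);
    int px = y;
    int py = PHYS_H - 1 - x;
    return ((size_t)py * PHYS_W + (size_t)px) * BYTES_PER_PIXEL;
}

/* Byte distance between two horizontally adjacent portrait pixels. */
static size_t portrait_x_stride(int x, int y)
{
    size_t a = portrait_to_offset(x, y);
    size_t b = portrait_to_offset(x + 1, y);
    return a > b ? a - b : b - a;
}
```

Under this assumption each step along a portrait text row jumps 1600
bytes instead of 2, which tends to defeat small data caches.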

Please guys, any help on this matter would be greatly appreciated, since in
the current stage our application is not really "usable". We are running out
of ideas that can be easily tried, so before starting any other tests I
would really like to hear from you guys!

Thanks a lot for any thoughts,

Ricardo Jasinski.

Subject: Re: [nanogui] Ways to speed up a simple application?
From: "Ricardo P. Jasinski" ####@####.####
Date: 13 Mar 2009 18:09:48 -0000
Message-Id: <ee9633130903131108x1b98e963i82d1cba3aa9e249f@mail.gmail.com>

Hello Michael,

our FPGA hardware and kernel image were developed around the wiki example
for the DE2 board (unfortunately, we don't have any NEEK around yet).

We got good response times using nano-X only (that is, without FLTK).
Windows displaying only text strings were updated almost immediately. As for
dragging a window around on the screen, it was also performed very fast. We
noticed that increasing Nios cache memories had a great effect on this.

I wish I knew more about this whole "blit" thing, since it always comes up
when people are discussing performance issues. From what I've read in the
Nano-X Architecture document, it consists of drawing things off-screen and
later performing a fast copy to the framebuffer memory. This raises a few
questions:

   1) How exactly does it make things run faster? Aren't we just adding
another step to the process? As I understand it, in our hardware the system's
video memory is just a region in the SDRAM chip. The uClinux/nano-X screen
driver simply writes to it. Then the video controller hardware reads this
same region via DMA and generates SVGA signals. If we had two different
memory chips working at different speeds, I would agree that this could make
things faster, but in our case is this necessarily true?

    2) I agree that with the FPGA we have a great flexibility and it would
be relatively easy to change the hardware if we have to. Can you give an
idea of what this hardware accelerator should do?

Thanks for your comments and ideas,

Ricardo.


2009/3/13 Michael Schnell ####@####.####

> I'm not an expert on this at all, but I do have a nano-X application
> running on NIOS (in fact a NEEK dev-kit). If you have a NEEK, you might
> want to try it, it's available in the Wiki as an Altera Application Loader
> application.
>
> Here I move a text window around on the screen. It uses about a quarter of
> the screen  and after the location change it in fact needs two seconds for
> updating the screen.
>
> I suppose the only way to speed this up is providing hardware support
> instead of doing a "dumb" framebuffer. AFAIK, nano-X does support hardware
> accelerated screens and it should be quite easy to do hardware support for
> things like "blit" in the FPGA.
>
> -Michael
>



-- 
-------------------------------------------------------
Ricardo Pereira Jasinski
####@####.####
Tel: (41) 9955-2852

LME - Laboratório de Microeletrônica da UTFPR
   UTFPR Microelectronics Lab
   www.lme.cpdtt.cefetpr.br
   Tel: +55 41 3310 4756

Subject: Re: [nanogui] Ways to speed up a simple application?
From: "Ricardo P. Jasinski" ####@####.####
Date: 13 Mar 2009 18:40:23 -0000
Message-Id: <ee9633130903131139h193df532w70d2d1c560aced52@mail.gmail.com>

Hi John,

thanks for sharing your thoughts. Please let me elaborate a little more on
the description of our system.

Our hardware has two video outputs: one regular DB-15 connector for standard
PC monitors (SVGA resolution) and one LVDS output for a TFT LCD display
(LG-Philips LB104S01-TL01) which is always attached.

The hardware that we based our design on featured only the SVGA output; so,
we just added a custom hardware component that converts the VGA signals to
the serial LVDS interface used by the LCD. It has yielded excellent results
and there is no flickering at all.

I think I've used a poor choice of words when I said "switching between
screens"; we are actually switching between application windows, which
happen to be maximized and take up the full screen. Sorry if it misled you.

It is visually noticeable that what is taking so much time is the drawing
part of the process. The screen (window) goes blank for a while, and then it
gets updated almost at once (from right to left, since we are in portrait
mode). The time it stays blank is roughly proportional to the number of
elements (e.g., characters) that must be drawn.

Thanks again,

Ricardo.

2009/3/13 Bosch, John <...>

> Hi Ricardo Jasinski,
>
> I have not used nano-x but do get the emails and have worked on TFT
> displays as well as rasterized.  TFT are nice and have great clarity.
> The rasterized display was a major pain (very bad and contradictory
> documentation on the glass) with only okay results on clarity. And the
> processor chip did not really support rasterized as advertised. Current
> design has no problem flipping between screens instantaneously. You may
> be limited by the hardware design.
>
> I would optimize the driver.
> Use three buffers: two render buffers (BUFA, BUFB) plus the display
> buffer.  One render buffer is always being copied to the display buffer
> while the other is being rendered to; after each copy to the display
> buffer, flip which one is copied from and which one is rendered to.
>
> Render ---> BUFA
>
>            BUFB -------> DISPLAYBUFF
>
>
>
>            BUFA -------> DISPLAYBUFF
>
> Render ---> BUFB
>
>
> The display buffer should be DMA to the actual display.  The DMA should
> drive the flip, if I remember correctly.
>
> This works very well with no flicker in the display at all and no
> worries for the application level in how and when it updates the
> display.

Subject: Re: [nanogui] Ways to speed up a simple application?
From: Michael Schnell ####@####.####
Date: 16 Mar 2009 11:09:59 -0000
Message-Id: <49BE3323.1040106@lumino.de>

I'm not an expert on this at all, but I do have a nano-X application
running on NIOS (in fact a NEEK dev-kit). If you have a NEEK, you
might want to try it, it's available in the Wiki as an Altera
Application Loader application.

Here I move a text window around on the screen. It uses about a quarter
of the screen  and after the location change it in fact needs two
seconds for updating the screen.

I suppose the only way to speed this up is providing hardware support
instead of doing a "dumb" framebuffer. AFAIK, nano-X does support
hardware accelerated screens and it should be quite easy to do hardware
support for things like "blit" in the FPGA.

-Michael

Subject: Re: [nanogui] Ways to speed up a simple application?
From: Michael Schnell ####@####.####
Date: 16 Mar 2009 11:30:32 -0000
Message-Id: <49BE37A4.3010306@lumino.de>

> I wish I knew more about this whole "blit" thing, ...
This is easy :-) .

AFAIK, with "accelerated" graphics you distinguish between
processor-based memory and "card"-based memory. The CPU can only access
the processor memory directly; the card memory sits behind a driver
that uses the card's hardware interface. That makes access by the
processor very slow, but allows very fast access by any graphics hardware
function. Bitmaps can be either in the processor memory or in the card
memory. If they are in the card memory, they can be either "on screen"
(visible) or "off-screen" (invisible).

The "blit" functions that the driver provides copy a bitmap (rectangle)
from one memory location to another: using the processor if only
processor memory is involved, using some hardware "processor" on the
card if only card memory is involved, and using the driver's
hardware interface if both are involved.
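
For the processor-memory-only case, a blit can be sketched as a
row-by-row rectangle copy. This is an illustration only, not the Nano-X
driver code; the Bitmap type and the blit() signature are hypothetical,
and the source and destination rectangles are assumed not to overlap.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* A 16-bit-per-pixel bitmap in ordinary processor memory. */
typedef struct {
    uint16_t *pixels;
    int width;   /* pixels per row */
    int height;  /* number of rows */
} Bitmap;

/* Copy a w x h rectangle from src at (sx, sy) to dst at (dx, dy).
 * One memcpy per row, so there is no per-pixel function call.
 * The two rectangles must not overlap in memory. */
static void blit(Bitmap *dst, int dx, int dy,
                 const Bitmap *src, int sx, int sy, int w, int h)
{
    assert(w >= 0 && h >= 0);
    assert(sx >= 0 && sy >= 0 &&
           sx + w <= src->width && sy + h <= src->height);
    assert(dx >= 0 && dy >= 0 &&
           dx + w <= dst->width && dy + h <= dst->height);

    for (int row = 0; row < h; row++) {
        const uint16_t *from =
            src->pixels + (size_t)(sy + row) * src->width + sx;
        uint16_t *to =
            dst->pixels + (size_t)(dy + row) * dst->width + dx;
        memcpy(to, from, (size_t)w * sizeof(uint16_t));
    }
}
```

A hardware blitter performs the same rectangle copy without occupying
the CPU, which is where the speed-up on accelerated cards comes from.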

-Michael

Subject: Re: [nanogui] Ways to speed up a simple application?
From: Michael Schnell ####@####.####
Date: 16 Mar 2009 11:33:55 -0000
Message-Id: <49BE3837.8060109@lumino.de>

>    1) How exactly does it make things run faster? Aren't we just adding
> another step to the process? 
Copying stuff around in the "card" is done very fast by a hardware
processor; the trade-off is that moving stuff _into_ the card gets
slower. So the idea is to keep as many bitmaps (and things like
character definitions) in the card as possible.

-Michael

Subject: Re: Ways to speed up a simple application?
From: "Aaron J. Grier" ####@####.####
Date: 16 Mar 2009 19:23:27 -0000
Message-Id: <20090316192218.GC3628@arwen.poofy.goof.com>

On Fri, Mar 13, 2009 at 07:39:21PM +0100, Ricardo P. Jasinski wrote:
> It is visually noticeable that what is taking so much time is the
> drawing part of the process. The screen (window) goes blank for a
> while, and then it gets updated almost at once (from right to left,
> since we are in portrait mode). The time it stays blank is roughly
> proportional to the number of elements (e.g., characters) that must be
> drawn.

to mitigate this a little, don't clear the existing screen until the new
one is ready.  draw the new screen to an off-screen buffer, then copy
off-screen to on-screen.  this won't make the change operation any
faster, but it will mean less time staring at a blank screen.
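
a minimal sketch of that idea, using plain memory buffers rather than
the Nano-X API so that it stands alone. in Nano-X itself the same
pattern would presumably use an off-screen pixmap (GrNewPixmap) and an
area copy (GrCopyArea). the Display type and the compose()/present()
names are hypothetical, and the solid fill stands in for the slow
background and text drawing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint16_t *visible;    /* stands in for the framebuffer */
    uint16_t *offscreen;  /* same-sized scratch buffer in ordinary RAM */
    size_t pixels;        /* number of pixels in each buffer */
} Display;

/* Slow step: build the next screen without touching the visible one. */
static void compose(Display *d, uint16_t colour)
{
    assert(d != NULL && d->offscreen != NULL);
    for (size_t i = 0; i < d->pixels; i++)
        d->offscreen[i] = colour;
}

/* Fast step: publish the finished screen with a single copy. */
static void present(Display *d)
{
    assert(d != NULL && d->visible != NULL && d->offscreen != NULL);
    memcpy(d->visible, d->offscreen, d->pixels * sizeof(uint16_t));
}
```

the old screen stays up for the whole compose() step. at 600x800 and
16 bits per pixel, present() moves 960,000 bytes per full-screen update.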

it sounds like you need to look at the character draw routines.  are you
using truetype, or the plain bitmap fonts?

I recall putting quite a bit of work into speeding up font draws on the
branch at work, but only some of these changes made it back into the
official trunk.  the most radical outstanding change was the
introduction of a low-level text draw routine, which I heavily optimized
for the hardware at hand.  the code is available at
http://frye.com/~aaron/microwin-frye-nanosteve.1.tar.bz2 .  "nanosteve"
is just an internal name of the branch.  this is technically a cousin of
0.89.  merging with 0.90 and later has been on my todo list for years,
but since doing so won't help my employer ship product, it's necessarily
always at the bottom of the list.

one of the worst CPU-sucking offenders for text draw (which still
appears to be in the CVS tree) is drawbitmap, which draws
pixel-by-pixel.  the per-pixel overhead is a killer -- just moving
direct framebuffer access into the loop saves a significant number of
cycles.  see FB7k_DrawText() in drivers/scr_fryebox.c .
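
to illustrate the difference (this is not the microwindows or
scr_fryebox.c code; every name below is hypothetical), compare a glyph
draw that calls a generic pixel routine for each set bit with one that
clips once and then writes through a row pointer. the glyph is assumed
to be 1bpp, 8 pixels wide, one byte per row, most significant bit
leftmost.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 16bpp framebuffer description. */
typedef struct {
    uint16_t *base;
    int width;   /* pixels per scanline */
    int height;  /* number of scanlines */
} Framebuffer;

/* Generic entry point: bounds check and address math on every call. */
static void draw_pixel(Framebuffer *fb, int x, int y, uint16_t colour)
{
    if (x < 0 || y < 0 || x >= fb->width || y >= fb->height)
        return;
    fb->base[(size_t)y * fb->width + x] = colour;
}

/* Slow path: one draw_pixel() call per set bit. */
static void draw_glyph_per_pixel(Framebuffer *fb, int x, int y,
                                 const uint8_t *rows, int nrows,
                                 uint16_t fg)
{
    for (int r = 0; r < nrows; r++)
        for (int c = 0; c < 8; c++)
            if (rows[r] & (0x80u >> c))
                draw_pixel(fb, x + c, y + r, fg);
}

/* Faster path: check the glyph fits once, then write rows directly. */
static void draw_glyph_direct(Framebuffer *fb, int x, int y,
                              const uint8_t *rows, int nrows,
                              uint16_t fg)
{
    assert(x >= 0 && y >= 0 &&
           x + 8 <= fb->width && y + nrows <= fb->height);
    for (int r = 0; r < nrows; r++) {
        uint16_t *dst = fb->base + (size_t)(y + r) * fb->width + x;
        uint8_t bits = rows[r];
        for (int c = 0; c < 8; c++, bits <<= 1)
            if (bits & 0x80u)
                dst[c] = fg;
    }
}
```

both routines produce the same pixels; the second avoids a call, a
bounds check and a multiply for every pixel drawn.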

a lot of accelerated video cards also have the ability to do 1bpp
expansion to the native screen depth, and this can give even more speed.
see FB7kBitBLT_DrawText() in drivers/scr_fryebox.c .

in hindsight, creating a low-level DrawBitmap function (as indicated in
engine/devdraw.c) would likely give the widest benefit.

how deep down the rabbit hole do you want to go?  (=

-- 
  Aaron J. Grier | "Not your ordinary poofy goof." | ####@####.####

Subject: Re: [nanogui] Ways to speed up a simple application?
From: Michael Schnell ####@####.####
Date: 17 Mar 2009 08:34:32 -0000
Message-Id: <49BF5EA0.1000206@lumino.de>

> Can you give an
> idea of what this hardware accelerator should do?
>   
Take a look at the functions of a PC's video card.

-Michael

Subject: Re: [nanogui] Ways to speed up a simple application?
From: "Ricardo P. Jasinski" ####@####.####
Date: 17 Mar 2009 18:16:28 -0000
Message-Id: <ee9633130903171115p72b9c0f0kc7e74271dc4dd125@mail.gmail.com>

Thanks for the nice introduction, this is more than I could ask for!

Since we are apparently working on similar platforms, I would like to
ask you a few more questions if you don't mind...
  - this so-called "card memory", should it necessarily be implemented
with embedded ram blocks available inside the FPGA?
  - wouldn't we achieve similar results by instantiating a
tightly-coupled data memory and placing the crucial bitmaps/fonts in
it? (I'm not experienced enough to know what this change might
involve, though)

Since this hardware acceleration approach is starting to seem more
complicated than I'd have wished (i.e., involves more than a little
hardware and drivers tweaking), I think I will try other alternatives
before I get back to this path, but anyway I'd be really interested in
hearing your opinion about how we could take advantage of the FPGA in
order to speed things up a bit.

Thanks again,

Ricardo.


>> I wish I knew more about this whole "blit" thing, ...
>
> This is easy :-) .
> AFAIK, with "accelerated" graphics you distinguish between processor-memory-based and "card"-based memory. The CPU can only access the processor memory directly, but the card memory is behind some driver using the card's hardware interface and that makes access by the processor very slow, but allows very fast access by any graphic hardware function. Bitmaps can be either in the processor memory or in the card memory. If they are in the card memory, they can either be "on screen" (visible) or "off-screen" (invisible).
>
> The "blit" functions the driver provides now copy a bitmap (rectangle) from one memory location onto another, using the processor if only processor memory is involved, using some hardware "processor" on the card, if only card-memory is involved and using the driver hardware-interface if both are involved.
>
> -Michael
>




Subject: Re: [nanogui] Re: Ways to speed up a simple application?
From: Ricardo Jasinski ####@####.####
Date: 18 Mar 2009 01:48:26 -0000
Message-Id: <ee9633130903171847w72b82130p15fdb25d274d9155@mail.gmail.com>

> ---------- Forwarded message ----------
> From: Aaron J. Grier ####@####.####

> to mitigate this a little, don't clear the existing screen until the new
> one is ready.  draw the new screen to an off-screen buffer, then copy
> off-screen to on-screen.  this won't make the change operation any
> faster, but it will mean less time staring at a blank screen.

Good point, I'll look into that. However, I'm not sure how it could be
done, since all I do is update the button labels; everything else is
done after the callback function ends, and control returns to the
fltk::run loop.

> it sounds like you need to look at the character draw routines.  are you
> using truetype, or the plain bitmap fonts?

We are using freetype2; the only font we use is DejaVu Sans Bold at
48, 36, 30, 24 and 18 points. We call setfont() once for each of
these sizes when starting up the application; if we didn't, the
application would hang for a while when the font was used for the
first time.

What do you think, should we implement these fonts as "built-in" /
"compiled-in" fonts? Do you believe this could result in a significant
speed boost?

> I recall putting quite a bit of work into speeding up font draws on the
> branch at work, but only some of these changes made it back into the
> official trunk.  the most radical outstanding change was the
> introduction of a low-level text draw routine, which I heavily optimized
> for the hardware at hand.  the code is available at
> http://frye.com/~aaron/microwin-frye-nanosteve.1.tar.bz2 .  "nanosteve"
> is just an internal name of the branch.

Thanks! I took a look at your code and the datasheet for the 506
graphics controller. Thanks to that and Michael's previous messages, I
can understand the role of the bitblt engine. As I see it, sometimes you
can just sit back and feed its fifo with data. Nice!

> one of the worst CPU-sucking offenders for text draw (which still
> appears to be in the CVS tree) is drawbitmap, which draws
> pixel-by-pixel.  the per-pixel overhead is a killer -- just moving
> direct framebuffer access into the loop saves a significant number of
> cycles.  see FB7k_DrawText() in drivers/scr_fryebox.c .

I see. So, if I decide to start hacking the code this would be one of
the most promising starting points.

> a lot of accelerated video cards also have the ability to do 1bpp
> expansion to the native screen depth, and this can give even more speed.
> see FB7kBitBLT_DrawText() in drivers/scr_fryebox.c .

Awesome. Maybe this is something I can implement in our system, since
we are using a soft-core processor and custom instructions may be
added to its instruction set. Btw, maybe you have an idea for an
operation that could speed things up significantly if implemented in
hardware?

> in hindsight, creating a low-level DrawBitmap function (as indicated in
> engine/devdraw.c) would likely give the widest benefit.

Do you mean in case I had hardware accelerated graphics? Or is it
something I can do with the framebuffer only?

> how deep down the rabbit hole do you want to go?  (=

Let me think. I think I would start with the first choice in the list
below, and go down all the way to the bottom, until I achieve the
performance we need:
   - some structural change in my application source code
   - some hardware tweaking that could be done without touching any source code
   - driver optimizations that could be done without touching the
application code

And, what would be the most fun of all, implement some sort of
hardware acceleration, but I don't think I'd have the time to do that
within our current deadlines. Anyway, given enough dead ends, I've
seen many deadlines change...   :)

Thanks for sharing your thoughts!

Ricardo Jasinski.