Using hardware acceleration for graphics

I am one of our QWS developers. QWS, the Qt Window System, is the heart of Qt for Embedded Linux, formerly Qtopia Core, formerly Qt/Embedded. :-) What's great about working on embedded is that you have a view of the system as a whole - the complete stack. So it's my job to have a pretty good idea how the QPainter commands you write in a widget's paint event end up as voltage levels rapidly alternating between +3.3 V and 0 V on the wires going to your LCD. While QWS handles all the usual window system tasks such as keyboard focus and mouse events, the biggest component is probably graphics. The window system is inherently part of the graphics stack and I actually spend a lot of my time working with the graphics team.

Over the last year our clear focus has been on performance. This performance push has been in all areas, on all platforms and architectures. When it comes to graphics, there's a very broad range of hardware Qt can expect to see - from a simple MMU-less ARM with only a framebuffer all the way to scary gamer PCs with thousands of graphics cards & neon lights installed. Our lives are made complicated because, if we're not careful, we can end up writing code which runs faster on that little ARM than it does on the gamer PC. This post hopes to explain why this is so.

Let's begin by categorising the range of hardware available:

First, we need to differentiate Unified Memory Architecture (UMA) devices from those with dedicated graphics memory. Generally, high-end hardware will have dedicated graphics memory whereas low-end devices will just use system memory (sometimes reserving a memory region, sometimes not). This is pretty straightforward on PCs - you can almost tell from a PC's price tag if it has dedicated graphics memory. Sadly, in the world of embedded devices, this is not the case. High-end devices often have UMA and low-end devices (especially set-top boxes) have dedicated graphics memory.

The next differentiation is the graphics operations supported by the hardware. Generally they are wide ranging but can be loosely categorised as:

1) No acceleration (framebuffer only)
2) Blitter & alpha-blending hardware
3) Path based 2D vector graphics
4) Fixed-function 3D
5) Programmable 3D

Hardware with no acceleration whatsoever or a simple video overlay is the most common we see in embedded devices. This will always be the case until someone figures out how to design and manufacture silicon for free. Blitter and alpha-blending hardware is almost non-existent on desktops these days, but it does seem to still be around in the current generation of embedded hardware. Path based 2D vector graphics is pretty new and looks ready to replace blitter-only style hardware. NOTE: This does not refer to hardware which can draw a 1-pixel wide, non-anti-aliased, non-dashed, solid-colour line without clipping. Fixed-function 3D tends to be the older generation of desktop graphics processors. Generally, fixed function has pretty much been replaced with programmable 3D. This is even the case on mobile hardware.

So, there are five categories of operations and two types of memory architecture, leading to ten different overall types of graphics hardware. I've collected an example of each, just so you know we don't make this stuff up. :-)

Type             UMA                Non-UMA
None             Marvell PXA270     Various*
Blitter          NXP PNX8935**      Fujitsu Lime MB86276***
2D vector        Freescale i.MX35
Fixed-3D         Freescale i.MX31   nVidia GeForce 2
Programmable-3D  TI OMAP3530        AMD Radeon HD 4600

* Various: Some devices use dedicated framebuffer memory to reduce load on the system memory bus
** NXP PNX8935: http://www.nxp.com/applications/set_top_box/ip_stb/stb225/
*** Fujitsu Lime MB86276: http://www.fujitsu.com/downloads/MICRO/fma/pdf/MB86276.pdf

The next question then becomes: How can Qt off-load graphics operations to these different types of hardware? Well, this is done through Qt's QPaintEngine API. The idea is that Qt applications (& Qt itself of course) always use QPainter, which in turn uses one of the paint engines. To take advantage of graphics acceleration, we write a new paint engine (like the OpenGL ES 2.0 engine we've added in 4.5.0). The advantage is that existing applications can benefit from new rendering back ends and new applications can still work on older or less advanced hardware (albeit with lower performance). There seems to be a misconception in the community that Qt is out-of-date because it has no OpenGL scene graph API. While that statement is technically correct, Qt does have the QGraphicsView scene graph API, which uses QPainter. Because it uses QPainter, if OpenGL (for example) is available, it can be used as the rendering back end.
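To make the layering concrete, here is a minimal sketch of the idea in plain C++. It is not Qt's actual code: SketchPaintEngine, SketchRasterEngine, SketchGpuEngine, SketchPainter and paintWidget are all hypothetical names invented for this illustration, and the real QPaintEngine interface is far richer. It only shows that the "application" function is written once and does not care which back end sits behind the painter.

```cpp
#include <string>

// Hypothetical stand-in for a paint engine: one interface, many back ends.
class SketchPaintEngine {
public:
    virtual ~SketchPaintEngine() = default;
    virtual std::string name() const = 0;
    virtual void drawRect(int x, int y, int w, int h) = 0;
};

// Back end that records how many pixels a software rasteriser would touch.
class SketchRasterEngine : public SketchPaintEngine {
public:
    std::string name() const override { return "raster"; }
    void drawRect(int, int, int w, int h) override { pixelsTouchedByCpu += w * h; }
    long pixelsTouchedByCpu = 0;
};

// Back end that records commands handed to (imaginary) graphics hardware.
class SketchGpuEngine : public SketchPaintEngine {
public:
    std::string name() const override { return "gpu"; }
    void drawRect(int, int, int, int) override { ++commandsSubmitted; }
    int commandsSubmitted = 0;
};

// Stand-in for the painter: application code only ever talks to this class.
class SketchPainter {
public:
    explicit SketchPainter(SketchPaintEngine &engine) : engine_(engine) {}
    void drawRect(int x, int y, int w, int h) { engine_.drawRect(x, y, w, h); }
private:
    SketchPaintEngine &engine_;
};

// "Application" code: written once, unaware of which engine does the work.
inline void paintWidget(SketchPainter &painter) {
    painter.drawRect(0, 0, 10, 10);
    painter.drawRect(5, 5, 20, 20);
}
```

Calling paintWidget() with a painter built on either engine runs the same drawing code; only the engine's bookkeeping differs, which is the property that lets existing applications pick up a new back end without changes.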

So, now that's cleared up, what QPaintEngines are there and do we have all the hardware acceleration types covered by them?

Well, for devices with no acceleration, Qt will use its raster paint engine. The raster engine has seen some very impressive optimizations in Qt 4.5, as Gunnar has previously blogged about. For higher-end graphics hardware, there's usually a nice high-level API which is powerful enough to express all of QPainter, i.e. OpenGL & OpenVG. The trouble we've recently hit is the hardware in between, i.e. those with blitters but not much else. Such hardware is not powerful enough to express the whole of QPainter, so we must fall back to the raster paint engine for unsupported operations. The raster paint engine needs a pointer to the memory it renders to (and reads from). On UMA systems, this is not a problem as the buffer is obviously in system memory (that's all there is!). It's on systems with dedicated graphics memory where the fun begins...

First, on many systems, you simply cannot map graphics memory into your process' address space - the architecture simply has no way to do it. On such systems, the buffer must be copied to system memory, rendered to with the raster engine, then copied back. If this happens _every_ time you switch between a fall-back and the hardware, it's going to be _slow_!!
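A rough way to see why the switching matters is to count the copies. The sketch below is a toy cost model, not anything from Qt: Op and bytesCopied are hypothetical names, and it assumes the whole buffer is copied on every switch and that the buffer has to end up back in graphics memory to be displayed.

```cpp
#include <cstddef>
#include <vector>

// Blit: the hardware can do it. Fallback: the raster engine has to do it.
enum class Op { Blit, Fallback };

// Counts the bytes copied between graphics memory and system memory for a
// sequence of operations, assuming graphics memory cannot be mapped.
inline std::size_t bytesCopied(const std::vector<Op> &ops, std::size_t bufferBytes) {
    std::size_t total = 0;
    bool inSystemMemory = false;  // the buffer starts out in graphics memory
    for (Op op : ops) {
        if (op == Op::Fallback && !inSystemMemory) {
            total += bufferBytes;  // download before the CPU can touch it
            inSystemMemory = true;
        } else if (op == Op::Blit && inSystemMemory) {
            total += bufferBytes;  // upload before the hardware can use it
            inSystemMemory = false;
        }
    }
    if (inSystemMemory)
        total += bufferBytes;      // final upload so the result can be shown
    return total;
}
```

Under these assumptions, the sequence Blit, Fallback, Blit, Fallback costs four full-buffer copies, while the same four operations ordered Blit, Blit, Fallback, Fallback cost only two, and a sequence made purely of blits costs none.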

On some systems (particularly PowerPC for some reason?), the graphics controller sits on the SoC's external bus and can be addressed directly by an application. All that needs to happen is for the kernel to configure the process's page table to point to the graphics controller's memory range. It's then up to the graphics controller to access data in its dedicated memory on behalf of the host CPU. Although this kind of set-up does allow the raster paint engine to get a pointer to graphics memory, all accesses go over this external bus - which is usually slow. On PC/x86 architecture, things get even more complicated: the kernel has to fiddle with lots more hardware, such as cache controllers, PCI bus controllers, IOMMUs, etc. However, in all cases, even if you're lucky enough to get a pointer to graphics memory, all accesses must go over a slower external bus.

So now we know what's going on, what conclusion can we draw? Well, reading & writing to external graphics memory is slow. If you're on non-UMA, don't have OpenGL or OpenVG available, but do want to use your blitter, then you'd better make sure you're mostly using QPainter::drawPixmap(). NOTE: Graphics view's cache modes can help you out a lot there - see Andreas' previous posts! ;-) Otherwise falling back to the raster engine is going to be slow. Fortunately, this type of hardware is (finally) on its way out.

NOTE: I should also mention that there's a similar issue with X11. There's no API to get a pointer to an X pixmap and X11 does not provide enough API to implement the whole of QPainter. While the X11 paint engine does not inherit from the raster paint engine, it does make use of software fall-backs which involve copying the pixmap, executing the fall-back and then uploading the result. It's for that reason that we've added the raster graphics system, which uses system memory (via the MIT-SHM extension), in Qt 4.5. On desktop, this is a fairly temporary measure until our OpenGL 2.0 engine & graphics system is in a fit state to take over all of Qt's rendering. No promises, but we hope that can happen for Qt 4.6. For X11 on low-end embedded devices (like the N810), MIT-SHM provides a pretty decent long-term solution.

So, when we look to the future of Qt's graphics architecture and the required paint engines, I think we're well on the way to having all the bases covered:

Type             UMA                Non-UMA
None             Raster             Raster*
Blitter          DirectFB           DirectFB**
2D vector        OpenVG***          OpenVG***
Fixed-3D         OpenGL (ES) 1.x    OpenGL (ES) 1.x
Programmable-3D  OpenGL (ES) 2.x    OpenGL (ES) 2.x****

* When using raster on non-UMA, rendering is actually done in system memory first, then flushed to VRAM
** This is the one which is going to be slow when doing anything other than QPainter::drawPixmap()
*** It shouldn't be a big surprise we're researching an OpenVG paint engine!
**** Qt 4.5 contains a new paint engine for OpenGL ES 2.x which we're now making work on desktop OpenGL 2.0
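For anyone who prefers code to tables, the table above can be written down as a small lookup. This is only a restatement of the table, not how Qt actually picks an engine at run time: Acceleration, Memory, EngineChoice and chooseEngine are hypothetical names made up for this sketch, and the slowFallbacks flag simply marks the one cell footnoted as slow.

```cpp
#include <string>

enum class Acceleration { None, Blitter, Vector2D, Fixed3D, Programmable3D };
enum class Memory { Uma, NonUma };

struct EngineChoice {
    std::string engine;
    bool slowFallbacks;  // true for the cell marked ** in the table
};

// Maps a hardware category and memory architecture to the paint engine
// listed in the table.
inline EngineChoice chooseEngine(Acceleration accel, Memory mem) {
    switch (accel) {
    case Acceleration::None:           return {"Raster", false};
    case Acceleration::Blitter:        return {"DirectFB", mem == Memory::NonUma};
    case Acceleration::Vector2D:       return {"OpenVG", false};
    case Acceleration::Fixed3D:        return {"OpenGL (ES) 1.x", false};
    case Acceleration::Programmable3D: return {"OpenGL (ES) 2.x", false};
    }
    return {"Raster", false};  // not reached for valid enum values
}
```

Only the blitter row depends on the memory architecture: the engine is the same in both columns, but on non-UMA hardware anything other than QPainter::drawPixmap() pays for the raster fall-back.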

I just want to finish by asking you to take another look at the above table. Do you notice anything interesting? All of the graphics systems (apart from DirectFB) are cross platform which means, when we make something faster in one engine, all platforms will benefit.


Comments

boulabiar
185 months ago

it is very nice that Qt is considering accelerating for web.

What lacks now is an acceleration for a full multimedia components.
I mean that QGraphicsScene should accept videos as elements (QGraphicsItem).
This way, Qt will advance WPF from Microsoft in these aspects.

But anyway, dealing with videos means dealing with codecs and their respective related patents and copyrights.

Benjamin
185 months ago

This is great.

Do you have any numbers of the improvement?
How does that affect the memory consumption? Does a transformation on a big layer (let's say the <body> tag) involve caching a pixmap of the complete layer in memory?

No'am Rosenthal
185 months ago

Benjamin:
The numbers for improvement would change per animation. Like I said, for the leaves demo I see the FPS improving by x4 on raster engine.
Re. memory consumption - this is all QGraphicsView cache, which relies on QPixmapCache, which can be tweaked by QPixmapCache::setCacheLimit(...). btw this is true not just for Webkit, but also for graphics-view based UIs that involve complex elements that have to be animated: The QPixmapCache::setCacheLimit(...) function would tweak the balance between RAM usage and paint performance. But yes, without caching that big layer as a pixmap - paint performance degrades because webkit has to re-render it for each frame.

No'am Rosenthal
185 months ago

Note that the leaves demo is rather unusual: the animation is running continuously. The normal use-case for CSS animations is animated transitions between states, which would clean up the layer and thus the image cache after it's done.

Some more numbers using n900 with raster engine: (FPS without accel -> FPS with accel, higher is better)
Translate animation on an HTML table: 10 -> 60
Rotate animation on an element with border-shadow: 7 -> 14
Opacity animation on inline text: 2 -> 18
Scale animation on a medium-size transparent image: 17 -> 24
Scale animation on a large image: 11 -> 18

So acceleration has the potential to give us x2 acceleration for rotation of painted elements, x1.5 for images, x6 for simple HTML elements and x10 for inline text.

Nick Young
185 months ago

Looking good, though I had trouble building the head revision..?

../../../WebCore/platform/graphics/qt/GraphicsLayerQt.cpp:690: error: explicit template specialization cannot have a storage class
../../../WebCore/platform/graphics/qt/GraphicsLayerQt.cpp:700: error: explicit template specialization cannot have a storage class

zchydem
185 months ago

Sounds interesting project. Can you share a video of that demo running on N900?

No'am Rosenthal
185 months ago

@zchydem: to your request, I uploaded a short video:
http://labs.qt.nokia.com/bl...
This runs at about 9 FPS, compared to 2 FPS without compositing. (still not enough, of course).

@Nick: please try again, I think my new commit should fix it. If it doesn't, please let me know which platform/environment you use so I can try to reproduce the problem...

Nick Young
185 months ago

No'am: That commit did the trick =)

The 9FPS on the N900 is not surprising. The CPU usage for that leaves demo is quite high - on this modern Quad Core desktop computer.
The best result I saw was with -graphics-system opengl, and even that seemed too high to expect smooth playback on a small device.

Still, a very good step up. Excellent work =)

aidan
185 months ago

@Nick: the leaves fall rather smoothly on the iphone that I'm posting from =)

Benjamin
185 months ago

@No'am
You should try the demo without compositing with the Qt-maemo branch, usually that doubles the FPS.

@aidan
Don't forget the iPhones have a low-resolution screen (153600 pixels against 384000 for a N900).

Diegol
184 months ago

Great work. Could these changes work in WinCE devices?