Fast-Booting Qt Devices, Part 3: Optimizing System Image

It is now time for the third part of the fast-boot blog post series. In the first post we showed a cluster demo booting in 1.56 seconds, and in the second post we explained how the Qt application was optimized. Here, we will concentrate on optimizing the boot loader and kernel for the NXP i.MX6 SABRE development board that was used in the demo.

Before starting the boot time optimization, I measured the unoptimized boot time, which was a whopping 22.8 seconds. After measuring, we set our goal: get the boot time under 2 seconds. With the goal set, I started the optimization from the most obvious place: the root-fs. Our existing root-fs contained a lot of stuff that was not required for the startup demo, so I stripped the whole root-fs down from 500 MB to 24 MB. I used buildroot to create a bare minimal root-fs for our device, along with a cross-compile toolchain.

After switching to the smaller root-fs, I measured the startup time again: it was now 15.6 seconds. Of these 15.6 seconds, kernel startup took around 6 seconds, and the U-Boot bootloader and the unmodified application took the rest. Next, I concentrated on the kernel. As I already knew the functionality required by the application, I could easily strip the kernel down from 5.5 MB to 1.6 MB by removing nearly everything that was not required. This got the boot time to 9.26 seconds, of which kernel startup took 1.9 seconds.
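The post does not say how the kernel's share of the boot time was measured. One common approach, assuming the kernel is built with printk timestamps (CONFIG_PRINTK_TIME) and the boot log is available, is to look for the largest gaps between consecutive log lines. The sketch below does that; the helper name `slowest_steps` and the sample log lines are hypothetical and only illustrate the idea.

```python
import re

# Matches the "[    1.234567] message" prefix that the kernel adds
# when printk timestamps are enabled.
STAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s?(.*)$")

def slowest_steps(log_text, top=3):
    """Return the `top` largest gaps between consecutive log lines.

    Each result is (gap_seconds, message_before_gap, message_after_gap).
    """
    entries = []
    for line in log_text.splitlines():
        match = STAMP.match(line)
        if match:
            entries.append((float(match.group(1)), match.group(2)))
    gaps = [
        (round(cur_t - prev_t, 6), prev_msg, cur_msg)
        for (prev_t, prev_msg), (cur_t, cur_msg) in zip(entries, entries[1:])
    ]
    return sorted(gaps, reverse=True)[:top]

# Hypothetical log excerpt, not taken from the demo device.
SAMPLE = """\
[    0.000000] Booting Linux
[    0.250000] mmc0: new card
[    1.750000] VFS: Mounted root
[    1.900000] Freeing unused kernel memory
"""

for gap, before, after in slowest_steps(SAMPLE):
    print(f"{gap:.3f} s between '{before}' and '{after}'")
```

A large gap only tells you where to look; the time may be spent in whatever runs between the two messages rather than in the code that printed either of them.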

At this point we still had not touched U-Boot at all, meaning it had the default 1 second wait time and the integrity check of the kernel in place. So U-Boot was the next obvious target. U-Boot includes a special framework called the Secondary Program Loader (SPL), which is capable of booting another U-Boot or a specially configured kernel. I enabled the SPL mode, modified my kernel to include the command line arguments, and appended my device tree to the kernel. I also stripped the device tree down from 47 KB to 14 KB and disabled the console. Boot time dropped to 3.42 seconds, of which the kernel took 0.61 seconds and U-Boot plus the application took the rest.

Now that the basic system (U-Boot and kernel) was booting in a decent time, I optimized our cluster application. The application startup was changed to load the cluster frame first, then animate in the gauges, and finally the 3D car model, as described in our previous post. Boot time was still quite far from the 2 second target, so I did a more detailed analysis of the system. I had been using a class 4 SD card, which I changed to a class 10 card.

My Qt libraries were still shared libraries, so I compiled Qt as static libraries and recompiled our cluster demo against the static version of Qt. This also allowed me to remove the shared libraries from the root-fs. Static linking makes application startup faster because the operating system does not need to resolve the symbols of dynamic libraries. With static linking I was able to get the cluster application into one binary with a size of 19 MB. This includes all the assets (3D model, images, fonts) and all the Qt libraries required by the demo. I had actually forgotten to use the proper optimization flags for my Qt build, so I set optimization for size and removed fpic; as a result, the executable size was reduced to 15 MB. I also noticed that having the root-fs on the eMMC was faster than having it on the SD card.

However, having U-Boot and the kernel image on the SD card was faster than having both on the eMMC, so I ended up with a slightly odd combination where the CPU loads U-Boot and the kernel from the SD card and the kernel uses the root-fs from the eMMC. The kernel was still packed with gzip. After testing UPX, LZO and LZ4, I changed the packing algorithm to LZO, which was the fastest on my hardware. Depending on your hardware, you might want to test other algorithms or no packing at all. After changing the packing algorithm and removing the serial console, the kernel image size dropped to 1.3 MB. With these changes the boot time was reduced to 1.94 seconds.
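Why the best packing algorithm depends on the hardware can be shown with a rough model: the time until the kernel is ready is the time to read the image from storage plus the time to unpack it. The sketch below is only an illustration. Every size and throughput figure in it is a made-up placeholder, not a measurement from the demo board, and the function names are hypothetical.

```python
# Model: time until the kernel is ready in RAM = read time + unpack time.
# Every number below is a made-up placeholder, not a measurement.

UNPACKED_KB = 3000.0  # hypothetical size of the uncompressed kernel

# name -> (size on storage in KB, unpack speed in KB of output per ms, or None)
CANDIDATES = {
    "none": (3000.0, None),
    "gzip": (1300.0, 40.0),
    "lzo": (1500.0, 120.0),
}

def load_time_ms(name, read_kb_per_ms):
    """Estimated load time for one candidate at a given storage read speed."""
    stored_kb, unpack_kb_per_ms = CANDIDATES[name]
    read_ms = stored_kb / read_kb_per_ms
    unpack_ms = 0.0 if unpack_kb_per_ms is None else UNPACKED_KB / unpack_kb_per_ms
    return read_ms + unpack_ms

def fastest(read_kb_per_ms):
    """Name of the candidate with the lowest estimated load time."""
    return min(CANDIDATES, key=lambda name: load_time_ms(name, read_kb_per_ms))

print(fastest(10.0))   # slow storage
print(fastest(100.0))  # fast storage
```

With these placeholder figures, the slow-storage case favors "lzo" and the fast-storage case favors "none". The point is only that the winner flips with the ratio of storage speed to CPU speed, so the candidates have to be measured on the real device.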

If this were production software, there would still be work to do in the area of memory configuration. U-Boot should be debugged to understand why it takes more time to power up and load the kernel image from the eMMC than from the SD card. In general, if quick startup time is a key requirement, the hardware should be designed accordingly. You could have a small, very fast flash containing U-Boot and the kernel, directly accessed by the CPU, and keep the root-fs on a somewhat slower flash such as eMMC.

Even though I had succeeded in getting under 2 seconds, I still wondered if I could make it faster. I stripped the kernel down a little more by removing the network stack, ending up with a 1.2 MB kernel with appended device tree. I also ran prelinking on my root-fs, because the Vivante drivers come as modules and I was therefore not able to create a static root-fs. I also stripped down the U-Boot SPL part a bit: initially it was 31 KB, and after removing unwanted parts I ended up with a 23 KB boot loader. With these final changes I was able to get the system to boot in 1.56 seconds.

As a wrap-up, here is how the boot time was reduced by different means.

[Chart: boot time reduction by optimization step]
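The measurements quoted in this post can also be tabulated with a few lines of Python. The boot times are the ones given in the text above; the step labels are my own grouping of the changes made between two consecutive measurements, so treat them as an approximation rather than the author's breakdown.

```python
# (changes made before the measurement, total boot time in seconds)
# Times are quoted from the post; the labels are an approximate grouping.
STEPS = [
    ("unoptimized image", 22.8),
    ("minimal root-fs built with buildroot", 15.6),
    ("stripped-down kernel", 9.26),
    ("U-Boot SPL, appended device tree, console disabled", 3.42),
    ("app startup order, class 10 SD card, static Qt, root-fs on eMMC, LZO", 1.94),
    ("smaller kernel and SPL, prelinking", 1.56),
]

def savings(steps):
    """Seconds saved between each measurement and the previous one."""
    return [
        (name, round(before - after, 2))
        for (_, before), (name, after) in zip(steps, steps[1:])
    ]

for name, saved in savings(STEPS):
    print(f"{saved:5.2f} s saved by: {name}")
```

The largest savings come from the first, coarse steps, which matches the advice below to optimize the easy parts first.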

The last thing that affects boot time is hardware selection. Boards differ in how fast they power up, even when they use the exact same CPU. Perhaps more about this later.

Do:

  • Measure and analyze where time is spent
  • Set a target as early as possible
  • Try to reach the goal early in development, then maintain that level throughout development
  • Take the startup targets into account when designing your software architecture
  • Optimize easy parts first, then continue to the details
  • Leverage static linking if it gives better results in your SW & HW configuration
  • Take your hardware limitations into account; preferably, design the hardware to allow a fast boot time

Do not:

  • Overestimate the performance of your selected hardware. i.MX28 will not give you iPad-like performance.
  • Complicate your software architecture. Simpler architecture runs faster.
  • Load things that are not necessary. Pre-built images contain features for many use cases, so optimization is typically needed.
  • Leave optimization to the end of the project
  • Underestimate the effort required for optimizing the very last milliseconds

So, that concludes our fast-boot blog post series. In these three posts, I showed you that Qt really is up to the task: it is possible to make Qt-powered devices boot extremely fast to meet industry criteria. It's actually quite manageable when you know what you're doing, but instead of one silver bullet, it's a combination of multiple things: good architectural SW design, a bunch of Qt Quick tips'n'tricks, suitable hardware and a lot of system image optimization. Thank you for following!

 



Comments

Ho
108 months ago

Another image optimization:
Your personal image in a small page like this is about 6.2 MB, so I suggest optimizing this to 6.2 KB in the first place. A thousand times better. :-)

Risto Avila
108 months ago

Thanks! I reduced the size to around 300k, which should be at least a lot better than the original :)

Ho
108 months ago

Yeah, it's much better now. :-)
I mentioned this because I think there are several issues remaining in the new qt.io website. The website is mostly mobile-friendly, but it is not well suited to the desktop. For example, on the main page, on my 1280 pixel wide display, the menu simply hides, although there is more than enough space available to show the full menu horizontally. You have to scroll down to the bottom of the main page to see what is new about Qt. That is fine for a first-time visitor, but not so much for a returning customer, developer, etc. who comes back to see what's new. The hidden menu is also a problem here.

I think the website needs more polishing, including a thumbnailer for www.qt.io/blog that turns your big picture into a small 90x90 pixel avatar to show beside your posts. As you are using WordPress, such a thing may already be available, and your WordPress theme designer should use it.

Some old pictures from labs.trolltech.com, qt-project.com and nokia.com can be relinked to the new website, and stuff like this. See this old post as sample:
http://www.qt.io/blog/2007/...
which contains this:
http://labs.trolltech.com/b...
There are WordPress plugins that help you keep track of broken links and eventually fix them by changing the queries appropriately. See this for example:
https://wordpress.org/plugi...

Lioric
108 months ago

Depending on the platform, you can get away without using u-boot (and just use the first stage stripped down version, X-Loader) and boot the kernel directly.

As your article points out, avoiding the kernel CRC check is a big time saver.

There is no need to strip console logging (printk) from the kernel build. Just pass the "quiet" kernel command line argument to switch it off at boot time, and you still have a boot log when needed (log messages are kept in the ring buffer in kernel space, so it is really fast and doesn't affect boot time).

Risto Avila
108 months ago

Sure, or you could even use arm-kernel-shim if you do not like U-Boot. How many ms you save with arm-kernel-shim compared to U-Boot SPL, I can't say.
Sure, printk could be there, but you save quite a lot (in size) by disabling the serial console altogether. In my demo there is no way to connect to the device, so in my case we do not need printk at all.

Lioric
108 months ago

Additionally, hopefully the Quick precompiler provides that much-needed initialization optimization.

Our target boots in 1.2s, but then parsing and loading the complete initial UI (and its services, just the bare minimum to provide the initial system UI) takes an extra 1.5 to 2.x secs, defeating the whole "to the last drop" boot optimization, so we rely on several tricks to hide this from the user.

Probably we will try the Novomok Qml compiler, as the Quick Compiler when released in Qt 5.8 will need (in our case) a complete update including the toolchain

Risto Avila
108 months ago

1.5 to 2.x seconds extra from starting the Qt application sounds like quite a lot. We also offer consultancy: if you want our experts to take a look at your Qt application architecture and debug it further, you can contact us or one of our partners, see https://www.qt.io/services/

Ho
108 months ago

Why not use an uncompressed kernel? It can save several milliseconds, depending on how fast the CPU is.
See this:
http://free-electrons.com/d...
http://elinux.org/Boot_Time

Risto Avila
108 months ago

I'm also using a dtb appended to the zImage, so I need the compression. The reason for appending the dtb is that I can read the kernel + dtb in one go instead of doing two separate reads from the SD card. Compression vs. no compression depends on whether you are limited by the CPU or by the bus bandwidth; in our case the bottleneck seems to be the SD card.

Ho
108 months ago

The fastest time to access Qt I've seen was in Genode. It was in the blink of an eye! Check the live ISO:
https://genode.org/download...
I don't know how this L4 microkernel os works, but it is interesting.

Alex
108 months ago

I'm very curious as to how that cluster application is made. I have no idea how to feasibly develop such a complex UI with Qt even using widgets, let alone QML. Perhaps a tutorial on a similar (but, of course, simpler) UI?

Also, what are the advantages of using QML over Qt widgets?

Risto Avila
108 months ago

The UI itself is not that complex. It contains a few images, a 3D model and some OpenGL shaders to color the moving parts. If you are interested in QML, I suggest you check out the examples that come with Qt, or if you are a book person you can take a look at https://qmlbook.github.io.

Whether to use QML or Widgets really depends on what you are doing. For example, I would use Widgets for desktop applications and QML for games, mobile applications and embedded Linux applications.

Alex
108 months ago

Thanks! I have seen some examples (and will study more), but I have not seen that book.

Himanshu Patel
108 months ago

Why not use an uncompressed kernel?