Which is the best LLM for coding QML?
January 28, 2025 by Peter Schneider
LLMs differ in their QML coding skills. This post looks at which LLM is the best choice to assist developers.
We at Qt have made it our mission to support our customers in the entire product development lifecycle. In today's world of Generative AI, we need to enable our customers to connect to an LLM of their choice. The Qt AI Assistant was designed with this flexibility in mind. Furthermore, we want to help our customers choose a suitable LLM for QML programming. Therefore, we have created a variety of coding benchmarks to measure the QML coding performance of LLMs.
Developing the QML100FIM Benchmark
You might be aware of HumanEval, a commonly used coding benchmark developed by OpenAI. HumanEval consists only of Python programming challenges. Other code generation benchmarks, such as MBPP, do not help much in measuring the quality of generated QML code either. Hence, we put together our own benchmarks to compare QML coding skills. To be precise, we created several: one for measuring the ability to write code based on a human-language prompt, the QML100 Benchmark, and one for scoring code generated with the Fill-In-the-Middle method, the QML100FIM Benchmark. Let's take a closer look at the QML100FIM Benchmark.
What does the QML100FIM measure?
QML100FIM is intended to measure code completion using the Fill-In-the-Middle (FIM) method. FIM means that the LLM receives a code generation request with a prefix and a suffix. The prefix consists of the code before the cursor position in the code editor; the suffix contains the code after the cursor. The job of the LLM is to guess what code fits best in the middle, between the prefix and the suffix. A suffix is not mandatory because the developer might be at the end of the code. Hence, every QML100FIM task has a prefix, and some also have a suffix.
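To illustrate the idea, here is a minimal, hypothetical FIM task in QML (the control and the values are ours, not taken from the benchmark). The prefix ends inside the Slider object, the suffix closes the remaining brackets, and the LLM is asked to fill the gap:

```qml
// --- Prefix: code before the cursor ---
import QtQuick
import QtQuick.Controls

ApplicationWindow {
    visible: true
    width: 400
    height: 300

    Slider {
        from: 0
        // --- Middle: the LLM is expected to suggest something like ---
        // to: 100
        // value: 50
        // --- Suffix: code after the cursor ---
    }
}
```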
HumanEval has 164 coding tasks. We aimed at 100 coding tasks. Why fewer than one of the most common benchmarks? While it is easy to automate HumanEval testing, which consists of tasks with a clearly correct or wrong response, scoring code for UI controls is not straightforward. QML code might compile and run without errors yet still make no sense. For example, a MouseArea over a TextField might block the TextField from its primary function while still enabling the recognition of a hover event (sketched below). Hence, we ultimately review the code generation results with the QML linter and manually.
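Here is a minimal sketch of that failure mode (the ids and sizes are ours, purely for illustration). The snippet loads and runs, but the MouseArea sits on top of the TextField and swallows the clicks the field needs to gain focus:

```qml
import QtQuick
import QtQuick.Controls

Item {
    width: 200
    height: 40

    TextField {
        id: nameField
        anchors.fill: parent
        placeholderText: qsTr("Name")
    }

    // Runs without errors, yet it intercepts the clicks nameField
    // needs to receive focus -- the field becomes unusable,
    // while hover events are still recognized.
    MouseArea {
        anchors.fill: parent
        hoverEnabled: true
        onEntered: console.log("hovered")
    }
}
```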
What kind of coding tasks does QML100FIM include?
QML100FIM contains coding tasks that cover two main aspects. First, it covers a wide variety of Qt Quick Controls, such as buttons, sliders, spin boxes, input fields, mouse interactions, signals, methods, animations, validators, dialogs, and so on. Second, it includes a variety of tasks representing common coding challenges, such as adding a tooltip when hovering (sketched below) or marking a text input as mandatory.
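As a rough illustration of the tooltip-on-hover challenge (the button text and delay are ours, not the benchmark's), the expected completion attaches a ToolTip to the hovered control:

```qml
import QtQuick
import QtQuick.Controls

Button {
    text: qsTr("Save")
    hoverEnabled: true

    // Expected completion: show a tooltip while the button is hovered
    ToolTip.visible: hovered
    ToolTip.delay: 500
    ToolTip.text: qsTr("Saves the current document")
}
```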
The tasks challenge the LLM to complete unfinished words, lines, functions, and objects. Some tasks require the LLM to understand a comment in the prefix (// Capitalizes the first letter of each word). Others check its ability to follow common logic: a second text input field following one that asks for the first name should ask for the surname, or the LLM needs to draw a path for the right side of a heart shape. Further tasks ask it to add the correct brackets after an object or to insert a missing QML library import.
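A comment-driven task of that kind might look like the following sketch (the function name and implementation are ours, not the benchmark's wording). The prefix ends after the comment, and the LLM has to supply the function body:

```qml
import QtQuick

Item {
    // Capitalizes the first letter of each word
    function capitalize(text) {
        // Expected completion, roughly:
        return text.split(" ")
                   .map(word => word.charAt(0).toUpperCase() + word.slice(1))
                   .join(" ")
    }
}
```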
QML100FIM currently focuses on creating 2D UIs. The tested QML libraries include QtQuick, Controls, Dialogs, Layouts, Shapes, Graphs, and Effects. The Multimedia, PDF, Location, Positioning, and Quick3D libraries are not tested.
What is considered a successful code completion?
We tested the generated code for compliance with the Qt 6.8 release in Qt Creator. It must not contain any errors identified by qmllint, and it must make sense. It may, however, fail to add the closing brackets that complete the object or the QML app, something all LLMs struggle with, some more, some less.
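As one made-up example of the kind of issue qmllint reports: accessing a property of an outer object without qualifying it through its id triggers qmllint's unqualified-access warning, so a completion like the following would be penalized:

```qml
import QtQuick

Rectangle {
    id: root
    property int contentSpacing: 8

    Text {
        text: "Hello"
        // qmllint warns about unqualified access here;
        // the completion should write "root.contentSpacing" instead.
        x: contentSpacing
    }
}
```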
Which LLMs have been benchmarked so far?
Not every large language model supports FIM code generation; both the pre-training data and the model architecture need to support it. For example, while Llama 3.3 70B can write QML code nicely based on an English-language prompt, it does not support FIM code completion.
In our first selection of LLMs to compare, we wanted to cover the most common models: those offered commercially on a cloud subscription (OpenAI and Anthropic) and royalty-free models that can be run in a private cloud or locally (Meta CodeLlama 13B). We also decided to test end-to-end coding assistants such as Cursor and GitHub Copilot, even though we are unsure which LLMs are generating the code. There is still more to be done to compare more locally running models and some of the newest contestants, such as DeepSeek Coder V2, but that's for another time.
What is the best LLM for QML code completion?
The best LLM for QML code completion is Anthropic’s Claude 3.5 Sonnet. The best royalty-free LLM for self-hosting is CodeLlama 13B.
Here are the results for FIM code completion:
Noteworthy is the performance of Cursor's QML code completion. For this benchmark, we deactivated all LLMs other than the Cursor Small LLM in Cursor's LLM settings. Surprisingly, the benchmark score is identical to the Claude 3.5 Sonnet score. Is Cursor routing your code (and your IPR) to Anthropic even when you have explicitly configured it not to? We don't know. At least, they retain the right to do so in their security documentation.
Conclusion
Selecting an LLM for your business is never easy. There are questions about the deployment model (local, third-party cloud, private cloud). There are considerations of IPR protection and fine-tuning opportunities. There are considerations of inference time (how long it takes to get a code suggestion). Ultimately, the coding performance should always play an important role in your decision.
The differences in QML code completion performance among the bigger models we tested are not huge. Maybe that's not so surprising, given that everybody works with similar training data. For QML programming, if you want a commercial third-party cloud service, then Claude 3.5 Sonnet with the Qt AI Assistant is currently our recommendation. If you want to run locally or in a private cloud, without paying for each token and without sharing your code, then CodeLlama 13B is a good choice.
Coming soon: Which LLM is best for QML code generated with a human language prompt? Spoiler alert: The performance differences are bigger for prompt-based code generation than for FIM code completion.