Which is the best LLM for coding QML?
January 28, 2025 by Peter Schneider
LLMs differ in their QML coding skills. This post looks at which LLM is the best choice to assist developers.
We at Qt have made it our mission to support our customers in the entire product development lifecycle. In today's world of Generative AI, we need to enable our customers to connect to an LLM of their choice. The Qt AI Assistant was designed with this flexibility in mind. Furthermore, we want to help our customers choose a suitable LLM for QML programming. Therefore, we have created a variety of coding benchmarks to measure the QML coding performance of LLMs.
Developing the QML100FIM Benchmark
You might be aware of HumanEval, a commonly used coding benchmark developed by OpenAI. HumanEval consists only of Python programming challenges. Other code generation benchmarks, such as MBPP, do not help much in measuring the quality of generated QML code either. Hence, we put together our own benchmarks to compare QML coding skills. To be precise, we created several: one for measuring the ability to write code based on a human-language prompt, the QML100 Benchmark, and one for scoring code generated with the Fill-In-the-Middle method, the QML100FIM Benchmark. Let's take a closer look at the QML100FIM Benchmark.
What does the QML100FIM measure?
QML100FIM is intended to measure code completion using the Fill-In-the-Middle (FIM) method. FIM means that the LLM receives a code generation request with a prefix and a suffix. The prefix consists of the code before the cursor position in the code editor; the suffix contains the code after the cursor. The job of the LLM is to guess what code fits best in the middle, between the prefix and the suffix. A suffix is not mandatory because the developer might be at the end of the code. Hence, every QML100FIM task has a prefix, and some also have a suffix.
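To illustrate the idea, here is a minimal, hypothetical FIM task in QML (the control and the values are ours, not taken from the benchmark). The prefix ends inside the Slider object, the suffix closes the remaining brackets, and the LLM is asked to fill the gap:

```qml
// --- Prefix: code before the cursor ---
import QtQuick
import QtQuick.Controls

ApplicationWindow {
    visible: true
    width: 400
    height: 300

    Slider {
        from: 0
        // --- Middle: the LLM is expected to suggest something like ---
        // to: 100
        // value: 50
        // --- Suffix: code after the cursor ---
    }
}
```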
HumanEval has 164 coding tasks. We aimed at 100 coding tasks. Why fewer than one of the most common benchmarks? While it is easy to automate HumanEval testing, which consists of tasks with a clearly correct or wrong response, scoring code for UI controls is not straightforward. QML code might compile and run without errors yet still make no sense. For example, a MouseArea over a TextField might block the TextField from its primary function while still enabling the recognition of a hover event (sketched below). Hence, we ultimately review the code generation results with the QML linter and manually.
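Here is a minimal sketch of that failure mode (the ids and sizes are ours, purely for illustration). The snippet loads and runs, but the MouseArea sits on top of the TextField and swallows the clicks the field needs to gain focus:

```qml
import QtQuick
import QtQuick.Controls

Item {
    width: 200
    height: 40

    TextField {
        id: nameField
        anchors.fill: parent
        placeholderText: qsTr("Name")
    }

    // Runs without errors, yet it intercepts the clicks nameField
    // needs to receive focus -- the field becomes unusable,
    // while hover events are still recognized.
    MouseArea {
        anchors.fill: parent
        hoverEnabled: true
        onEntered: console.log("hovered")
    }
}
```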
What kind of coding tasks does QML100FIM include?
QML100FIM contains coding tasks that cover two main aspects. First, it covers a wide variety of Qt Quick Controls, such as buttons, sliders, spin boxes, input fields, mouse interactions, signals, methods, animations, validators, dialogs, and so on. Second, it includes a variety of tasks representing common coding challenges, such as adding a tooltip when hovering (sketched below) or marking a text input as mandatory.
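As a rough illustration of the tooltip-on-hover challenge (the button text and delay are ours, not the benchmark's), the expected completion attaches a ToolTip to the hovered control:

```qml
import QtQuick
import QtQuick.Controls

Button {
    text: qsTr("Save")
    hoverEnabled: true

    // Expected completion: show a tooltip while the button is hovered
    ToolTip.visible: hovered
    ToolTip.delay: 500
    ToolTip.text: qsTr("Saves the current document")
}
```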
The tasks challenge the LLM to complete unfinished words, lines, functions, and objects. Some tasks require the LLM to understand a comment in the prefix (// Capitalizes the first letter of each word). Others check its ability to follow common logic: a second text input field following one that asks for the first name should ask for the surname, or the LLM needs to draw a path for the right side of a heart shape. Further tasks ask it to add the correct brackets after an object or to insert a missing QML library import.
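A comment-driven task of that kind might look like the following sketch (the function name and implementation are ours, not the benchmark's wording). The prefix ends after the comment, and the LLM has to supply the function body:

```qml
import QtQuick

Item {
    // Capitalizes the first letter of each word
    function capitalize(text) {
        // Expected completion, roughly:
        return text.split(" ")
                   .map(word => word.charAt(0).toUpperCase() + word.slice(1))
                   .join(" ")
    }
}
```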
QML100FIM currently focuses on creating 2D UIs. The tested QML libraries include QtQuick, Controls, Dialogs, Layouts, Shapes, Graphs, and Effects. The Multimedia, PDF, Location, Positioning, and Quick3D libraries are not tested.
What is considered a successful code completion?
We tested the generated code for compliance with the Qt 6.8 release in Qt Creator. It must not contain any errors identified by qmllint, and it must make sense. It may, however, fail to add the closing brackets that complete the object or the QML app, something all LLMs struggle with, some more, some less.
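As one made-up example of the kind of issue qmllint reports: accessing a property of an outer object without qualifying it through its id triggers qmllint's unqualified-access warning, so a completion like the following would be penalized:

```qml
import QtQuick

Rectangle {
    id: root
    property int contentSpacing: 8

    Text {
        text: "Hello"
        // qmllint warns about unqualified access here;
        // the completion should write "root.contentSpacing" instead.
        x: contentSpacing
    }
}
```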
Which LLMs have been benchmarked so far?
Not every large language model supports FIM code generation; both the pre-training data and the model architecture need to support it. For example, while Llama 3.3 70B can write QML code nicely based on an English-language prompt, it does not support FIM code completion.
In our first selection of LLMs to compare, we wanted to cover the most common models: those offered commercially on a cloud subscription (OpenAI and Anthropic) and royalty-free models that can be run in a private cloud or locally (Meta CodeLlama 13B). We also decided to test end-to-end coding assistants such as Cursor and GitHub Copilot, even though we are unsure which LLMs are generating the code. There is still more to be done to compare more locally running models and some of the newest contestants, such as DeepSeek Coder V2, but that's for another time.
What is the best LLM for QML code completion?
The best LLM for QML code completion is Anthropic’s Claude 3.5 Sonnet. The best royalty-free LLM for self-hosting is CodeLlama 13B.
Here are the results for FIM code completion:
Noteworthy is the performance of Cursor's QML code completion. For this benchmark, we deactivated all LLMs other than the Cursor Small LLM in Cursor's LLM settings. Surprisingly, the benchmark score is identical to the Claude 3.5 Sonnet score. Is Cursor routing your code (and your IPR) to Anthropic even when you have explicitly configured it not to? We don't know. At least, they retain the right to do so in their security documentation.
Conclusion
Selecting an LLM for your business is never easy. There are questions about the deployment model (local, third-party cloud, private cloud). There are considerations of IPR protection and fine-tuning opportunities. There are considerations of inference time (how long it takes to get a code suggestion). Ultimately, the coding performance should always play an important role in your decision.
The differences in QML code completion performance among the bigger models we tested are not huge. Maybe that's not so surprising, given that everybody works with similar training data. For QML programming, if you want a commercial third-party cloud service, then Claude 3.5 Sonnet with the Qt AI Assistant is currently our recommendation. If you want to run locally or in a private cloud, without paying for each token and without sharing your code, then CodeLlama 13B is a good choice.
Coming soon: Which LLM is best for QML code generated with a human language prompt? Spoiler alert: The performance differences are bigger for prompt-based code generation than for FIM code completion.