On UTF-8, Latin 1 and charsets
March 26, 2011 by Thiago Macieira | Comments
Yesterday, I blogged about my experiments trying to determine the feasibility of replacing the default Latin 1 codec in QString with UTF-8. In fact, the text I had for the blog yesterday was much longer, so I concentrated on the actual code and performance and left the background, rationale and details for today.
Let me quote myself from the introduction yesterday:
But I was left wondering: this is 2011, why are we still restricting ourselves to ASCII? I mean, even if you’re just writing your non-translated messages in English, you sometimes need some non-ASCII codepoints, like the “micro” sign (µ), the degree sign (°), the copyright sign (©) or even the Euro currency sign (€). I specifically added the Euro to the list because, unlike the others, it’s not part of Latin 1, so you need to use another encoding to represent it. Besides, this is 2011, the de-facto encoding for text interchange is UTF-8.
Background: the charsets mandated by the C++ standard
The C and C++ standards talk about two charsets: the source input charset and the execution charset. The GCC manual argues that there are actually four: they add the wide-character charset (which is nowadays always UTF-16 or UCS-4, depending on how wide your wide char is) and the charset that the compiler uses internally. For my purposes here, let's stick to the first two.
The source input charset is the one your source file is encoded in. In the early days of C, when charsets were very different from one another, like EBCDIC, it was very important to get this right, or the compiler wouldn't understand which bytes represented even a space or a newline. Today, one could write a compiler that assumed the input charset is ASCII and still get away with it. The input charset is used by the compiler when it loads your file into memory and translates it into a form that it can parse.
The execution charset is the one that your strings are encoded in when the compiler writes the object files. That is, if you write a word imported into English like "Résumé", the compiler needs to find a way to encode those "é". Note that the compiler has loaded the source file into memory and converted it into some internal format before compiling, so we are assuming here that the compiler has understood that those are LATIN SMALL LETTER E WITH ACUTE. How those "é" were on disk has nothing to do with how this blog is encoded.
The GCC manual says that the default for the input charset is the locale's charset, while the default for the execution charset is UTF-8. That's not exactly true: unless you specify otherwise, GCC will output exactly the same bytes as it found in the input. You can verify this easily by trying to compile a Latin 1-encoded file in a UTF-8 locale. As expected, it works. I guess that changing that would break too many programs, so the GCC developers didn't do it. But if you add either of the -finput-charset= or -fexec-charset= options, even set to the supposedly default values, GCC will bail out if it finds something improperly encoded.
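To make the effect of those flags concrete, here is a small sketch (mine, not from the original experiments) of a file saved as UTF-8, together with the GCC invocations the paragraph above refers to:

// example.cpp, saved as UTF-8; compile with, e.g.
//   g++ -finput-charset=UTF-8 -fexec-charset=UTF-8 example.cpp
//   g++ -finput-charset=UTF-8 -fexec-charset=LATIN1 example.cpp
#include <cstdio>

int main()
{
    const char text[] = "Résumé";                  // the stored bytes depend on -fexec-charset
    std::printf("%d\n", int(sizeof(text)) - 1);    // prints 8 with UTF-8, 6 with Latin 1
    return 0;
}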
About a week or two ago, we had this discussion in the #qt IRC channel on Freenode. One developer wanted to know why QString used Latin 1 instead of the execution charset to decode string literals. He also wanted to know why he couldn't simply write "\u00fc" to mean the "ü" letter. Well, the answer is actually simple and two-fold:
- Qt and QString don't know what execution charset you chose when you compiled your source code
- The execution charset isn't constant: one object file can have a different charset from another object or library
QString today
If you look at how QString really works, you'll see that it has some support for a changeable execution charset. When I say that it defaults to Latin 1, I am implying that it can be changed. In the QString documentation, the functions that take a const char * refer to the QString::fromAscii() function. That function's name is actually a misnomer: it doesn't necessarily convert from ASCII -- in fact, the documentation says "Depending on the codec, it may not accept valid US-ASCII (ANSI X3.4-1986) input."
The function is called fromAscii because most source code today is written in ASCII. This function was actually introduced in Qt 3 (see the docs) and that was released in 2002. Back then, UTF-8 wasn't as widespread as it is today -- I remember switching to UTF-8 on my Linux desktop only in 2003. That meant that any file with non-ASCII bytes had a high chance of being misinterpreted when sent to someone across the world, but a low chance if you sent it to a colleague in the same country.
So small teams developing applications sometimes wanted to use those non-ASCII characters that I listed in the introduction: the degree symbol, the copyright symbol, etc. And to accommodate them, QString allows you to change the codec that it uses to decode the string literals.
In other words, QTextCodec::setCodecForCStrings() allows you to tell Qt what your execution charset is (problem #1 above). There is, however, nothing to help you with problem #2, so libraries have to stick to telling Qt in each function call what codec their strings are in.
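As a minimal sketch of how an application would use that (Qt 4 API; the setting is process-wide and does nothing for libraries built with a different execution charset):

#include <QTextCodec>
#include <QString>

int main()
{
    // declare that narrow string literals in this program are UTF-8
    QTextCodec::setCodecForCStrings(QTextCodec::codecForName("UTF-8"));

    QString s = "Résumé";   // now decoded as UTF-8 instead of Latin 1
    return 0;
}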
Enter C++0x with a (partial) solution: Unicode literals
The next standard of the C++ language, still dubbed C++0x even though we're already in 2011, contains a new way of writing strings that ensures they are always encoded in one of the UTF charsets: the new string literals. So you can write code such as:
u8"I'm a UTF-8 string."
u"This is a UTF-16 string."
U"This is a UTF-32 string."
And on the receiving side, QString will know that the encoding is UTF-8, UTF-16 and UTF-32 respectively, without a doubt. I mean, almost: the UTF-8 encoded string results in a const char[], which is no different from the existing string literals, so QString cannot tell one apart from the other. But the other two generate new types, respectively const char16_t[] and const char32_t[], which we can use in overloads to decode the string perfectly.
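As a sketch of how those distinct types could be used, the helpers below are hypothetical (not part of the Qt API) but show where each kind of literal would land:

#include <QString>

QString fromLiteral(const char *str)        // "..." and u8"..." both come here,
{ return QString::fromUtf8(str); }          // so they cannot be told apart

QString fromLiteral(const char16_t *str)    // u"..." comes here
{ return QString::fromUtf16(reinterpret_cast<const ushort *>(str)); }

QString fromLiteral(const char32_t *str)    // U"..." comes here
{ return QString::fromUcs4(reinterpret_cast<const uint *>(str)); }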
So the developer from IRC could write u"\u00fc" without fear and be assured that QString would decode it as LATIN SMALL LETTER U WITH DIAERESIS (U+00FC).
My criticism of the C++0x committee is that they solved the problem only partially. I want to write u"Résumé" and send my file to a colleague using a different platform (like Windows). Moreover, I'd like his compiler to interpret my source code exactly as I intended. Of course, that means I'm going to encode my source file as UTF-8, so I'd like every single compiler to use UTF-8 as their source input charset.
The C++0x committee did not mandate that, nor did they include a way for me to mark my source file in such a way. The decoding of the source file really depends on the compiler's settings...
My preferred solution
In the absence of being able to tell the compiler what my source code charset is, I'd settle for an efficient way of creating QStrings. Internally, QString stores data as UTF-16 and that is not going to change. So we need to get the compiler to convert the source code literal to UTF-16. Using the new C++0x string literals, we can. And since those strings are in read-only memory that can never be unloaded, we can even do:
QString s = QString::fromRawData(u"Résumé");
Ok, so we can't write "é" as the compiler could mis-interpret it, so we might have to settle for:
QString s = QString::fromRawData(u"Ru00e9sumu00e9");
Which is still a bit too verbose for my taste. Yesterday, in my blog, someone suggested using macros to do the above. But if we use another feature of C++0x, user-defined literals (see also the definition), we could define the following operator:
QString operator "" q(const char16_t *str, size_t len);
Which would allow me to write:
QString s = u"Résumé"q;
which looks weird, but is at least very clean. Unfortunately, the latest release of GCC as of today hasn't implemented it yet.
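For illustration, here is a sketch of what that operator could look like, assuming a compiler that implements user-defined literals (as noted, GCC did not at the time) and assuming that char16_t and QChar have compatible layouts:

#include <QString>
#include <cstddef>

QString operator "" q(const char16_t *str, size_t len)
{
    // The literal lives in read-only storage for the program's lifetime,
    // so fromRawData can reference it without copying or allocating.
    return QString::fromRawData(reinterpret_cast<const QChar *>(str), int(len));
}

// usage: QString s = u"Résumé"q;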
Update: A friend reminds me that Herb Sutter has reported in his blog that the March 2011 meeting of the C++ standards committee approved the Final Draft International Standard for the C++ language. It should be voted on in the summer and become known as C++ 2011.
Comments
Nice post. Very interesting.
Good to see this...
I sure hope C++0x does not see the light of day. As if C++ didn't have a steep learning curve, they want to add so much stuff that makes it much more obfuscated.
I would really appreciate an honest post from one of the trolls on their thoughts on C++0x.
Usually I use QString::fromUtf8("∫xy dx = ∇⃗·z⃗") or something like that, and I hope that there isn't any " or backslash in the UTF-8 representation. :D
@Sekar
C++0x contains some must-have features, like auto, decltype and typeof, move constructors and variadic templates (well, they are not must-have, they do not add more possibilities of abstraction, but they improve compile times significantly; ever tried to compile something with boost or KDE typelists?).
Thanks so much for the informative post. Would the user defined literals solution (using q) allow literals that are outside the latin1 codepage? For example, could I have a Hebrew string encoded in utf8 inside the quotes of your QString s = u"Résumé"q; example? Thanks!
@Sekar, @The User: I might make a post on some C++0x features. Or a series of posts. Sounds like a good idea.
Most of the core language enhancements have direct value for Qt: the user-defined literals and Unicode strings from this blog, but also lambdas, decltype, typeof, rvalue references, rvalue this, type traits, etc. Some other core language enhancements have only marginal benefit, like atomics (Qt already has them, but we can offload to the compiler now) and move semantics (due to reference counting).
The library enhancements are not useful to us, and I don't count <type_traits> as part of the library.
@Joshua: the user defined literal is just an easy way for creating QStrings. Aside from the UTF-16 version, all the others would require memory allocation, so they're not as beneficial. The UTF-16 one knows that the string itself is already in memory so it could use fromRawData.
But the point I was trying to make by saying that it's only got partial support is that, if you write u"Résumé" in your source file and save it as UTF-8, those "é" will be saved as bytes 0xC3 0xA9. If the compiler interprets the source code as UTF-8, we'll have the word 0x00E9 in memory. If it interprets it as Latin 1, we'll have the words 0x00C3 0x00A9 instead. So we're still at the mercy of the compiler's settings...
That's why I'm going to assume that everyone has their compilers set to UTF-8.
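As an illustration of the two interpretations just described (assuming the file stores "é" as the UTF-8 bytes 0xC3 0xA9), this is what ends up in memory:

char16_t as_utf8[]   = { 0x00E9, 0 };          // compiler read the file as UTF-8
char16_t as_latin1[] = { 0x00C3, 0x00A9, 0 };  // compiler read the file as Latin 1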
Thank you for the timely update.
I knew of the user defined literals, and I thought they might be useful for QStrings the moment I saw them described in the standard. :)
I think I read in the latest C++0x committee mailing that there was opposition to the feature, due to some incompatibility with C... But if the standard is at the final draft, that should be settled.
The question that remains to be answered is... when will we be able to use such a feature with Qt? The standard is not yet out, and even when it is, historically it has not been considered safe to use new features for many years...
Now things seem to move a bit faster, but the question remains: how long will it take before such a feature could be used safely in portable Qt programs?
> I want to write u"Résumé" and send my file to a colleague using a different platform (like Windows).
I don't understand: does your colleague use some other Unicode table, specially developed in the bowels of Microsoft? If not, what is the problem with transmitting the string u"Résumé"?
Replace "é" on "胡" and repeat the reasonings on that as the character will be presented if the abstract compiler interprets "胡" character as Latin1.
@Yuriy: my source code is encoded in UTF-8, which is the locale on any modern Unix desktop (Linux, Mac OS X, etc.). Windows, on the other hand, uses CP 1251, 1252, etc. as its 8-bit charset (it calls that "ANSI"). I have no clue how it would interpret my file, but if it doesn't read 0xC3 0xA9 as "é", I'll just blame Microsoft and the committee.
> u8"I'm a UTF-8 string."
> u"This is a UTF-16 string."
> U"This is a UTF-32 string."
Hmm...
The "u8" prefix looks good.
The "u" and "U" prefixes OTOH look really bad. I mean, when seeing u8 it is quite clear what it means, it's 8 bit.
It is completely not clear that "u" means 16 bit and "U" means 32 bit. Add to that that "u" and "U" are quite similar in shape, and it could easily be overlooked if somebody uses "u" somewhere where he should have used "U" instead.
"u16" and "u32" would be much more obvious.
OTOH, u8, u16 and u32 really look like they mean "unsigned int with 8/16/32 bit".
Alex
We have two computers with Win7: one uses cp-1251, the other cp-1252. I write comments in part of the code in Russian (cp-1251) and transfer the file to the other person (with cp-1252). Even if he knows Russian, he can't read them. TWO WINDOWS machines can't agree among themselves.
How will your idea solve this problem? It won't. Even if you "correctly" pack the original cp-1251 characters, there is no way to display them properly on the second computer with cp-1252.
And which compiler does not correctly reproduce UTF-8 string literals written simply as "Hörner!"? I feel the misconceptions about encodings lead to more and more complexity being added to compilers and libraries. The entire point of UTF-8 was that you don't have to add explicit encoding logic to your compiler and tools.
@Frank: I don't know. I have only tested GCC and ICC, and ICC 12 doesn't support Unicode literals yet. If I write wide chars, ICC uses the locale for input, unlike GCC.
The point is that I want to write u"Hörner!" and have that be the same as:
char16_t str[] = { 0x0048, 0x00f6, 0x0072, 0x006e, 0x0065, 0x0072, 0x0021, 0 };
That means that it needs to interpret the bytes 0xC3 0xB6 as the U+00F6 codepoint. So what I'm saying is that I will encode my files as UTF-8 and will expect every single compiler to interpret my files as such. And I'll consider broken any compiler that doesn't by default.
@Alex: Thiago's the wrong one to argue with about that; the right people would have been the C++ standards committee, and per the end of Thiago's post, you're a little late.
@Yuriy: In Thiago's defense here, GCC, LLVM, ICC, MSVC (2005 or newer), XCode, Sun Studio, and even Borland (which won't build Qt anyway, be quiet Thiago...) all handle UTF-8 with "BOM", and all of them except possibly ICC assume UTF-8 by default if there is no "BOM". (I believe ICC and Sun Studio default to the system encoding, but you can tell them on the command-line to use UTF-8.) All of them except ICC support UTF-16 source files as well, and GCC (and possibly others) even support UTF-32. As far as compilers that we likely care about for Qt, that leaves GCCE, and I can find neither diddly nor squat as to what it supports for source encodings.
Unicode string literals are a feature of the EDG frontend, but it is undocumented in the Intel C++ compiler (the /Qoption,cpp,"--uliterals" option enables it).
u8"", u"", U"" - GCC-4.4+, ICC-11+, Borland-2009+.
Others - nothing yet.
The brave choice would be to switch to UTF-8 entirely (also for the QString internals). I did some tradeoff computations some time ago: https://github.com/unclefra... .
@Frank: I disagree. My own benchmarks show that UTF-16 is a lot easier to handle and more efficient too. Converting from UTF-16 to UCS-4 is quite fast, much faster than converting from UTF-8.
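Purely as an illustration of why the UTF-16 to UCS-4 direction is cheap (this is not the benchmark code, just a sketch of the per-character work, assuming well-formed UTF-16 input):

// decode one UCS-4 code point from UTF-16, advancing the source pointer
unsigned int decodeOne(const unsigned short *&src)
{
    unsigned int ucs4 = *src++;
    if (ucs4 >= 0xD800 && ucs4 <= 0xDBFF)    // high surrogate: combine with the low surrogate
        ucs4 = 0x10000 + ((ucs4 - 0xD800) << 10) + (*src++ - 0xDC00);
    return ucs4;                             // UTF-8 decoding must examine up to four bytes instead
}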
This year in my university we used the Bloodshed Dev-C++ compiler for a subject's project. To my surprise it didn't support UTF8.
This caused a lot of headaches, as we aren't in an English-speaking country and the subject let students work in a multiplatform environment (Windows with Dev-C++ or Linux with gcc).
It still surprises me how in 2011 there are still people using compilers which don't support UTF8. Hopefully this will change some day.
> This year in my university we used the Bloodshed Dev-C++
Are you talking about the "Bloodshed Dev-C++ 5 (currently beta)" IDE, which has been dead since 2003? :-?
> the subject let students work in a multiplatform environment
I have used Qt Creator, in Windows and Linux, to work with standard C++ and with Qt, with good results.
Hi. This is interesting. Let me share some experiences:
In recent years I have come to think UTF-8 is the best encoding for source code containing Unicode strings. All actual code remains ASCII so the compiler understands it as always, and there is freedom to write string literals in any language. With Qt that can be done, but I find things like QString price = QString::fromUtf8("One €"); too "cluttered"; you end up with a lot of distracting code. It would be cleaner to just write QString price = "One €";.
In my own old and outdated little utility library that I sometimes use for experimenting, strings were originally 8-bit in whatever the local encoding was. I wanted to partially support Unicode in order to support Windows in Unicode without breaking ANSI mode. UTF-8 has become almost the de facto encoding for Unicode (XML, the Web, the Internet, Linux...), which I'm glad about.
So I decided to make UTF-8 the default source encoding for code using my library, and to use UTF-8 internally as well (yes, in memory). I wasn't sure about this last part, as most others use UTF-16 internally (Qt, Java, Win32, ...), but I tried it. The problem is that in Windows, all Win32 calls needing strings require UTF-16. So I made my strings automatically convert to UTF-16 on the fly when making such calls, and convert from UTF-16 when receiving strings back from the API. It turned out to be much faster than I expected. It can do things like:
String dirname = "Ñandú-€ Ελληνικά Эрзянь"; // all source in UTF-8
CreateDirectoryW(dirname, 0);
Programs intended to be compiled in ANSI mode will use ANSI encoding in the source and the rest work as expected.
The compiler does not know about UTF-8, it expects 8 bit characters in the system's encoding and so leaves them byte-by-byte untouched in memory. The program must know this and handle the literals correctly of course.
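For readers curious about the conversion step described above, a minimal stand-alone sketch using the Win32 API directly (the String class from the comment is not shown; the helper name is made up for illustration):

#include <windows.h>
#include <string>

std::wstring toUtf16(const char *utf8)
{
    // first call computes the required length (including the terminator),
    // second call performs the actual UTF-8 -> UTF-16 conversion
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
    std::wstring result(len > 0 ? len - 1 : 0, L'\0');
    if (len > 1)
        MultiByteToWideChar(CP_UTF8, 0, utf8, -1, &result[0], len);
    return result;
}

// usage, mirroring the example above:
// CreateDirectoryW(toUtf16("Ñandú-€ Ελληνικά Эрзянь").c_str(), 0);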
BTW, the D programming language uses UTF-8 in source code and internally, converts to UTF-16 when necessary. Yes, D forces you to write your source in Unicode UTF-8. I like that decision. I think it should be the default for all. And without an ugly u8"text" prefix.
Qt interprets literals as only ASCII by default when converting implicitly. That's safe. But I'd move to a default interpretation of UTF-8 in such cases in the future.
PS: don't forget to chcp 65001 your Windows console for proper UTF-8 handling :-)
This whole article made me understand the value and drawbacks of UTF-8. Until now it presented a huge problem for me, but now it's all clear. So thanks!