Re: UTF-8, UTF-16 and UTF-32

Dallas Clarke wrote:
This was never a debate about the merits of UTF-8 versus UTF-16 versus UTF-32, nor about the merits of Linux/Solaris versus Windows. It was a discussion about the lack of standardized support for UTF-16 in gcc: with the move by Microsoft to deprecate UTF-8 in favour of UTF-16, life will become increasingly difficult for all gcc developers who must process UTF-16.
The following is my understanding of how the industry arrived at the situation where this is an issue.

Back in the 1980s Xerox was the first company to seriously examine multi-lingual text handling, at its famous Palo Alto Research Center, in the implementation of the Star platform. Xerox PARC quickly focussed on a 16-bit representation for each character, which it believed would be capable of representing all of the symbols used in living languages. Xerox called this representation Unicode. Microsoft was one of the earliest adopters of this representation, since it naturally wanted to sell licenses for M$ Word to every human being on the planet. Since the primary purpose of Visual C++ was to facilitate the implementation of Windows and M$ Office, it incorporated support, using the wchar_t type, for strings of Unicode characters. Later, after M$ had frozen their implementation, the ISO standards committee decided that in order to support processing of strings representing non-living languages (Akkadian cuneiform, Egyptian hieroglyphics, Mayan hieroglyphics, Klingon, Elvish, archaic Chinese, etc.) more than 16 bits were needed, so the adopted ISO 10646 standard requires a 32-bit word to hold every conceivable character.

A wchar_t string or std::wstring, even when a wchar_t is 16 bits in size, is not the same thing as UTF-16. A wchar_t string or std::wstring, as defined by the C, C++, and POSIX standards, contains ONE wchar_t value for each displayed glyph. Equivalently, the value of wcslen() for a wchar_t string is the same as the number of glyphs in the displayed representation of the string.
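A minimal sketch of that fixed-width model (assuming a platform, such as Linux with gcc, where every character of interest fits in one wchar_t element):

    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        /* One wchar_t per character: U+00E9 (e with acute accent)
           occupies a single element, so wcslen() matches the number
           of displayed glyphs. */
        const wchar_t *s = L"caf\u00e9";
        printf("wcslen: %zu\n", wcslen(s));   /* prints 4 */
        return 0;
    }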

In these standards the size of a wchar_t is not explicitly defined except that it must be large enough to represent every text "character". It is critical to understand that a wchar_t string, as defined by these standards, is not the same thing as a UTF-16 string, even if a wchar_t is 16 bits in size. UTF-16 may use up to TWO 16-bit words (a surrogate pair) to represent a single glyph, although I believe that almost all symbols actually used by living languages can be represented in a single word in UTF-16. I have not worked with Visual C++ recently precisely because it accepts a non-portable language. The last time I used it the M$ library was standards-compliant, with the understanding that its definition of wchar_t as a 16-bit word meant the library could not support some languages. If the implementation of wchar_t strings in the Visual C++ library has been changed to implement UTF-16 internally, then in my opinion it is not compliant with the POSIX, C, and C++ standards.
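To make the surrogate-pair mechanism concrete, here is a sketch of the encoding arithmetic; the function name to_utf16 is mine, for illustration only:

    #include <stdint.h>
    #include <stdio.h>

    /* Encode one code point as UTF-16; returns the number of 16-bit
       units written: 1 inside the Basic Multilingual Plane, 2 (a
       surrogate pair) above U+FFFF. */
    int to_utf16(uint32_t cp, uint16_t out[2])
    {
        if (cp < 0x10000) {
            out[0] = (uint16_t)cp;
            return 1;
        }
        cp -= 0x10000;
        out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
        out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
        return 2;
    }

    int main(void)
    {
        uint16_t u[2];
        /* U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP. */
        int n = to_utf16(0x1D11E, u);
        printf("%d units: %04X %04X\n", n, u[0], u[1]); /* 2 units: D834 DD1E */
        return 0;
    }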

Furthermore, UTF-8 and UTF-16 should have nothing to do with the internal representation of strings inside a C++ program. It is obviously convenient that a wchar_t * or std::wstring should contain one "word" for each external glyph, which is not true for either UTF-8 or UTF-16. UTF-8 and UTF-16 are standards for the external representation of text for transmission between applications, and in particular for writing files used to carry international text. For example UTF-8 is clearly a desirable format for the representation of C/C++ programs themselves, because so many of the characters used in the language are in the ASCII code set, each of which requires only a single byte in UTF-8. However once such a file is read into an application its contents should be represented internally using wchar_t * or std::wstring with fixed-length words. Full compliance with ISO 10646 requires that the internal representation use at least 32-bit words, although a practical implementation can get away with 16-bit words.
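A sketch of that read-then-convert step using the standard C library, assuming a UTF-8 locale such as "en_US.UTF-8" is installed (locale names vary from system to system):

    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <wchar.h>

    int main(void)
    {
        /* Assumption: a UTF-8 locale is available; without it
           mbstowcs() cannot decode UTF-8 input. */
        if (setlocale(LC_CTYPE, "en_US.UTF-8") == NULL)
            return 1;

        const char *utf8 = "caf\xc3\xa9";   /* external UTF-8 bytes */
        wchar_t internal[16];
        /* Convert to the fixed-width internal representation. */
        size_t n = mbstowcs(internal, utf8, 16);
        if (n == (size_t)-1)
            return 1;
        printf("%zu wide characters\n", n); /* prints 4 */
        return 0;
    }

On glibc systems wchar_t is 32 bits, so the converted string is exactly the fixed-width ISO 10646 internal representation argued for above.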

--
Jim Cobban   jcobban@xxxxxxxx
34 Palomino Dr.
Kanata, ON, CANADA
K2M 1M1
+1-613-592-9438

