Re: UTF-8, UTF-16 and UTF-32

Dallas Clarke wrote:
This was never a debate about the merits of UTF-8 versus UTF-16 versus UTF-32, nor about the merits of Linux/Solaris versus Windows. It was a discussion about the lack of standardized support for UTF-16 in gcc: with the move by Microsoft to deprecate UTF-8 in favour of UTF-16, life will become increasingly difficult for all gcc developers who must process UTF-16.
The following is my understanding of how the industry arrived at the situation where this is an issue.

Back in the 1980s Xerox was the first company to seriously examine multi-lingual text handling, at its famous Palo Alto Research Center, in the implementation of the Star platform. Xerox PARC quickly focussed on a 16-bit representation for each character, which it believed would be capable of representing all of the symbols used in living languages. Xerox called this representation Unicode. Microsoft was one of the earliest adopters of this representation, since it naturally wanted to sell licenses for M$ Word to every human being on the planet. Since the primary purpose of Visual C++ was to facilitate the implementation of Windows and M$ Office, it incorporated support, using the wchar_t type, for strings of Unicode characters. Later, after M$ had frozen their implementation, the ISO standards committee decided that in order to support processing of strings representing non-living languages (Akkadian cuneiform, Egyptian hieroglyphics, Mayan hieroglyphics, Klingon, Elvish, archaic Chinese, etc.) more than 16 bits were needed, so the adopted ISO 10646 standard requires a 32-bit word to hold every conceivable character.

A wchar_t string or std::wstring, even when a wchar_t is 16 bits in size, is not the same thing as UTF-16. A wchar_t string or std::wstring, as defined by the C, C++, and POSIX standards, contains ONE wchar_t value for each displayed glyph. Equivalently, the value of wcslen() for a wchar_t string is the same as the number of glyphs in the displayed representation of the string.
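A minimal sketch of that fixed-width model (assuming a platform, such as Linux with gcc, where every character of interest fits in one wchar_t element):

    #include <stdio.h>
    #include <wchar.h>

    int main(void)
    {
        /* One wchar_t per character: U+00E9 (e with acute accent)
           occupies a single element, so wcslen() matches the number
           of displayed glyphs. */
        const wchar_t *s = L"caf\u00e9";
        printf("wcslen: %zu\n", wcslen(s));   /* prints 4 */
        return 0;
    }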

In these standards the size of a wchar_t is not explicitly defined except that it must be large enough to represent every text "character". It is critical to understand that a wchar_t string, as defined by these standards, is not the same thing as a UTF-16 string, even if a wchar_t is 16 bits in size. UTF-16 may use up to TWO 16-bit words (a surrogate pair) to represent a single glyph, although I believe that almost all symbols actually used by living languages can be represented in a single word in UTF-16. I have not worked with Visual C++ recently precisely because it accepts a non-portable language. The last time I used it the M$ library was standards-compliant, with the understanding that its definition of wchar_t as a 16-bit word meant the library could not support some languages. If the implementation of wchar_t strings in the Visual C++ library has been changed to implement UTF-16 internally, then in my opinion it is not compliant with the POSIX, C, and C++ standards.
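To make the surrogate-pair mechanism concrete, here is a sketch of the encoding arithmetic; the function name to_utf16 is mine, for illustration only:

    #include <stdint.h>
    #include <stdio.h>

    /* Encode one code point as UTF-16; returns the number of 16-bit
       units written: 1 inside the Basic Multilingual Plane, 2 (a
       surrogate pair) above U+FFFF. */
    int to_utf16(uint32_t cp, uint16_t out[2])
    {
        if (cp < 0x10000) {
            out[0] = (uint16_t)cp;
            return 1;
        }
        cp -= 0x10000;
        out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
        out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
        return 2;
    }

    int main(void)
    {
        uint16_t u[2];
        /* U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP. */
        int n = to_utf16(0x1D11E, u);
        printf("%d units: %04X %04X\n", n, u[0], u[1]); /* 2 units: D834 DD1E */
        return 0;
    }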

Furthermore, UTF-8 and UTF-16 should have nothing to do with the internal representation of strings inside a C++ program. It is obviously convenient that a wchar_t * or std::wstring should contain one "word" for each external glyph, which is not true for either UTF-8 or UTF-16. UTF-8 and UTF-16 are standards for the external representation of text for transmission between applications, and in particular for writing files used to carry international text. For example UTF-8 is clearly a desirable format for the representation of C/C++ programs themselves, because so many of the characters used in the language are in the ASCII code set, each of which requires only a single byte in UTF-8. However once such a file is read into an application its contents should be represented internally using wchar_t * or std::wstring with fixed-length words. Full compliance with ISO 10646 requires that the internal representation use at least 32-bit words, although a practical implementation can get away with 16-bit words.
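A sketch of that read-then-convert step using the standard C library, assuming a UTF-8 locale such as "en_US.UTF-8" is installed (locale names vary from system to system):

    #include <locale.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <wchar.h>

    int main(void)
    {
        /* Assumption: a UTF-8 locale is available; without it
           mbstowcs() cannot decode UTF-8 input. */
        if (setlocale(LC_CTYPE, "en_US.UTF-8") == NULL)
            return 1;

        const char *utf8 = "caf\xc3\xa9";   /* external UTF-8 bytes */
        wchar_t internal[16];
        /* Convert to the fixed-width internal representation. */
        size_t n = mbstowcs(internal, utf8, 16);
        if (n == (size_t)-1)
            return 1;
        printf("%zu wide characters\n", n); /* prints 4 */
        return 0;
    }

On glibc systems wchar_t is 32 bits, so the converted string is exactly the fixed-width ISO 10646 internal representation argued for above.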

--
Jim Cobban   jcobban@xxxxxxxx
34 Palomino Dr.
Kanata, ON, CANADA
K2M 1M1
+1-613-592-9438

