Re: UTF-8, UTF-16 and UTF-32

Andrew Haley <aph@xxxxxxxxxx> · Tue, 26 Aug 2008 19:32:12 +0100

Jim Cobban wrote:

> Furthermore UTF-8 and UTF-16 should have nothing to do with the
> internals of the representation of strings inside a C++
> program.  It is obviously convenient that a wchar_t * or
> std::wstring should contain one "word" for each external glyph,
> which is not true for either UTF-8 or UTF-16.  UTF-8 and UTF-16
> are standards for the external representation of text for
> transmission between applications, and in particular for
> writing files used to carry international text.  For example
> UTF-8 is clearly a desirable format for the representation of
> C/C++ programs themselves, because so many of the characters
> used in the language are limited to the ASCII code set, which
> requires only 8 bits to represent in UTF-8.
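
For anyone who wants to check the code-unit counts behind that claim, here is a minimal sketch in Python (not part of the original thread; the helper name code_units and the sample characters are my own choices for illustration). It counts how many 8-, 16- and 32-bit units a single code point needs in each encoding form.

```python
# Hypothetical helper: number of code units one code point occupies
# in each Unicode encoding form.
def code_units(ch: str) -> dict:
    return {
        "utf-8": len(ch.encode("utf-8")),            # 8-bit units
        "utf-16": len(ch.encode("utf-16-le")) // 2,  # 16-bit units
        "utf-32": len(ch.encode("utf-32-le")) // 4,  # 32-bit units
    }

# Sample code points: ASCII letter, e-acute, euro sign, musical G clef.
for ch in ["A", "\u00e9", "\u20ac", "\U0001d11e"]:
    print(f"U+{ord(ch):04X}", code_units(ch))
```

Only UTF-32 uses exactly one unit for every code point here; ASCII takes a single byte in UTF-8, and the G clef (outside the Basic Multilingual Plane) takes two 16-bit units in UTF-16.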

Just in case anyone thinks that UTF-16 might be a good format for saving
data in files or for data to be sent over a network, here's a gem from
Microsoft:

'The example in the documentation didn't specify Little Endian, so
the Unicode string that the code generates is Big Endian.  The
SQL Server Driver for PHP expected Little Endian, so the data
written to SQL Server is not what was expected.  However, because
the code to retrieve the data converts the string from Big Endian
back to UTF-8, the resulting string in the example matches the
original string.

'If you change the Unicode charset in the example from "UTF-16"
to "UCS-2LE" or "UTF-16LE" in both calls to iconv, you'll still
see the original and resulting strings match but now you'll also
see that the code sends the expected data to the database.'

http://forums.microsoft.com/msdn/ShowPost.aspx?PostID=3644735&SiteID=1
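
The failure mode described there is easy to reproduce. The following is a rough sketch in Python rather than the PHP iconv calls from the quoted post (the sample string is an arbitrary assumption). It shows that the same text has two different UTF-16 byte sequences, that reading one as the other silently yields different text, and that a round trip through the same byte order hides the problem.

```python
text = "h\u00e9llo"  # arbitrary sample string, chosen for illustration

be = text.encode("utf-16-be")  # big-endian bytes, no byte order mark
le = text.encode("utf-16-le")  # little-endian bytes, no byte order mark

# Same text, two different byte sequences.
print(be)
print(le)

# A reader that assumes the other byte order gets different text, no error.
misread = be.decode("utf-16-le")
print(misread == text)

# Encoding and decoding with the same byte order restores the original,
# so a round-trip test does not reveal that the stored bytes were wrong.
round_trip = text.encode("utf-16-be").decode("utf-16-be")
print(round_trip == text)

# Python's bare "utf-16" codec prepends a byte order mark so that a
# reader can detect the order instead of guessing.
with_bom = text.encode("utf-16")
print(with_bom[:2] in (b"\xff\xfe", b"\xfe\xff"))
```

UTF-8 has no byte order to get wrong, which is the point of the complaint above.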

Andrew.