On Tue, Aug 26, 2008 at 14:08, Jim Cobban <jcobban@xxxxxxxx> wrote:
> The definition of a wchar_t string or std::wstring, even if a wchar_t
> is 16 bits in size, is not the same thing as UTF-16.  A wchar_t string
> or std::wstring, as defined by the C, C++, and POSIX standards,
> contains ONE wchar_t value for each displayed glyph.  Equivalently,
> the value of wcslen() for a wchar_t string is the same as the number
> of glyphs in the displayed representation of the string.

One wchar_t value for each codepoint -- glyphs can be formed from
multiple codepoints.  (Combining characters and ligatures, for
example.)

> In these standards the size of a wchar_t is not explicitly defined
> except that it must be large enough to represent every text
> "character".  It is critical to understand that a wchar_t string, as
> defined by these standards, is not the same thing as a UTF-16 string,
> even if a wchar_t is 16 bits in size.  UTF-16 may use up to THREE
> 16-bit words to represent a single glyph, although I believe that
> almost all symbols actually used by living languages can be
> represented in a single word in UTF-16.  I have not worked with
> Visual C++ recently precisely because it accepts a non-portable
> language.  The last time I used it the M$ library was standards
> compliant, with the understanding that its definition of wchar_t as a
> 16-bit word meant the library could not support some languages.  If
> the implementation of the wchar_t strings in the Visual C++ library
> has been changed to implement UTF-16 internally, then in my opinion
> it is not compliant with the POSIX, C, and C++ standards.

The outdated encoding that only supports codepoints 0x0000 through
0xFFFF is called UCS-2.  UTF-16 proper needs at most TWO 16-bit code
units per codepoint -- a surrogate pair -- not three.
( See http://en.wikipedia.org/wiki/UTF-16 )

~ Scott
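
P.S. A quick sketch of the distinction, assuming only a compiler that
accepts universal character names in wide string literals; the platform
behaviour noted in the comments is the usual one (16-bit wchar_t on
Visual C++, 32-bit on most Unix-like systems), not a guarantee:

#include <iostream>
#include <string>

int main()
{
    std::cout << "sizeof(wchar_t) = " << sizeof(wchar_t) << " bytes\n";

    // U+1D11E MUSICAL SYMBOL G CLEF lies outside the Basic Multilingual
    // Plane.  With a 16-bit wchar_t it is stored as a surrogate pair,
    // so length() is 2; with a 32-bit wchar_t length() is 1.
    std::wstring clef = L"\U0001D11E";
    std::cout << "wchar_t units in U+1D11E: " << clef.length() << "\n";

    // Two codepoints -- "e" followed by U+0301 COMBINING ACUTE ACCENT --
    // render as the single glyph "e with acute", so even a 32-bit
    // wchar_t count is a codepoint count, not a glyph count.
    std::wstring e_acute = L"e\u0301";
    std::cout << "wchar_t units in e + combining acute: "
              << e_acute.length() << "\n";
    return 0;
}

In other words, wcslen()/length() counts wchar_t units; the count only
matches the number of glyphs when one unit is one codepoint and one
codepoint is one glyph.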