On Tue, Aug 26, 2008 at 14:08, Jim Cobban <jcobban@xxxxxxxx> wrote:
> The definition of a wchar_t string or std::wstring, even if a wchar_t
> is 16 bits in size, is not the same thing as UTF-16.  A wchar_t string
> or std::wstring, as defined by the C, C++, and POSIX standards,
> contains ONE wchar_t value for each displayed glyph.  Equivalently,
> the value of wcslen() for a wchar_t string is the same as the number
> of glyphs in the displayed representation of the string.

One wchar_t value for each codepoint -- glyphs can be formed from
multiple codepoints.  (Combining characters and ligatures, for
example.)

> In these standards the size of a wchar_t is not explicitly defined
> except that it must be large enough to represent every text
> "character".  It is critical to understand that a wchar_t string, as
> defined by these standards, is not the same thing as a UTF-16 string,
> even if a wchar_t is 16 bits in size.  UTF-16 may use up to THREE
> 16-bit words to represent a single glyph, although I believe that
> almost all symbols actually used by living languages can be
> represented in a single word in UTF-16.  I have not worked with
> Visual C++ recently precisely because it accepts a non-portable
> language.  The last time I used it the M$ library was standards
> compliant, with the understanding that its definition of wchar_t as a
> 16-bit word meant the library could not support some languages.  If
> the implementation of the wchar_t strings in the Visual C++ library
> has been changed to implement UTF-16 internally, then in my opinion
> it is not compliant with the POSIX, C, and C++ standards.

The outdated encoding that only supports codepoints 0x0000 through
0xFFFF is called UCS-2.  UTF-16 proper needs at most TWO 16-bit code
units per codepoint -- a surrogate pair -- not three.
( See http://en.wikipedia.org/wiki/UTF-16 )

~ Scott
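
P.S. A quick sketch of the distinction, assuming only a compiler that
accepts universal character names in wide string literals; the platform
behaviour noted in the comments is the usual one (16-bit wchar_t on
Visual C++, 32-bit on most Unix-like systems), not a guarantee:

#include <iostream>
#include <string>

int main()
{
    std::cout << "sizeof(wchar_t) = " << sizeof(wchar_t) << " bytes\n";

    // U+1D11E MUSICAL SYMBOL G CLEF lies outside the Basic Multilingual
    // Plane.  With a 16-bit wchar_t it is stored as a surrogate pair,
    // so length() is 2; with a 32-bit wchar_t length() is 1.
    std::wstring clef = L"\U0001D11E";
    std::cout << "wchar_t units in U+1D11E: " << clef.length() << "\n";

    // Two codepoints -- "e" followed by U+0301 COMBINING ACUTE ACCENT --
    // render as the single glyph "e with acute", so even a 32-bit
    // wchar_t count is a codepoint count, not a glyph count.
    std::wstring e_acute = L"e\u0301";
    std::cout << "wchar_t units in e + combining acute: "
              << e_acute.length() << "\n";
    return 0;
}

In other words, wcslen()/length() counts wchar_t units; the count only
matches the number of glyphs when one unit is one codepoint and one
codepoint is one glyph.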