Re: UTF-8, UTF-16 and UTF-32

Hi Dallas,

> Thank you for taking the time to reply to my earlier message. The problem I
> have is with processing 16-bit strings in my source code, processing other
> 16-bit text files, having to write all the 16-bit library functions myself,
> and the L"" notation that returns a 32-bit string - so I need to declare all
> strings like unsigned short string[] =
> {'H','e','l','l','o',' ','W','o','r','l','d',0}; and finally the fact that
> both types are called wchar_t and I can't redefine the type.

Yes, that is what you have to do, although I'd use the C99-ism uint16_t
rather than unsigned short.
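
For example, here is the same declaration from your message, just with the
fixed-width type instead of unsigned short:

#include <stdint.h>   /* uint16_t */

/* Same string as before, widened literal-by-literal. */
uint16_t string[] = {'H','e','l','l','o',' ','W','o','r','l','d',0};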

As a convenience, you could write a MakeUtf16StringFromASCII routine to
convert ASCII to a Utf16String:

#include <string>      // std::basic_string

#if MAC
#include <MacTypes.h>  // UniChar
typedef UniChar Utf16EncodingUnit;
#elif WIN
#ifndef UNICODE
#error Need -DUNICODE
#endif
#include <windows.h>   // WCHAR
typedef WCHAR Utf16EncodingUnit;
#else
#include <stdint.h>    // uint16_t
typedef uint16_t Utf16EncodingUnit;
#endif

typedef std::basic_string<Utf16EncodingUnit> Utf16String;

Utf16String s = MakeUtf16StringFromASCII("Hello world");

NOTE: Windows WCHAR may be wchar_t or may be unsigned short.  The TCHAR
flippy type is either WCHAR or char, depending on -DUNICODE or not.
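
One way that helper might look (just a sketch, and it assumes the input
really is 7-bit ASCII so each byte maps to exactly one UTF-16 code unit;
anything beyond ASCII needs a real transcoding step):

Utf16String MakeUtf16StringFromASCII(const char* ascii)
{
    Utf16String result;
    for ( ; *ascii != '\0'; ++ascii)
    {
        // Each 7-bit ASCII byte is its own UTF-16 code unit.
        result += static_cast<Utf16EncodingUnit>(
                      static_cast<unsigned char>(*ascii));
    }
    return result;
}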

> I am not saying to drop support for UTF-32 in favour of UTF-16, but I am
> saying that with Microsoft's and Macintosh's decision to support 16-bit
> Unicode files, and the fact that Vista does not fully support multibyte
> characters, I am forced to use Unicode. If GCC chooses not to support 16-bit
> strings, or does so in a way which doesn't enable portability between
> Windows and Linux, then I will in all likelihood be forced to stop porting
> to Linux.
> 
> The choice is:-
> a) Windows + Linux + Solaris; or
> b) Windows.
> 
> It is delusional to think that there is a third choice.

A C or C++ L"blah blah" string literal is not Unicode.

A wchar_t is not Unicode.

C and C++ do not support Unicode.  (They also don't not support it.  Rather,
it's unspecified.)  None of UTF-8, UTF-16, or UTF-32 is supported.

If you want Unicode string handling in your code, L"blah" is the wrong way
to go about it.  It's "wrong" in the sense that it does not do what you want
it to do.

(I really really *WISH* it did do what you want it to do, since I want that
facility too.  But, alas, it does not.  It's not portable as Unicode, since
it's not Unicode in any portable / guaranteed / reliable sense.)
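
To see the non-portability for yourself, compile and run something like this
on Windows and then on Linux; the sizes it prints differ because the width of
wchar_t and the encoding of L"..." literals are implementation-defined:

#include <cstdio>

int main()
{
    // wchar_t is 2 bytes on Windows but typically 4 on Linux/Solaris, and
    // the standard does not say what encoding the L"..." literal uses.
    std::printf("sizeof(wchar_t)        = %u\n",
                static_cast<unsigned>(sizeof(wchar_t)));
    std::printf("sizeof(L\"Hello world\") = %u\n",
                static_cast<unsigned>(sizeof(L"Hello world")));
    return 0;
}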

> Adding support for 16-bit strings will not break any current ABI
> convention

Yes, it does violate the ABI.  The platform ABI is not specified by GCC, so
GCC is not free to change it unilaterally.

Sincerely,
--Eljay

