Re: UTF-8, UTF-16 and UTF-32

On Sun, Aug 24, 2008 at 01:53, Dallas Clarke <DClarke@xxxxxxxxxxxxxx> wrote:
> At the risk of sounding repetitive, the problems are:
> 1) using wchar_t for 4-byte string stuffs up function overloading - for
> example problems with shared libraries written with different 2-byte string
> type (i.e., short, unsigned short, struct UTF16{ typedef uint16_t Type; Type
> mEncodingUnit;};, etc)

Of course, code relying on sizeof(short)*CHAR_BIT being 16 isn't
technically portable anyway...
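
If code is going to rely on that, it can at least make the compiler
check it.  A minimal sketch (pre-C++0x, so no static_assert; the array
trick below fails to compile when the size is wrong):

    #include <climits>

    /* Compile error unless short is exactly 16 bits wide: the array
       gets size -1 otherwise. */
    typedef char short_is_16_bits[(sizeof(short) * CHAR_BIT == 16) ? 1 : -1];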

Using wchar_t for a 2-byte string stuffs up function overloading just
as badly - for example, problems with shared libraries written with a
different 4-byte string type (i.e., int, unsigned long, struct UTF32{
typedef uint32_t Type; Type mEncodingUnit;};, etc.)

> 2) casting error from declaring strings as unsigned short string[] =
> {'H','e','l','l','o',' ','W','o','r','l','d',0} or unsigned short *string =
> (unsigned short*)"H\0e\0l\0l\0o\0 \0W\0o\0r\0l\0d\0\0";

Why do you care how you're defining strings?  Anything that needs
localization belongs in an external "resource file" anyway, and
anything in the code can be done perfectly well with wstring s =
L"Whatever you want here, including non-ASCII"; the compiler will
be fine with it, so long as your locale is consistent.
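
A minimal sketch of what I mean (nothing platform-specific, just plain
ISO C++ wide literals; the actual text is a placeholder):

    #include <string>
    #include <iostream>

    int main() {
        // A wide literal works today with any conforming compiler.
        std::wstring s = L"Whatever you want here, including non-ASCII";
        std::wcout << s << std::endl;  // assumes a matching locale/terminal
        return 0;
    }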

I'm really not convinced that this is a real problem, since there are
many C++ projects using wxWidgets that compile fine in "Unicode" mode
in both Windows and Linux, where writing a localizable string just
means saying _T("your text here") and everything works perfectly fine.
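
For example (a sketch assuming a wxWidgets build; _T() is the macro
that expands to a wide literal in "Unicode" builds and a narrow one
otherwise):

    #include <wx/string.h>

    // The same source line compiles in both ANSI and "Unicode" builds,
    // because _T() picks the literal type to match the build.
    wxString greeting = _T("your text here");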

> 3) pointer arithmetic bugs and other portability issues consuming time,
> money and resources.

Anyone doing explicit pointer manipulation on strings in
application-level code deserves their problems.  And UTF-32 actually
has fewer possible places for pointer errors, since incrementing a
pointer actually moves to the next codepoint, something that neither
UTF-8 nor UTF-16 allows.
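
To make that concrete (a hedged sketch, not anyone's production code:
it assumes valid input, and uses uint32_t since there is no dedicated
UTF-32 type yet):

    #include <stdint.h>

    /* UTF-32: one element per codepoint, so p + 1 is the next codepoint. */
    const uint32_t *next32(const uint32_t *p) { return p + 1; }

    /* UTF-8: the lead byte encodes the sequence length, so a naive
       p + 1 can land in the middle of a character. */
    const unsigned char *next8(const unsigned char *p) {
        if (*p < 0x80)      return p + 1;  /* ASCII: one byte        */
        else if (*p < 0xE0) return p + 2;  /* 110xxxxx: two bytes    */
        else if (*p < 0xF0) return p + 3;  /* 1110xxxx: three bytes  */
        else                return p + 4;  /* 11110xxx: four bytes   */
    }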

> 4) no standard library for string functions, creating different behaviours
> from different implementations. And the standard C-Library people will not
> implement the string routines until there is a standard type for 16-bit
> strings offered by the compiler.

The ISO standards for C and C++ may not provide it, but that's
certainly not something that GCC can change.  If this is a problem,
you should have submitted a proposal to the Standards Committees.

Regardless, there are plenty of mature, open, and free libraries for
Unicode (ICU, for instance).

> The full set of MS Common Controls no longer supports -D _MBCS; this
> means I must compile with -D UNICODE and -D _UNICODE, which makes all
> the standard WINAPI use 16-bit Unicode strings as well. Rather than
> constantly convert between UTF-8 and 16-bit Unicode I am moving
> totally to 16-bit Unicode. Why is MS doing this - probably because
> they know you're not supporting 16-bit Unicode and that will force
> people like me to drop plans to port to Linux/Solaris because it is
> just too hard.

Well, "MS Common Controls" obviously aren't available on Linux either.
 Since you need to change basically your whole GUI to port it, you'd
be using a cross-platform library (like wxWidgets) that handles all
this for you.

> Once again, there are no legacy issues because no one is currently using
> 16-bit Unicode in GCC, it does not exist. Adding such support will not break
> anything. I am not arguing to stop support for 32-bit Unicode. Secondly,
> object code does not use the label "wchar_t", meaning the change would force
> people to do a global search and replace "wchar_t" to "long wchar_t" before
> their next compile. Quite a simple change compared to what I must do to
> support 16-bit strings in GCC.

On the compilation side, sure, though I really don't think
search-and-replace works as well as you think it does in that
situation.  (Certainly s/int/long int/ isn't safe.)

But that would introduce a huge number of issues.  It would mean that
changing the C library on a box to your proposed new one would break
every single binary there that used its wchar_t functions, for
example.

> It would be nice to substitute "long wchar_t" for "wchar_t" as it would not
> only be consistent with MS VC++, but also definitions of double and long
> double, long and long long, and integers as 123456789012LL. Using S"String"
> or U"String" would be too confusing with signed and unsigned.
>

BTW, are you following the standards process for C++0x?  Check
http://herbsutter.spaces.live.com/blog/cns!2D4327CC297151BB!214.entry
or the actual paper,
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2249.html

u"string" and U"string" are actually good choices because they mirror
the \u and \U escape sequences in wide-character strings from C++98.
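
For example (a sketch of what N2249 proposes; no shipping GCC supports
the new character types yet):

    // C++0x (N2249) literals, mirroring the \u and \U escapes:
    const char16_t *s16 = u"caf\u00E9";       // UTF-16 encoded
    const char32_t *s32 = U"caf\U000000E9";   // UTF-32 encoded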

> The issues of confusion between Unicode Text Format (UTF) 8, 16 and 32, are
> not only mine, but as pointed out earlier, they are constantly changing. The
> 16-bit string is a format I am forced to deal with and there is no support
> from GCC at all. I can't tell you if MS Unicode is the older style fixed
> 16-bit or it is the newer multibyte type similar to the UTF-8 definition.

I think you need to re-read the standards.  First of all, UTF is
"Unicode Transformation Format", not Text Format.  The standards are
quite specific about what UTF-8, UTF-16, and UTF-32 are.  There are
other encodings, UCS-2 for example, which are different.

> And in case you don't already know, MS VC++ compiles source code written in
> 16-bit Unicode, allowing function names, variables and strings to be
> written in 16-bit Unicode. This means that more and more files/data
> are going to be 16-bit Unicode. Developers like myself are going to
> have to deal with the format, whether we like it or not.

And GCC quite happily compiles code written in UTF-8, allowing
variable names and strings to be written in Unicode.  (Note that
there's no such thing as "16-bit Unicode".)  I suspect it'll compile
UTF-32 code as well, with similar results.

> So I have to ask - what are your arguments for not providing support for all
> three, 8-bit, 16-bit and 32-bit Unicode strings?
>

First, none of those things exist.  But I don't think I ever said that
providing support for UTF-8, UTF-16, and UTF-32 is such a terrible
thing, though I do think that it's somewhat pointless, since there are
already mature, capable libraries that do what you need, and the cost
of providing it in the compiler far outweighs the benefit.  (And it
would be completely undesirable on many embedded platforms that GCC,
C, and C++ support.)

Why not change wchar_t to UTF-16?  Largely because while it might make
a vapourware project of yours easier, it creates the same problem you
are having now with Microsoft dropping UTF-8, except without a
feasible upgrade path.  Also, UTF-32 is more convenient to deal with,
as illustrated by the strchr example.
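
To restate the strchr point in code (same hedged uint32_t assumption
as above; this is an illustration, not a proposed libc interface):

    #include <stdint.h>

    /* With a fixed-width 32-bit encoding, finding a codepoint is a
       plain element scan, exactly like strchr. */
    const uint32_t *find32(const uint32_t *s, uint32_t cp) {
        for (; *s; ++s)
            if (*s == cp)
                return s;
        return 0;  /* not found */
    }

    /* The same loop over UTF-16 units cannot even express a codepoint
       above U+FFFF, since those are stored as surrogate pairs. */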

(Though technically Unicode fundamentally has to be dealt with in
terms of multi-element characters, because of irreducible combining
codepoints, hardly anything actually supports those, so pretending
that UTF-32 is one character per element is safe in practice.  I don't
think the fonts that come with Windows even include glyphs for
combining codepoints.  That said, SIL WorldPad and a few others
actually do things properly, so really one might as well just use
UTF-8 everywhere.)

> P.S. I suggest that the strings default to the same type as the
> underlying file format, otherwise it can be overridden by expressly
> stating: A"String", L"String", LL"String".
>

Terrible idea, since it means that the legality of char const *p =
"hello world"; changes depending on the encoding of my file.  Encoding
is just that, and shouldn't change semantics.  (Just like whitespace.)
An image should look the same whether saved in PNG or TGA, and a
program should do the same thing whether saved in UTF-16 or UTF-32.
