Re: UTF-8, UTF-16 and UTF-32

"Dallas Clarke" <DClarke@xxxxxxxxxxxxxx> · Thu, 21 Aug 2008 20:17:25 +1000

Hello Andrew,

Thanks for your reply, but with pictorial languages such as Cantonese and
Mandarin, which have up to 60,000 characters in the full set (one picture
for each word), using locale code pages with UTF-8 is limited.

GCC and MS VC++ are now inconsistent with their wchar_t types and this
difference will make it nearly impossible for us to continue supporting
Linux, i.e. in a choice between Linux and Windows, I have to follow my
customers.

I am not trying to deny UTF-32 or saying that GCC should not support it; I
am saying that GCC should support all three Unicode formats, because UTF-16
is a format that I have to deal with in the real world. Why not support all
three formats?
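
For reference, here is a minimal sketch of holding one code point in each of
the three formats without touching wchar_t. It assumes a compiler with C++11
support, which postdates this thread: C++11 added char16_t and char32_t with
u"" and U"" literals. The helper names are hypothetical and chosen only for
illustration.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helpers: the same code point, U+20000 (a CJK ideograph outside
// the Basic Multilingual Plane), stored in each of the three encoding forms.
inline std::string    sample_utf8()  { return "\xF0\xA0\x80\x80"; } // 4 code units (bytes)
inline std::u16string sample_utf16() { return u"\U00020000"; }      // 2 code units (surrogate pair)
inline std::u32string sample_utf32() { return U"\U00020000"; }      // 1 code unit

// The width of wchar_t is implementation-defined; typically 4 bytes with GCC
// on Linux and 2 bytes with MS VC++ on Windows.
inline std::size_t wide_char_bytes() { return sizeof(wchar_t); }
```

Code that must exchange data between the two platforms can use the
fixed-width types and leave wchar_t to platform-specific interfaces.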

As someone who has written a scripting language based on C++, I can tell
you that changing 'wchar_t' to something else would only take five
minutes - it wouldn't break anything.

Dallas.

----- Original Message ----- 
From: "Andrew Haley" <aph@xxxxxxxxxx>
To: "Dallas Clarke" <DClarke@xxxxxxxxxxxxxx>
Cc: <gcc-help@xxxxxxxxxxx>
Sent: Thursday, August 21, 2008 7:28 PM
Subject: Re: UTF-8, UTF-16 and UTF-32

Dallas Clarke wrote:

Now that I have had time to pull myself off the ceiling, I realise the
problem is that Unix/GCC supports both UTF-8 and UTF-32, while
Windows supports UTF-8 and UTF-16. The solution is for both
Unix and Windows to support all three Unicode formats.

I have had to spend the last several days writing the UTF-16 string
functions from scratch, and realise that with a bit of common sense
everything can work out okay. Hopefully, with quick action to move
wchar_t to 2 bytes and create another type for 4-byte strings, we can
see a lot of problems solved. Maybe have UTF-16 strings with L"My String"
and UTF-32 with LL"My String" notations.
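
As an illustration of the kind of UTF-16 routine involved (not the code
referred to above), here is a minimal sketch that decodes UTF-16 code units
into UTF-32 code points. It assumes C++11's std::u16string and
std::u32string; the function name is hypothetical, and replacing unpaired
surrogates with U+FFFD is one common policy rather than the only one.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-16 code units into UTF-32 code points.
// Unpaired surrogates are replaced with U+FFFD (REPLACEMENT CHARACTER).
inline std::u32string utf16_to_utf32(const std::u16string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t unit = in[i];
        bool is_high = unit >= 0xD800 && unit <= 0xDBFF;
        if (is_high && i + 1 < in.size()
            && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Valid surrogate pair: combine the two 10-bit halves.
            char32_t low = in[i + 1];
            out.push_back(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00));
            ++i; // the low surrogate has been consumed
        } else if (unit >= 0xD800 && unit <= 0xDFFF) {
            out.push_back(0xFFFD); // unpaired surrogate
        } else {
            out.push_back(unit);   // BMP code point, copied as-is
        }
    }
    return out;
}
```

The point of the sketch is that UTF-16 is itself a variable-width encoding:
code points above U+FFFF take two code units, so a 2-byte character type
does not give one unit per character.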

Changing wchar_t would break the ABI.  It isn't going to happen.

I hope your steering committee can see that there will be lots of UTF-16
text files out there, with a lot of code required to be written to
process those files, and while UTF-8 will not support many non-Latin-based
languages, UTF-32 will not support many non-human-based languages, i.e.
no single system is fault free.

I don't think that such a change can be decreed by the GCC SC.

I don't understand your claim that "UTF-8 will not support many
non-Latin-based languages".  UTF-8 <http://tools.ietf.org/html/rfc3629> supports
everything from U+0000 to U+10FFFF.  While programs use a variety of
internal representations of characters, successful transmission of data
between machines requires a common interchange format, and UTF-8 is that
format.
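
To make the range claim concrete, here is a minimal sketch of a UTF-8
encoder following the bit layout in RFC 3629. The function name is
hypothetical, and returning an empty string for invalid input (surrogates
and values above U+10FFFF) is an assumption made for brevity. It uses
C++11's char32_t.

```cpp
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8 (RFC 3629).
// Returns an empty string for surrogates and for values above U+10FFFF.
inline std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp <= 0x7F) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {              // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        if (cp >= 0xD800 && cp <= 0xDFFF) return out; // surrogates are not encodable
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {            // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, U+4E2D (a common CJK ideograph) encodes to the three bytes
E4 B8 AD, and U+10FFFF, the last code point, encodes to F4 8F BF BF.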

Andrew.