Re: UTF-8, UTF-16 and UTF-32

Hi Dallas,

> Thanks for your reply, but with Pictorial languages such as Cantonese and
> Mandarin, that have up to 60,000 character in the full set (one picture for
> each word), using locality page sheets with UTF-8 is limited.

UTF-8 does not use locality page sheets.  (Are you conflating UTF-8 with
Windows code pages, as in the difference between the FooA() ANSI code page
(ACP) routines and the FooW() wide-character routines?)

UTF-8 encodes Unicode characters from U+0000 to U+10FFFF in a variable
number of octets, 1 to 4 octets (1-4 bytes).  UTF-8 supports the entire
gamut of Unicode characters.
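
To make the 1-to-4 octet scheme concrete, here is a minimal sketch of an
encoder.  The name encodeUtf8 is a hypothetical helper of my own, not part of
GCC or of any library mentioned in this thread.  It assumes the input is a
single code point and returns an empty string for anything that is not a
Unicode scalar value.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper (an illustration, not a library API): encode one
// Unicode code point as UTF-8.  Returns an empty string for surrogates
// (U+D800..U+DFFF) and for values above U+10FFFF.
std::string encodeUtf8(std::uint32_t cp)
{
  std::string out;
  if (cp <= 0x7F) {
    // 1 octet: 0xxxxxxx
    out += static_cast<char>(cp);
  } else if (cp <= 0x7FF) {
    // 2 octets: 110xxxxx 10xxxxxx
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp <= 0xFFFF) {
    if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogate, not encodable
    // 3 octets: 1110xxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp <= 0x10FFFF) {
    // 4 octets: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  assert(out.size() <= 4);  // UTF-8 never needs more than 4 octets
  return out;
}
```

With this sketch, a common Chinese character such as U+4E2D takes three
octets, and a character beyond the Basic Multilingual Plane such as U+20000
takes four.  No code page is involved at any point.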

UTF-16 encodes Unicode characters from U+0000 to U+10FFFF in a variable
number of 16-bit chunks, 1 or 2 of them (2 or 4 bytes).
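
The "1 or 2 chunks" rule can be sketched the same way.  Again, encodeUtf16 is
a hypothetical name of my own choosing, and the sketch assumes the input is a
single code point.  Code points above U+FFFF are split into a high and a low
surrogate.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (an illustration, not a library API): encode one
// Unicode code point as UTF-16 code units.  Returns an empty vector for
// surrogates (U+D800..U+DFFF) and for values above U+10FFFF.
std::vector<std::uint16_t> encodeUtf16(std::uint32_t cp)
{
  std::vector<std::uint16_t> out;
  if (cp > 0x10FFFF) return out;                 // outside Unicode
  if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // lone surrogate
  if (cp <= 0xFFFF) {
    // Basic Multilingual Plane: one 16-bit unit, value unchanged.
    out.push_back(static_cast<std::uint16_t>(cp));
  } else {
    // Supplementary planes: 20 bits split across a surrogate pair.
    const std::uint32_t v = cp - 0x10000;
    out.push_back(static_cast<std::uint16_t>(0xD800 | (v >> 10)));    // high
    out.push_back(static_cast<std::uint16_t>(0xDC00 | (v & 0x3FF)));  // low
  }
  assert(out.size() <= 2);  // UTF-16 never needs more than 2 units
  return out;
}
```

So U+4E2D is one 16-bit unit, while U+20000 (a CJK Extension B character)
needs two.  Code that assumes one 16-bit unit per character will break on the
latter.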

UTF-32 encodes Unicode characters from U+0000 to U+10FFFF in a single
32-bit chunk (4 bytes), with 11 of the 32 bits being fallow.
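
The "11 fallow bits" figure follows from U+10FFFF needing 21 bits.  A small
sketch of that arithmetic, with hypothetical names (kMaxCodePoint, bitsNeeded,
encodeUtf32) that are mine and not from any library:

```cpp
#include <cassert>
#include <cstdint>

// Largest Unicode code point.
const std::uint32_t kMaxCodePoint = 0x10FFFF;

// Number of bits needed to represent 'value' (0 for value == 0).
unsigned bitsNeeded(std::uint32_t value)
{
  unsigned n = 0;
  while (value != 0) {
    ++n;
    value >>= 1;
  }
  return n;
}

// UTF-32 is the identity mapping: the code unit is the code point itself.
// This sketch only checks its precondition with an assert.
std::uint32_t encodeUtf32(std::uint32_t cp)
{
  assert(cp <= kMaxCodePoint && !(cp >= 0xD800 && cp <= 0xDFFF));
  return cp;
}
```

Here bitsNeeded(kMaxCodePoint) is 21, which leaves 32 - 21 = 11 bits of every
UTF-32 unit unused.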

> GCC and MS VC++ are now inconsistent with their wchar_t types and this
> difference will make it nearly impossible for us to continue supporting
> Linux, i.e. in a choice between Linux and Windows, I have to follow my
> customers.

GCC and MS VC++ are not inconsistent.  Both of those compilers comply with
the ABI of the platform that they target.

There is no requirement in any platform ABI that I work with that char be
UTF-8 and wchar_t be UTF-16 or UTF-32.
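
If you want to see what the platform ABI gives you, you can ask the compiler.
A minimal sketch follows; the function names are hypothetical.  Typically GCC
targeting Linux reports 32 bits and MS VC++ targeting Windows reports 16 bits,
but neither the C nor the C++ standard fixes the width or the encoding.

```cpp
#include <climits>
#include <cstddef>

// Width of wchar_t, in bits, for the platform being targeted.
std::size_t wcharBits()
{
  return sizeof(wchar_t) * CHAR_BIT;
}

// True when a single wchar_t is wide enough to hold any code point up to
// U+10FFFF (21 bits).  This says nothing about which encoding the
// platform's wide-character routines actually assume.
bool wcharHoldsAnyCodePoint()
{
  return wcharBits() >= 21;
}
```

Portable code should treat the result as a property of the target, not of the
compiler.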

Perhaps what you need is to make your own character type (or, technically,
encoding unit type):

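#include <stdint.h>  // uint8_t, uint16_t, uint32_t

// One UTF-8 encoding unit (an octet).
struct UTF8
{
  typedef uint8_t Type;
  Type mEncodingUnit;
};

// One UTF-16 encoding unit (half of a surrogate pair, or a whole BMP character).
struct UTF16
{
  typedef uint16_t Type;
  Type mEncodingUnit;
};

// One UTF-32 encoding unit (always a whole character).
struct UTF32
{
  typedef uint32_t Type;
  Type mEncodingUnit;
};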

Or use a Unicode savvy library like ICU <http://www.icu-project.org/>.

> I am not trying to deny UTF-32 or saying that GCC should not support it, I
> am saying that GCC should support all three Unicode formats because UTF-16
> is a format that I have to deal with in the real world. Why not support all
> three formats?

GCC does not support Unicode.

Some libraries (that are not part of GCC) support Unicode.

Perhaps parts of the OS support Unicode, in some transformation format, via
the LANG environment variable, or the Windows 65001, 65005, 65006, 1200, 1201
code pages, or Mac OS X's kCFStringEncodingUnicode, kCFStringEncodingUTF8,
kCFStringEncodingUTF16, kCFStringEncodingUTF16BE, kCFStringEncodingUTF16LE,
kCFStringEncodingUTF32, kCFStringEncodingUTF32BE, kCFStringEncodingUTF32LE.

The only computer languages that I'm aware of that support Unicode are:
+ Python 2.3 (somewhat, as an opt-in transition feature)
+ Python 2.5 (somewhat)
+ Python 3.0 (very well)
+ D Programming Language (very well)
+ Java (very well)

My favorite computer languages do NOT support Unicode "out of the box" (by
"support" I mean both accepting Unicode source code and being able to target
Unicode applications):
+ C
+ C++
+ Lua

With add-on libraries and/or OS API support, discipline, and a bit of luck,
those languages can target Unicode applications.

I can't see Lua supporting Unicode "out of the box" without increasing its
tiny embedded scripting engine footprint by over an order of magnitude.

> As someone with has written a scripting language based on C++, I can tell
> you that changing the 'wchar_t' to something else would only take five
> minutes - it wouldn't break any thing.

It would break the OS ABI, which is defined by the OS, not by the compiler.

HTH,
--Eljay

