Re: Unicode and C++ (GCC 4.0)

Eljay Love-Jensen <eljay@xxxxxxxxx> · Mon, 08 Aug 2005 06:15:41 -0500

Hi Christian,

> a UTF-8 character may (sometimes) be even 16bit of size; think about german
> umlauts, they get escaped somehow, and then follows the code for them.

The Utf8Char class does not represent a full Unicode codepoint; it
represents a single 8-bit code unit.  A Unicode codepoint is encoded as
one to four 8-bit UTF-8 code units.

A Unicode codepoint fits in 21 bits (U+0000 through U+10FFFF).

NOTE: An ISO/IEC 10646 codepoint is 31 bits.  I'm not concerned about
ISO/IEC 10646.
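
To make the "one to four 8-bit code units" point concrete, here is a
minimal sketch of encoding a single codepoint as UTF-8.  It is only an
illustration, not the Utf8Char class discussed in this thread; the name
append_utf8 and its signature are hypothetical, and it assumes the 21-bit
Unicode range rather than the wider ISO/IEC 10646 one.

```cpp
#include <string>

// Hypothetical helper for illustration (not the Utf8Char class).
// Appends the UTF-8 encoding of one Unicode codepoint to `out`.
// Returns the number of 8-bit code units written (1 to 4), or 0 if
// `cp` is above U+10FFFF or is a surrogate (U+D800..U+DFFF).
inline int append_utf8(unsigned long cp, std::string& out)
{
    if (cp > 0x10FFFFul || (cp >= 0xD800ul && cp <= 0xDFFFul))
        return 0;                       // not encodable, nothing appended

    if (cp < 0x80ul) {                  // 7 bits: 0xxxxxxx
        out += static_cast<char>(cp);
        return 1;
    }
    if (cp < 0x800ul) {                 // 11 bits: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000ul) {               // 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
        return 3;
    }
    // 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
    return 4;
}
```

With this sketch, the German umlaut U+00FC comes out as the two code units
0xC3 0xBC.  There is no escape followed by a code; both bytes carry part
of the codepoint's bits.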

> This really can't work either. Did you get this to compile?

Sorry, copy-n-paste error.  I did get something similar to compile.  The
example was supposed to be illustrative.

> Sorry for not being of more help, but you could/should google for "encoding"
> and have a look around for some other implementations / how they're doing it
> (the en[/de]coding-way).

I've Googled.  I've spoken with various members of the C++ steering
committee.  I've asked on several forums.

I did get some useful pointers from Martin York of Symantec.

I was directed to Standard C++ IOStreams and Locales by Langer and Kreft,
which I haven't picked up yet but will shortly.

Thanks,
--Eljay