Re: UTF-8, UTF-16 and UTF-32

"Dallas Clarke" <DClarke@xxxxxxxxxxxxxx> · Sat, 23 Aug 2008 12:36:50 +1000

Hello Scott,

I guess that ASCII would be char, UTF-8 would be unsigned char, UTF-16 would
be wchar_t and UTF-32 would be long wchar_t. But it would be more appropriate
simply to have three sizes of string, i.e. 8-bit, 16-bit and 32-bit, and
the ability to have const 16-bit strings, for example:

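// Non-const overload: returns a pointer to the first element equal to chr,
// or NULL if chr does not occur before the terminator.
wchar_t* strchr(wchar_t *string, wchar_t chr){
   while(*string != L'\0' && *string != chr) ++string;
   if(*string == chr) return string;   // also matches when chr is the terminator
   return NULL;
}

// Const overload: same search, usable on const 16-bit strings.
const wchar_t* strchr(const wchar_t *string, wchar_t chr){
   while(*string != L'\0' && *string != chr) ++string;
   if(*string == chr) return string;   // also matches when chr is the terminator
   return NULL;
}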
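
One caveat on the signature: a search that takes a single 16-bit chr can
only match one code unit, so it cannot be asked for a character outside the
Basic Multilingual Plane, which UTF-16 stores as a surrogate pair. A rough
sketch of a code-point-aware variant follows. The name utf16_find is
hypothetical, and the sketch assumes well-formed, zero-terminated UTF-16 and
a compiler that provides the C++11 char16_t and char32_t types; it is one
possible shape, not a proposal for the library.

```cpp
// Hypothetical helper (not a standard function): find the first occurrence
// of the code point cp in a zero-terminated, well-formed UTF-16 string.
// Returns a pointer to the first code unit of the match, or nullptr.
const char16_t* utf16_find(const char16_t* s, char32_t cp) {
    // Lone surrogates and values above U+10FFFF are not valid code points.
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return nullptr;

    if (cp < 0x10000) {
        // BMP code point: exactly one code unit, so a per-unit scan suffices.
        for (;; ++s) {
            if (static_cast<char32_t>(*s) == cp) return s;  // finds the terminator when cp == 0
            if (*s == 0) return nullptr;
        }
    }

    // Supplementary code point: encode it as a high/low surrogate pair.
    const char32_t v = cp - 0x10000;
    const char16_t hi = static_cast<char16_t>(0xD800 + (v >> 10));
    const char16_t lo = static_cast<char16_t>(0xDC00 + (v & 0x3FF));
    for (; *s != 0; ++s) {
        // s[1] is readable here: s[0] is not the terminator, so at worst
        // s[1] is the terminator itself.
        if (s[0] == hi && s[1] == lo) return s;
    }
    return nullptr;
}
```

With this shape the needle is a 32-bit code point while the haystack stays
16-bit, e.g. utf16_find(text, 0x1D11E) returns a pointer to the high
surrogate of the pair that encodes U+1D11E.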

Cheers,
Dallas.
http://www.ekkySoftware.com/

----- Original Message ----- 
From: "me22" <me22.ca@xxxxxxxxx>
To: "Dallas Clarke" <DClarke@xxxxxxxxxxxxxx>
Cc: "Eljay Love-Jensen" <eljay@xxxxxxxxx>; "GCC-help" <gcc-help@xxxxxxxxxxx>
Sent: Saturday, August 23, 2008 12:12 PM
Subject: Re: UTF-8, UTF-16 and UTF-32

On Fri, Aug 22, 2008 at 21:37, Dallas Clarke <DClarke@xxxxxxxxxxxxxx> 
wrote:

> Standardise: - sizeof(char) = 1; sizeof(wchar_t) = 2; and sizeof(long
> wchar_t) = 4.

Do you mean "standardize char as UTF-8, wchar_t as UTF-16, and long
wchar_t as UTF-32"?  Because that's not what you said, even if (on
POSIX, but not necessarily C or C++) the sizes would be appropriate.

> Implement all the string functions: - strcmp(); mbscmp(); wcscmp(); and
> lcscmp().

How exactly do you plan on implementing strchr for UTF-16?
Specifically, what would its signature be?

~ Scott