On Wed, 20 Jul 2005, Klaus Schmidinger (KS) wrote:

> >
> > I think the confusion comes from the assumption that a character is
> > exactly one byte long.
> >
> > strlen counts bytes not characters.
> > in utf-8 a character can be up to 4 (or was it 8) bytes long.
> >
> > IIRC, there are new functions to count characters (wstrlen, wstrcmp,
> > etc.)
>
> Aren't you confusing this with "wide character" functions?

Yes, I am talking about wide characters (the standard names are wcslen,
wcscmp, etc., rather than wstrlen and wstrcmp). I don't think I am
confusing anything (correct me if I'm wrong).

From the glibc manual:

> Introduction to Extended Characters
>
> A variety of solutions is available to overcome the differences between
> character sets with a 1:1 relation between bytes and characters and
> character sets with ratios of 2:1 or 4:1. [...]
>
> As shown in some other part of this manual, a completely new family has
> been created of functions that can handle wide character texts in
> memory. The most commonly used character sets for such internal wide
> character representations are Unicode and ISO 10646 [...] Unicode was
> originally planned as a 16-bit character set; whereas, ISO 10646 was
> designed to be a 31-bit large code space. [...]
>
> UTF-8 is an ASCII compatible encoding where ASCII characters are
> represented by ASCII bytes and non-ASCII characters by sequences of 2-6
> non-ASCII bytes [...]
>
> To represent wide characters the char type is not suitable.
> For this reason the ISO C standard introduces [...] wchar_t,
> [...]

Sergei
-- 
--------------------------------------------------------------------     -?)
         eMail:       Sergei.Haller@xxxxxxxxxxxxxxxxxxx                  /\\
--------------------------------------------------------------------    _\_V
Be careful of reading health books, you might die of a misprint.
                -- Mark Twain