Re: Unicode points

On Feb 24, 2005, at 2:53 PM, Bruce Lilly wrote:

> o 16-bit Unicode matched well with 16-bit wchar_t

wchar_t is 32 bits on all the computers near me. This is one reason why UTF-16 is irritating for the C programmer.
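(A minimal illustrative sketch, not from the original message: checking the local wchar_t width and decoding a UTF-16 surrogate pair by hand -- the bookkeeping a C programmer faces when wchar_t holds whole 32-bit code points but the data arrives as UTF-16. The code point U+1D11E is just an example.)

    /* Print wchar_t width, then decode a UTF-16 surrogate pair. */
    #include <stdio.h>
    #include <stdint.h>
    #include <wchar.h>

    int main(void) {
        printf("wchar_t is %zu bits here\n", sizeof(wchar_t) * 8);

        /* U+1D11E MUSICAL SYMBOL G CLEF as a UTF-16 surrogate pair */
        uint16_t utf16[] = { 0xD834, 0xDD1E };
        uint32_t cp = 0x10000
            + ((uint32_t)(utf16[0] - 0xD800) << 10)
            + (utf16[1] - 0xDC00);
        printf("decoded code point: U+%04X\n", (unsigned)cp); /* U+1D11E */
        return 0;
    }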


> o while the raw data size doubles in going from 16 bits per character
>   to 32 bits, the size of tables (normalization, etc.) indexed by
>   character increases by more than 4 orders of magnitude. [yes,
>   table compression can be used -- provided the locations and sizes
>   of "holes" are guaranteed -- but that requires additional
>   computational power]

Unicode data is usually persisted as either UTF-8 or UTF-16, so the fact that 21 bits are potentially available is irrelevant to the space actually occupied. For everyday in-memory character processing, consensus is building around UTF-8 (in C/C++ land) and UTF-16 (in Java/C# land). I expect wchar_t to become increasingly irrelevant.
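(A rough sketch of the "space actually occupied" point, my own illustration rather than anything from the thread: in UTF-8 the byte count depends only on the code point's magnitude, so mostly-ASCII or mostly-BMP text stays compact no matter how large the 21-bit code space is.)

    /* Bytes a code point occupies in UTF-8, by magnitude. */
    #include <stdio.h>
    #include <stdint.h>

    static int utf8_len(uint32_t cp) {
        if (cp < 0x80)    return 1;  /* ASCII */
        if (cp < 0x800)   return 2;
        if (cp < 0x10000) return 3;  /* rest of the BMP */
        return 4;                    /* supplementary planes */
    }

    int main(void) {
        uint32_t samples[] = { 0x41, 0xE9, 0x4E2D, 0x1D11E };
        for (int i = 0; i < 4; i++)
            printf("U+%04X -> %d byte(s) in UTF-8\n",
                   (unsigned)samples[i], utf8_len(samples[i]));
        return 0;
    }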


It would be convenient if you could use 64k bitmaps to define character classes, but you can't, so naive implementations are simply not suitable.
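(One standard replacement for the flat 64k bitmap is an inversion list: sorted range boundaries plus a binary search, so a character class costs space proportional to its number of ranges rather than to the whole code space. A minimal sketch -- the ranges below are made up for illustration and are not a real Unicode property.)

    /* Inversion-list membership test for a character class. */
    #include <stdio.h>
    #include <stdint.h>

    /* boundaries[2k] starts a range, boundaries[2k+1] is one past its end */
    static const uint32_t boundaries[] =
        { 0x30, 0x3A, 0x660, 0x66A, 0x1D7CE, 0x1D800 };
    static const int nbounds = 6;

    static int in_class(uint32_t cp) {
        int lo = 0, hi = nbounds;  /* count boundaries <= cp */
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            if (boundaries[mid] <= cp) lo = mid + 1; else hi = mid;
        }
        return lo & 1;  /* odd insertion point => inside a range */
    }

    int main(void) {
        /* '5' is in, 'A' is out, U+1D7D0 is in */
        printf("%d %d %d\n", in_class(0x35), in_class(0x41),
               in_class(0x1D7D0));
        return 0;
    }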

My experience since 1996 is that the Unicode people are neither capricious nor malicious. I gather yours is different. -Tim


