Re: Troubles with UTF-8

Harald Tveit Alvestrand <harald@xxxxxxxxxxxxx> · Mon, 26 Dec 2005 03:41:25 +0100

--On 23 December 2005 11:36 +0100 "Tom.Petch" <sisyphus@xxxxxxxxxxxxxx> 
wrote:

A) Character set.  UTF-8 implicitly specifies the use of Unicode/IS10646
which contains 97,000 - and rising - characters.  Some (proposed)
standards limit themselves to 0000..007F, which is not at all
international, others to 0000-00FF, essentially Latin-1, which suits many
Western languages but is not truly international.  Is 97,000 really
appropriate or should there be a defined subset?

I think Ned has answered most of your other points, so I'll chime in on 
this one.

My opinion: ALL attempts at defining a "useful" character set of any size 
between 128 and "all you can eat" for use internationally have been dismal 
failures. Each has been used in some niche; sooner or later there's a need 
to work outside that box, and gateways or other forms of self-torture 
result. (Alvestrand's equality: gateways = pain).

At the moment, the only reasonable candidate for an "all you can eat" 
character set is the Unicode charset. All other alternatives, including the 
bizarrely byzantine character set switching schemes of ISO 2022, are 
basically dead in the marketplace.

So there are only two real choices for charset left: ASCII and Unicode.

ASCII is unsuitable for any language except the technologists' simplified 
version of English. So if you want text, and want it to work 
internationally, there's only one choice left.

Subsets are a mistake.
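
To make the subset problem concrete, here is a minimal sketch. It is not 
from the original thread: the sample strings and the helper names 
(representable, coverage) are assumptions chosen for illustration. It 
checks which of the three repertoires under discussion (0000..007F as 
"ascii", 0000..00FF as "latin-1", and full Unicode as "utf-8") can 
represent each sample.

```python
# Illustrative only: the sample strings are assumptions, not from the thread.
SAMPLES = {
    "English": "plain text",
    "Norwegian": "blåbærsyltetøy",
    "Greek": "κείμενο",
    "Japanese": "日本語",
}

# Repertoires mentioned in the thread, expressed as Python codec names.
CHARSETS = ["ascii", "latin-1", "utf-8"]


def representable(text, charset):
    """Return True if every character of text can be encoded in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


def coverage(samples=SAMPLES, charsets=CHARSETS):
    """Map each sample name to the list of charsets that can carry it."""
    return {
        name: [cs for cs in charsets if representable(text, cs)]
        for name, text in samples.items()
    }


if __name__ == "__main__":
    for name, usable in coverage().items():
        print(f"{name}: {', '.join(usable)}")
```

Under these assumptions, only the English sample fits in ASCII, Latin-1 
additionally covers the Norwegian one, and the Greek and Japanese samples 
need the full repertoire. Any text that falls outside a chosen subset has 
to be converted or rejected at a gateway.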

                           Harald

_______________________________________________

Ietf@xxxxxxxx
https://www1.ietf.org/mailman/listinfo/ietf