On Thu, 30 Oct 2003 09:13:55 PST, niket@xxxxxxxxxxxxxx said:

> Forget Mongolian. Think Chinese and Hindi, plus related languages that
> use their character sets. Between the two of them you have nearly 3
> billion potential users, i.e. half the world's population. Admittedly
> not all of them are literate, and many do understand the Latin character
> set, but this is still a very large group to disenfranchise.

Right. The as-yet-unanswered question is how many would be disenfranchised by making them learn the Latin charset, compared to how many would be disenfranchised by an imperfect globalization scheme (see my comment yesterday regarding macrons and carons).

> There is a second thread to your argument which I object to. Just
> because many Internet users can understand the Latin character set does
> not mean they do not want to send stuff in their native character set,
> or be forced to use the Latin character set. Of course so far we have
> made it impossible to do so.

Note that this is discussing *addresses only*. We've had charset support for bodyparts and 2047-encoding for other header fields for *years*. I get at least 5 or 6 emails a day that have addresses of the form

    From: "kanji/big5/etc string here" <romanized.name@xxxxxxxxxxxxxxxxxxxx>

and/or have charset=utf-8 and kanji in them.

> Why place unnecessary restrictions on the Internet just because it
> results in messages that you personally can't understand?

An equally important consideration is that it results in messages that are *usable* (possibly without comprehension). If whatever scheme we decide on results in messages that I can't hit "reply" to or otherwise process, it's not doing anybody any favors.

An often overlooked aspect of the ASCII charset is that it has 52 glyphs which for the most part are visually distinctive (except for zero/oh, and one/lower-ell), so even a non-speaker can determine "have I entered the same glyphs as are on the business card?". This is not true for any of the Asian glyph sets (at least *I* can't tell easily), and I don't think that the Latin 1/A/B extension has this property either, once you start dealing with macrons, cedillas, ogoneks, carons, dots, and other ornamentation....

So my question remains: are we doing the 3 billion Asians a favor by forcing them to be able to tell the difference between e-caron and e-breve?

Are we doing *anybody* favors if we make them use rfc3490-style xn-- strings that are totally incomprehensible to anyone outside the local conclave? Remember - if they don't understand Latin charsets, a 3490-encoded address will be *painful*, even for the *owner*.

You don't believe me? Take the character string 'valdis.kletnieks', change the first e to 0113 (small e-macron), punycode it, and let me know how much mnemonic value it has. And remember - the string you get there is the sort of thing that all 3 billion Asians will get to enter (after I get my sysadmin to set up the aliases to get that punycode to actually drop into *my* mailbox).

Are you sure it's worth the effort?

It's not that I'm unsympathetic to the goals - far from it. It's just that I was there during the RFC2047 wars (which are *still* going on in the spam world, silly spammers sending around untagged 8-bit headers), and a big part of me wants to say "Oh no, not again....".
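
For anyone who wants to try the 'valdis.kletnieks' experiment above, here is a minimal sketch in Python. It assumes the standard library's "punycode" codec (RFC 3492) is an acceptable stand-in for the encoding step. Prepending "xn--" to a whole local part is purely an illustration: RFC 3490 defines that prefix for domain labels, and nothing here is a standardized way to encode mailbox names. The variable names are made up for the example.

```python
# Hypothetical illustration of the experiment described in the message:
# replace the first "e" with U+0113 (LATIN SMALL LETTER E WITH MACRON).
original = "valdis.kletnieks"
modified = original.replace("e", "\u0113", 1)

# Python's built-in "punycode" codec implements the RFC 3492 encoding.
# It returns bytes that are pure ASCII, so decoding them as ASCII is safe.
encoded = modified.encode("punycode").decode("ascii")

# RFC 3490 marks an encoded *domain label* with the ACE prefix "xn--".
# Applying it to a whole local part is an assumption made only to show
# what an "rfc3490-style" string would look like.
ace_style = "xn--" + encoded

print("typed by a human :", modified)
print("sent on the wire :", ace_style)

# The encoding is reversible, so no information is lost; the question
# raised in the message is only whether a person can read or retype it.
assert encoded.encode("ascii").decode("punycode") == modified
```

Run it and compare the two printed lines: the basic letters survive in order, minus the one that was changed, and the position and identity of the e-macron are folded into a short suffix after the last hyphen.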