Users suffer every day from the current character-encoding mess. Anyone who ventures beyond ASCII is faced with endless trouble.

There's no theoretical obstacle to making multiple character encodings work: make sure that there is, implicitly or explicitly, a perfectly clear character encoding for every byte stream, and make sure that every copy and comparison includes all necessary conversions. In practice, though, this is a disaster. It's too much work to specify the encodings. It's far too much work to program all the necessary conversions.

Look at IDNA. Yet another character encoding. An amazingly unclear specification of which byte streams are supposed to use that encoding. Massive redeployment of, at a minimum, every web browser in the world. And that's just for domain names! Is every worldwide identifier (mailbox names, for example) supposed to have its own massive upgrade?

UTF-8 offers a way out of this mess. We do _one_ upgrade to make sure that UTF-8 works everywhere. For example, RFC 2277, IETF Policy on Character Sets and Languages, requires UTF-8 support in all protocols. Then we convert all stored data to UTF-8. Then, finally, we can drop support for the other character encodings.

Do we want programmers in twenty years to be faced with the same mess that we have today? Or do we want them focusing on positive features for the users?

Keith Moore writes:

> The on-the-wire encoding of IDNs is irrelevant; what matters is the
> behavior experienced by users.

Everything is judged by the user experience, yes, but you are clearly incorrect in saying that the encoding is irrelevant.

---D. J. Bernstein, Associate Professor, Department of Mathematics,
Statistics, and Computer Science, University of Illinois at Chicago
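
A minimal sketch of the conversion bookkeeping described above, in Python. The same_text helper and the per-stream encoding labels are hypothetical illustrations of the metadata that every byte stream would need to carry; they are not part of any protocol:

    def same_text(stream_a, encoding_a, stream_b, encoding_b):
        """Compare two labeled byte streams by decoding both sides first."""
        return stream_a.decode(encoding_a) == stream_b.decode(encoding_b)

    # The same user-visible text, stored under two different encodings:
    latin1_bytes = "café".encode("latin-1")   # b'caf\xe9'
    utf8_bytes = "café".encode("utf-8")       # b'caf\xc3\xa9'

    assert latin1_bytes != utf8_bytes         # the raw bytes disagree...
    assert same_text(latin1_bytes, "latin-1",
                     utf8_bytes, "utf-8")     # ...but the text agrees

A naive byte comparison silently reports the two streams as different; getting the right answer requires knowing, for every stream, which encoding it uses, which is exactly the bookkeeping that is too much work in practice.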
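The "yet another character encoding" point about IDNA can be seen directly in Python, whose standard library happens to ship an idna codec; the domain name below is an illustrative example:

    name = "bücher.example"
    print(name.encode("idna"))    # b'xn--bcher-kva.example'
    print(name.encode("utf-8"))   # b'b\xc3\xbccher.example'

The ASCII wire form produced for the DNS differs from the UTF-8 form of the same name, so any software that compares or displays domain names has to know which of the two encodings it is looking at.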