--On Friday, February 3, 2017 14:51 -0500 Viktor Dukhovni <ietf-dane@xxxxxxxxxxxx> wrote:

>...
> Is that right? Thus the verifier would sometimes need to
> convert from U-labels to A-labels (when the localpart is all
> ASCII), and at other times from A-labels to U-labels (when the
> localpart is not all ASCII)...

Viktor, I think there is another issue hidden behind this that is worth mentioning and that interacts with your concern above.

While it may or may not be important for any given protocol in the abstract (it will be for some, but not others), using strings containing non-ASCII characters in ways that interface with users is always going to involve tricky issues that require thought and understanding, not just plugging code points, possibly with an encoding specified, into slots. People who "just" want to be told what to do so they don't need to think about it, or who want to apply a package they don't understand, are sooner or later going to find themselves or their users in trouble, whether the issues are identified as security problems, matching/equivalence errors of various kinds, user confusion due to violation of the law of least astonishment, or something else.

The underlying issues are the result of the wide and very rich diversity of human writing systems and languages -- systems that are diverse enough that almost any simple statement or rule one can come up with will have exceptions. In general, that diversity is something we should celebrate rather than trying to find quick fixes or tricks to get around, not least because those fixes or tricks aren't going to work well for some group of people. Narrow views of the situation just lead to other traps.

In particular, while useful lessons can be learned, one cannot extrapolate from knowledge or experience of Latin-based scripts (even if one knows more than one language that uses them differently) to all others, or from very closely related scripts (e.g., Greek-Latin-Cyrillic, some subsets of Indic (or neo-Brahmi) scripts, or so-called CJK) to writing systems outside those groups, without missing important cases and causing problems elsewhere. Like the kinds of diversity we deal with in some other areas, the differences did not show up overnight. A large fraction of the human population has been creating and practicing them for centuries and, in many cases, tens of centuries.

If IDNA enters the mix, another layer of knowledge and understanding is required. It is actually easier to grasp (or grok) than the above, but may have even greater impact on protocol design. Unlike the above, IDNA is artificial and a recent invention to solve a very specific problem with incremental deployment, a decision most of us, including most of those in the IDN business and those who use non-Latin scripts on a daily basis, think was probably a good idea. People simply need to understand how it works and how it is intended to evolve, with the U-label <-> A-label symmetry and checking requirements as particularly important.

In particular for this case, protocols which reach the user simply need to be ready to handle U-labels and A-labels interchangeably. Because of the combinatorial explosion problem, trying to do that by enumerating the possible FQDNs just won't work -- people may know what, e.g., they intend to push out in email or use in the text of an HTML "a" element, but there are going to be just too many things in the network that will change the form back and forth for their own (perfectly rational) reasons.
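To make that concrete, here is a minimal sketch of the sort of handling I have in mind, written in Python and assuming the third-party "idna" package (an IDNA 2008 implementation); the function names and the choice of A-labels as the comparison form are mine and purely illustrative, not anything a specification mandates:

    # Minimal sketch, assuming the third-party "idna" package (an
    # IDNA 2008 implementation).  Function names and the choice of
    # A-labels as the comparison form are illustrative only.
    import idna

    def to_a_labels(domain: str) -> str:
        """Return the domain with every label in A-label (ACE) form."""
        if domain.isascii():
            # Already ASCII; it may contain A-labels, which are left alone.
            return domain.lower()
        # uts46=True applies the UTS #46 case/compatibility mapping
        # before conversion, so mixed-case U-labels are accepted.
        return idna.encode(domain, uts46=True).decode('ascii')

    def to_u_labels(domain: str) -> str:
        """Return the domain with every A-label decoded to its U-label."""
        return idna.decode(domain)

    def same_domain(a: str, b: str) -> bool:
        """Compare two names on a single canonical (A-label) form."""
        return to_a_labels(a) == to_a_labels(b)

    # e.g. same_domain('пример.example', 'xn--e1afmkfd.example') is True,
    # whichever of the two forms happened to arrive on the wire.

The point is not the particular library, but that the verifier converts once, to one form, and compares there, rather than trying to guess which form it will be handed.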
At least IMO, that makes a very strong argument for protocols defining and using, where possible, a single canonical form and expecting user interfaces to do conversions as needed. You may reasonably disagree with that last conclusion because it is just a protocol design preference, but most of the rest almost certainly must be treated as immutable facts, at least until and unless we all agree to use the same language and the same orthography and writing system for that language.

best,
   john