--On Friday, 11 February, 2005 21:02 -0500 Bruce Lilly <blilly@xxxxxxxxx> wrote:

> While I do not dispute that some mobile devices might use some
> subset of some version of Unicode for text in some languages,
> my point was, in response to John Klensin's "Until and unless
> every one of us has a keyboard that permits easy input of
> every Unicode character", that not only do I not expect to
> have a keyboard permitting *easy* entry (no, that doesn't mean
> "Grafiti" or "Decuma") of *every* Unicode character any time
> soon, I don't expect it *ever*, because the Unicode code space
> is expanding (in contradiction to the original Unicode Design
> Principles) faster than the available memory space on
> low-power, compact, mobile devices.

Bruce (and others),

You can argue and pick at this interminably, but I think you are missing the key point.

There is, IMO, an extremely strong argument for saying "Look, DNS names, and DNs as used in X.509 certs, are ultimately protocol identifiers. Safe and stable operation of the Internet requires that protocol identifiers be written in a small, restricted, generally recognized, and easily distinguishable set of characters. And everyone who has studied which characters to use when the principles of 'protocol identifiers' are applied, including our very internationalization-conscious friends at the ITU, has concluded that the right characters are a subset of those in the Roman-based script family. That subset always seems to be 'without diacritical marks or other embellishments'. It is almost always defined in terms of case-independent matching rules or in terms of only a single case being permitted -- historically more often upper, although there are some substantive arguments for lower."

The choice of Roman characters is ultimately based on the observation that, while there are several _languages_ that are more widespread than English, nothing in the above says anything about English. Those Roman-based characters are, for one reason or another, used, either as a primary or a secondary script, by more languages and people than everything else in the world put together. That contributes significantly to "recognizable", which is an important criterion.

Neither the "protocol parameter" argument, nor the argument that more characters would lead to more opportunities for confusion, came as a surprise to the IETF community within the last week or two. Both arguments were raised, passionately and at great length, when the IDN effort was first coming together. They were raised on the IETF list, on more than one WG list, in BOFs, etc.

There is a second argument that can be made with equal strength. People like to write their names correctly. Inability to do that is a profound source of irritation (at least) and was important enough, even in the 60s, to influence the way characters are handled in important operating systems to this day. More generally, people prefer that the identifiers they pick have mnemonic value to them, and that means the ability to pick those identifiers based on their languages and scripts. Please note that argument applies at the geek interface level; we don't need to get up to the user interface one to make it.
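For concreteness, the "protocol identifier" position sketched above amounts to something like the following check. This is a rough sketch of my own in Python, not text taken from any hostname or IDNA specification; the function names are made up, and the rules shown are simply the traditional LDH ("letters, digits, hyphen") convention with its 63-octet label limit and case-independent matching.

    import string

    # Traditional LDH repertoire: ASCII letters, digits, and hyphen only.
    LDH = set(string.ascii_letters + string.digits + "-")

    def is_ldh_label(label):
        """Rough 'protocol identifier' test for a single DNS label:
        restricted repertoire, bounded length, no leading/trailing hyphen."""
        return (0 < len(label) <= 63
                and all(ch in LDH for ch in label)
                and not label.startswith("-")
                and not label.endswith("-"))

    def labels_match(a, b):
        """Case-independent comparison, as DNS matching has always been."""
        return a.lower() == b.lower()

    print(is_ldh_label("example"))             # True
    print(is_ldh_label("b\u00fccher"))         # False: outside the repertoire
    print(labels_match("Example", "EXAMPLE"))  # True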
When we do get to the user interface and start worrying about non-expert would-be users of the Internet, we immediately encounter some very passionate, and almost certainly correct, arguments that users should be able to deal with, and navigate, the Internet and do so completely in their own languages and scripts. The problems with that argument, including opportunities for deliberate or accidental confusion among similar-looking characters, also come as no surprise to the IETF. Like the "protocol parameter" position, they were discussed openly and at great length, with examples, many years ago.

With both of those arguments in hand, and with the problems with each at least moderately well understood, the IETF (or at least everyone who could be persuaded to pay attention) made a decision. That decision, made years ago and under considerable marketplace pressure, was that, for the particular set of issue areas that included DNS names, the second set of arguments -- that accessibility in "native scripts" (and Unicode in particular) was more important than the "protocol identifier" argument -- was the dominant one and that we needed to do this. By implication at least, we decided that we would need to accept and understand the problems that decision caused and deal with them.

There was another group of questions, which is the more complicated piece of the issue. The obvious way to get the right functionality is not necessarily the best one. There is a nasty tradeoff between techniques that can, at least in theory, be deployed quickly and ones that are likely to take longer but might be more satisfactory in the long term. There is another nasty tradeoff between making something work well for the people who know that they need it and are willing to make an investment in conversion and upgrading of systems to get it versus making it work reasonably well (and perhaps more quickly) for everyone. Again, the IETF made decisions on those points. My personal view is that some of those decisions were not especially well-informed and may even have been wrong, but they were decisions made in the community and made after the dissenting views were strongly expressed.

So, today, we've got IDNs and IDNA. Even if one believes that the _only_ reason for standardizing them is to provide a common, interoperable way of doing something that people will clearly do somehow, the standards seem justified. (For the record, I do not subscribe to the "that is the only reason for a standard" position in this case.) I see no way to go back, even if we wanted to, and reestablish the "protocol parameter" argument for the DNS.

So we are down to some serious and important questions -- but, again, ones that are neither new nor surprising. In particular, since you and others have picked up bits from my earlier notes and interpreted them (I'm sure unintentionally) differently from what I intended:

(i) The observation about YAH00 versus yah00 wasn't intended to say that a lower case test would solve very many problems. It was only to point out that the particular YAH00 example wasn't a particularly good one, since it could be detected by the most trivial of tests. I agree that test is not likely to be effective against a determined attacker or more clever examples.
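For illustration only, here is a toy sketch of that "most trivial of tests", assuming the YAH00 example stands for digit zeros substituted for the letter O: in upper case the substitution can pass at a glance, and simply displaying the lower-cased form next to the original makes it obvious.

    def show_for_inspection(label):
        """Display the lower-cased form next to the original: 'YAH00' can
        pass for 'YAHOO' at a glance, but 'yah00' is visibly not 'yahoo'."""
        return "%r -> %r" % (label, label.lower())

    print(show_for_inspection("YAH00"))  # 'YAH00' -> 'yah00'
    print(show_for_inspection("YAHOO"))  # 'YAHOO' -> 'yahoo'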
(ii) I have never argued that the "one label one script" requirement that Mark Davis and others have suggested is without value. My comment was only that a requirement of that type was going to be a little harder to apply --in many cases and consistently-- than a casual reader might assume. None of this is easy. Life is hard.

(iii) The observation about "...easy input of every Unicode character" was not, in any respect, an attempt to get us back to protocol identifiers. It was, instead, about one of the more subtle questions associated with the IDNA story. IDNA's most passionate advocates are convinced that, once a sufficient deployment level is achieved, no one will need to look at the internal, "punycode" form of IDNs, but will see only the "native character" form. Others of us are convinced that user-visible punycode will be around forever, just as user-visible URLs will be. We believe that will be driven partially by security concerns (I can more accurately compare two punycode strings by eyeball than I can a pair of arbitrary "native character" strings). We believe that the difficulties you might have reading an IRI that contains an unfamiliar script out of a printed article or sign and typing it into a computer will cause you to wish that the punycode representation were readily available, because "recognize the character and then figure out how to key it in" is likely to be an insurmountable pair of problems. The issue isn't one of the expansion of Unicode or how many keystrokes are needed: if you can identify the character, any BMP Unicode character can be keyed in with a little over four keystrokes, and non-BMP characters don't take many more (the "little" is determined by whatever you need to do to indicate that characters are being specified by offset); a short sketch of the punycode form and the keying-by-offset idea follows below. The issue is recognizing the character accurately in the first place. The cell phone story is equally unimportant because the first step in that story is identifying the right language so as to permit you to pick up the right phone (or switch it into the right state). Language identification may or may not be harder than character identification, but it isn't likely to be easy in the general case. Without language identification, you are back to character identification and four (or five or six) digit offsets.

(iv) The TLD managers worldwide are not crying "please protect us from IDNs", and this latest "discovery" is unlikely to change that. What they are saying is "we want and need to implement IDNs, please help us understand how to do that safely". The answer to that question doesn't require "regulation" from on high. It does require getting and sharing a much more subtle understanding of the issues, options, and tools than we have so far been successful in communicating. IMO, the IETF should be putting energy into those issues and tools --and into alternatives to the use of DNS names (with IDNA) when that is appropriate. But efforts to move in those directions have gotten zero traction. _That_ is, IMO, our problem, not whether we can turn back the clock and make a "protocol parameter" decision (or turn it back even further and reduce the number of scripts and characters in the world by several orders of magnitude).

This isn't easy. It is never going to be easy. It poses opportunities for various kinds of nasty behavior that are harder to detect and defeat than they would be in a hostname/LDH-only world. The easiest way to get ourselves into trouble is probably to pretend it is easy and ignore the hard, risky, or edge cases.
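Here is that sketch -- a minimal illustration, using Python's built-in IDNA/punycode codec and an arbitrary example label ("bücher") that is not drawn from anything above -- of the two faces of an IDN label and of producing a character from its hex code point offset:

    label = "b\u00fccher"                  # arbitrary example label, 'bücher'

    ace = label.encode("idna")             # ASCII-compatible ("punycode") form
    print(ace)                             # b'xn--bcher-kva'
    print(ace.decode("idna"))              # and back to the native-character form

    # "A little over four keystrokes": a BMP character from its hex offset.
    print(chr(0x00FC))                     # 'ü' -- four hex digits plus a prefix
    print(chr(0x1F600))                    # a non-BMP character needs a digit more

The mechanics are trivial; as argued above, the hard part is recognizing the character well enough to know which offset to key in the first place.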
We need to learn to cope: wishing for an easier and more homogeneous world or easier times generally, or wishing that an irreversible decision be reversed, won't get us much of anywhere, no matter how passionately those wishes are made. And, like it or not, we are at least as much at risk of fragmenting the Internet by appearing to say "no" to some languages or scripts as we are from confusion among characters in well-thought-out internationalization efforts.

    john

_______________________________________________
Ietf@xxxxxxxx
https://www1.ietf.org/mailman/listinfo/ietf