Re: IDNA and U+08A1 and related cases (was: Re: Barry Leiba's Discuss on draft-ietf-json-i-json-05: (with DISCUSS and COMMENT))

John C Klensin <john-ietf@xxxxxxx> · Mon, 26 Jan 2015 18:08:40 -0500

--On Monday, January 26, 2015 12:13 -0600 Nico Williams
<nico@xxxxxxxxxxxxxxxx> wrote:

> On Mon, Jan 26, 2015 at 07:35:42AM -0800, Asmus Freytag wrote:
>> On 1/26/2015 1:12 AM, Nico Williams wrote:
>> > As far as I'm concerned it's clear that the correct way to
>> > handle these cases is: as confusables.  Is this wrong?
>> 
>> I basically agree with you.
>> 
>> I'm making a further distinction between confusables by
>> accident and confusables by intent, and am advocating that
>> the latter can be handled more explicitly. But basically, yes.
> 
> I'm not sure that the cause of the confusability makes any
> difference. It's there.  Once it's there we have to deal.

I don't know if this is related to what Asmus was thinking, but
I think there are two kinds of intent.  One is intent in the
Unicode coding, where a deliberate decision was made to assign
different code points to glyphically-identical characters
("homographs" or "homoglyphs") within the same script.  For that
case, we, or at least UTC, presumably know what the characters
are and could, at least in principle, make a list of them or
assign a special property value to them.  In the grand scheme of
things, they should be very easy to identify even if what to do
once they are identified might be controversial.   For example,
if the only tool we had was to ban one or another code point or
combining sequence of a "confusable" pair, it may not be obvious
which one to prohibit.   Taking U+08A1 as an example because it
is the case that started this, if one were not constrained by
stability rules or the like, it would not be clear whether it
would be better to allow it (because it obeys the rules that
Asmus summarized in yesterday afternoon's note and is more
compact) or the combining sequence \u0628\u0654 (because it is
more likely to be used, expected, or keyed in by Arabic speakers or
users of languages written in Perso-Arabic variations and there
are many more people in those two groups than there are writers
of Fula in Arabic script).
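
To make the U+08A1 case concrete, here is a minimal sketch in
Python (standard-library unicodedata only).  It shows that
normalization does not unify the precomposed code point with the
combining sequence, and then what a list of same-script
homographs could be used for.  The one-entry table and the
function names are illustrative assumptions, not an existing
Unicode property or anything IDNA specifies, and the choice of
the combining sequence as the folding target is arbitrary.

```python
import unicodedata

PRECOMPOSED = "\u08A1"      # ARABIC LETTER BEH WITH HAMZA ABOVE
SEQUENCE = "\u0628\u0654"   # ARABIC LETTER BEH + ARABIC HAMZA ABOVE


def canonically_equivalent(a: str, b: str) -> bool:
    """True if the two strings are identical after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Hypothetical table of deliberately encoded same-script homographs.
# A real list or property value would have to come from UTC; this
# single entry exists only for illustration.
SAME_SCRIPT_HOMOGRAPHS = {PRECOMPOSED: SEQUENCE}


def fold_homographs(label: str) -> str:
    """Normalize to NFC, then replace each listed code point with its
    look-alike sequence.  Intended for comparison only, not for storage."""
    normalized = unicodedata.normalize("NFC", label)
    return "".join(SAME_SCRIPT_HOMOGRAPHS.get(ch, ch) for ch in normalized)


def same_script_confusable(a: str, b: str) -> bool:
    """True if two labels collide once listed homographs are folded."""
    return fold_homographs(a) == fold_homographs(b)


if __name__ == "__main__":
    # U+0623 has a canonical decomposition, so NFC unifies it with
    # ALEF + HAMZA ABOVE; U+08A1 has none, so NFC leaves both forms apart.
    print(canonically_equivalent("\u0623", "\u0627\u0654"))
    print(canonically_equivalent(PRECOMPOSED, SEQUENCE))
    print(same_script_confusable(PRECOMPOSED, SEQUENCE))
```

Folding for comparison sidesteps the question of which member
of the pair to prohibit, but only for a registry that checks new
labels against existing ones; it says nothing about lookup.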

The other sort of intent involves a would-be attacker
deliberately trying to create confusion, to mislead the user, or
to create distrust of the identifier system (in the IDNA case,
the DNS or IDNs generally).  There is nothing accidental about
those cases and they are difficult precisely because none of the
fine distinctions we are making about the differences among "the
same glyph (grapheme cluster)", "the same (or different)
abstract character", and "things that look alike under some set
of circumstances that an attacker might be able to control or
exploit" matter to such an attacker.

And then there are accidents, either of cross-script
similarities or identities (because of historical copying, some
may not really be accidental) or of user perception, arising
from the combination of how the characters appear and what a
given user expects to see.

I think the first of these is (or should be) much easier to
handle in a systematic way than the other two, but that, if we
want internationalized identifiers, we'd better be able to do
better with the others than trying to educate users to be
really, really, careful, perhaps to the point of paranoia ... 

--On Monday, January 26, 2015 12:09 -0600 Nico Williams
<nico@xxxxxxxxxxxxxxxx> wrote:

>...
>> Yes, indeed. Which is why, for years, this was a requirement
>> of IDNA enablement in Firefox. Only the proliferation of
>> registries put an end to our enforcement of that policy
>> programmatically. We (or at least, I) now intend to enforce
>> it via the media if there is ever a problem caused by a
>> registry allowing one of its customers to attack another one
>> by registering a homograph.
> 
> Right, if a registry screws this up, their reputation has to
> suffer.
> 
> (The same goes for CAs, no?  Though of course DNS has to come
> first.)

While I'm certainly in favor of shaming evildoers, keep two
things in mind.  First, while the number of distinct registry
operators is much smaller, the number of TLDs may soon exceed
the number of active CAs.  The total number of zones and zone
administrators probably deserves terms like "astronomical".
Perhaps unlike the CA environment (or perhaps not), there is a
fairly impressive history of registrars and retailers who are
willing to delegate obviously-deceptive names if doing so
improves their bottom line even slightly and who are quite happy
to hide the names and contact information of their customers.
If we don't do the best we can to control that situation, it
more or less invites regulator intervention that could fragment
the DNS namespace or worse.  I certainly have days, and assume
that Gerv does too, when I'd be delighted to see those
regulators and their law enforcement associates show up.  But
history suggests that, when they do, they are likely to be
heavy-handed enough to be very bad for the DNS and the Internet.

    john

> 
> The details of how a confusable came about are certainly
> interesting, but they don't really matter to how we handle
> them, right?
> 
> Nico