[Last-Call] Re: Opsdir last call review of draft-klensin-idna-rfc5891bis-07

John C Klensin <john-ietf@xxxxxxx> · Thu, 24 Oct 2024 14:42:02 -0400

--On Tuesday, October 22, 2024 13:20 -0700 Linda Dunbar via
Datatracker <noreply@xxxxxxxx> wrote:

> Reviewer: Linda Dunbar
> Review result: Has Nits
> 
> I have reviewed this document as part of the Ops area directorate's
> ongoing effort to review all IETF documents being processed by the
> IESG.  These comments were written primarily for the benefit of the
> Ops area directors. Document editors and WG chairs should treat
> these comments just like any other last-call comments.
> 
> The document provides valuable guidance but also highlights
> significant operational complexity for DNS registries. Ensuring
> compliance with the recommendations, preventing security risks like
> homograph attacks, and handling Unicode updates will require
> considerable operational resources. Can the document recommend some
> tools to alleviate the operation complexity such as automating
> Domain Name validation and filtering? or tools to detect and prevent
> Homograph Attack Detection?

Linda,

Thanks for the review.  

TL;DR-type answer:  That "significant operational complexity"
originates, not in this document or other parts of IDNA, but in the
incredible diversity and variations in human languages, writing
systems, and presentation forms.   Beyond using per-language and
per-script tables of characters such as those mentioned in the draft,
automated tools in this area are beyond the present state of the art
and likely to remain so for a long time.

More detailed response if you (or others) are interested and in the
hope that we can use the discussion about this document to improve
understanding in the IETF:

I don't think this explanation belongs in the document but, if others
disagree, the idea is not horrifying. FWIW, most of the comments
below can be extrapolated to almost any use of a full range of
characters in any sort of identifier, not just non-ASCII DNS labels.

As you have understood, the whole point of the document was to
provide the guidance you mention and to make the point about the
complexity.  The question of automated tools is an interesting one,
"interesting" in a way that fascinates some of us and can be a
massive pain in a sensitive part of the anatomy to others, sometimes
even the same people on different days.  It is possible to put tables
together that identify combinations of characters that are of
relatively low risk (that is what, on a script by script basis, the
ICANN LGR efforts are about, especially if combined with a "don't mix
scripts" rule).  The document identifies and points to those efforts.
But, if one wants to go a step further into automation, it is
necessary to move beyond code points and characters and into type
styles and fonts, things over which someone creating a DNS label has
no control.
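
For anyone who wants to see what the code-point-level part of that
looks like, here is a minimal sketch of a "don't mix scripts" check.
It is an illustration only, not anything the draft or the ICANN LGR
work specifies.  The function names are made up for the example, and
the first word of each Unicode character name is used as a rough
stand-in for the script, because the Python standard library does not
expose the Unicode Script property.  A real registry tool would need
the actual script data and per-script tables.

```python
import unicodedata

# ASSUMPTION: ASCII digits and the hyphen are treated as script-neutral.
NEUTRAL = set("0123456789-")


def scripts_in_label(label):
    """Return the set of approximate script names used in a label.

    The script is approximated by the first word of the Unicode
    character name (e.g. "LATIN", "CYRILLIC"). This is a heuristic,
    not the real Unicode Script property.
    """
    scripts = set()
    for ch in label:
        if ch in NEUTRAL:
            continue
        name = unicodedata.name(ch, "")
        scripts.add(name.split(" ")[0] if name else "UNKNOWN")
    return scripts


def is_single_script(label):
    """True if the label draws on at most one (approximate) script."""
    return len(scripts_in_label(label)) <= 1


# Latin "paypal" with both "a" characters replaced by Cyrillic U+0430.
mixed = "p\u0430yp\u0430l"

print(is_single_script("paypal"))  # True
print(is_single_script(mixed))     # False: Latin and Cyrillic together
print(is_single_script("abc1"))    # True: the l/1 problem is invisible here
```

Note that "abc1" and "abcl" both pass: a check of this kind says
nothing about whether two labels look alike once they are rendered.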

Perhaps a Latin script example that most of us have encountered at
one time or another will illustrate the problem.  Depending on the
type style ("font", more or less) chosen, the size of the type, and
maybe even the contrast between the letters and the background
against which they are displayed, lower-case "L" and numeral "1"
either look alike or they don't.  So, for reasons that no amount of
looking at Unicode code points or Unicode-based tables can help with,
"abc1" and "abcl" are either homographs or they are not.
That is usually considered a rather different case from the notorious
"paypal" example (cited in the draft) where the Latin-script lower
case "a" characters can be maliciously replaced by Cyrillic
characters that, with most choices of type styles / fonts, usually
look identical.  One can substitute Cyrillic characters that look
more or less like "p" and "y" too, and maybe even the "l" (using
digit-one if needed), but those substitutions depend on more
assumptions about choices of type styles in order to look alike (or
to avoid doing so).  For example, does the Latin-script "p" look like
the Cyrillic-script "р" (U+0440)?   Maybe.  How about Latin "y" and
Cyrillic "ч" (U+0447)?  More likely to look different with typical
choices of fonts, but not necessarily so with all choices.  And, to
further complicate this, there is a human perception problem: most
readers who are not expecting these issues and are not sensitive to
them will see "рачраӏ", perceive "paypal", and move on.
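
To make the code-point side of that concrete, the following is a
small sketch that lists what is actually in the two strings.  It
assumes the lookalike string is U+0440, U+0430, U+0447, U+0440,
U+0430 followed by U+04CF (palochka) for the final character; the
helper names are hypothetical, and the "xn--" form is produced with
Python's generic Punycode codec without any IDNA validity checking.

```python
import unicodedata

LATIN = "paypal"
# ASSUMPTION: the all-Cyrillic lookalike uses these code points:
# U+0440, U+0430, U+0447, U+0440, U+0430, U+04CF.
LOOKALIKE = "\u0440\u0430\u0447\u0440\u0430\u04cf"


def describe(label):
    """Return one 'U+XXXX NAME' string per character in the label."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}" for ch in label]


def to_a_label(u_label):
    """Prefix the Punycode form with 'xn--'; performs no IDNA checks."""
    return "xn--" + u_label.encode("punycode").decode("ascii")


# Side-by-side listing of the code points behind each string.
for latin, other in zip(describe(LATIN), describe(LOOKALIKE)):
    print(f"{latin:<30} | {other}")

print(LATIN == LOOKALIKE)      # False: no code point is shared
print(to_a_label(LOOKALIKE))   # an ASCII form that is visibly not "paypal"
```

Software has no difficulty telling the two strings apart; the
difficulty is entirely in what a person sees once they are rendered.
Because every character in the lookalike is Cyrillic, a "don't mix
scripts" rule on its own would not reject it.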

So this is a variation on the more general problem with
spoofing-based attacks (ASCII or otherwise) and even the potential
for non-malicious confusion: either we need smarter, better-educated,
and more careful users or we need to cut the problem off at the
source.  In this case, that source probably means registries that
take on the operational responsibilities to which you point.

Could one build an AI that would be trained, not just on equivalents
of the ICANN tables and PRECIS rules, but on a very large selection
of the type styles available for each of the scripts and languages
that could potentially be used in domain name labels?  I think
probably "yes", but trying to do so would not be my idea of
fun.  In practice, probably not feasible and certainly not something
about which the IETF could reasonably say "go use tool XYZ".

Thanks again and best regards,
   john
