--On Tuesday, October 22, 2024 13:20 -0700 Linda Dunbar via Datatracker <noreply@xxxxxxxx> wrote:

> Reviewer: Linda Dunbar
> Review result: Has Nits
>
> I have reviewed this document as part of the Ops area directorate's
> ongoing effort to review all IETF documents being processed by the
> IESG. These comments were written primarily for the benefit of the
> Ops area directors. Document editors and WG chairs should treat
> these comments just like any other last-call comments.
>
> The document provides valuable guidance but also highlights
> significant operational complexity for DNS registries. Ensuring
> compliance with the recommendations, preventing security risks like
> homograph attacks, and handling Unicode updates will require
> considerable operational resources. Can the document recommend some
> tools to alleviate the operation complexity such as automating
> Domain Name validation and filtering? or tools to detect and prevent
> Homograph Attack Detection?

Linda,

Thanks for the review.

TL;DR-type answer: That "significant operational complexity" originates, not in this document or other parts of IDNA, but in the incredible diversity and variations in human languages, writing systems, and presentation forms. Beyond using per-language and per-script tables of characters such as those mentioned in the draft, automated tools in this area are beyond the present state of the art and likely to remain so for a long time.

More detailed response if you (or others) are interested and in the hope that we can use the discussion about this document to improve understanding in the IETF: I don't think this explanation belongs in the document but, if others disagree, the idea is not horrifying. FWIW, most of the comments below can be extrapolated to almost any use of a full range of characters in any sort of identifier, not just non-ASCII DNS labels.
As you have understood, the whole point of the document was to provide the guidance you mention and to make the point about the complexity. The question of automated tools is an interesting one, "interesting" in a way that fascinates some of us and can be a massive pain in a sensitive part of the anatomy to others, sometimes even the same people on different days.

It is possible to put tables together that identify combinations of characters that are of relatively low risk (that is what, on a script by script basis, the ICANN LGR efforts are about, especially if combined with a "don't mix scripts" rule). The document identifies and points to those efforts. But, if one wanted to go a step further into automation, it is necessary to move beyond code points and characters and into type styles and fonts, things over which someone creating a DNS label has no control.

Perhaps a Latin script example that most of us have encountered at one time or another will illustrate the problem. Depending on the type style ("font", more or less) chosen, the size of the type, and maybe even the contrast between the letters and the background against which they are displayed, lower-case "L" and numeral "1" either look alike or they don't. So, for reasons for which simply looking at Unicode code points or Unicode-based tables is of no help at all, "abc1" and "abcl" are either homographs or they are not.

That is usually considered a rather different case from the notorious "paypal" example (cited in the draft) where the Latin-script lower case "a" characters can be maliciously replaced by Cyrillic characters that, with most choices of type styles / fonts, usually look identical. One can substitute Cyrillic characters that look more or less like "p" and "y" too, and maybe even the "l" (using digit-one if needed), but those substitutions require more assumptions about choices of type styles to avoid looking alike (or to cause it).
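To make the limits of a code-point-only check concrete, here is a minimal Python sketch of a "don't mix scripts" test. It is an illustration under stated assumptions, not a tool the draft or the IETF recommends: the helper names `scripts_in_label` and `is_mixed_script` are hypothetical, and the script is guessed from the first word of each character's Unicode name because the Python standard library does not expose the Unicode Script property. A real registry would rely on per-script tables such as the ICANN LGR work.

```python
import unicodedata


def scripts_in_label(label):
    """Guess the scripts used in a label (hypothetical helper).

    Assumption: the first word of the Unicode character name
    ("LATIN", "CYRILLIC", ...) stands in for the Script property.
    Digits and hyphens are skipped because they are shared by all scripts.
    """
    scripts = set()
    for ch in label:
        if ch.isdigit() or ch == "-":
            continue
        name = unicodedata.name(ch, "")
        scripts.add(name.split(" ")[0] if name else "UNKNOWN")
    return scripts


def is_mixed_script(label):
    """True if the label draws letters from more than one script."""
    return len(scripts_in_label(label)) > 1


latin = "paypal"
# Latin "p", "y", "l" with Cyrillic U+0430 in place of each "a".
mixed = "p\u0430yp\u0430l"
# Every letter replaced by a Cyrillic lookalike (U+0440, U+0430, U+0447, U+04CF).
all_cyrillic = "\u0440\u0430\u0447\u0440\u0430\u04cf"

for label in (latin, mixed, all_cyrillic, "abc1", "abcl"):
    print(ascii(label), sorted(scripts_in_label(label)), is_mixed_script(label))
```

The check flags the mixed Latin/Cyrillic label, but it passes the all-Cyrillic lookalike and cannot say anything about "abc1" versus "abcl", since whether those look alike depends on the font and not on the code points.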
For example, does the Latin-script "p" look like the Cyrillic-script "р" (U+0440)? Maybe. How about Latin "y" and Cyrillic "ч" (U+0447)? More likely to look different with typical choices of fonts, but not necessarily so with all choices. And, to further complicate this, there is a human perception problem: if most readers who are not expecting these issues and not sensitive to them see "рачраӏ", they will perceive "paypal" and move on.

So this is a variation on the more general problem with spoofing-based attacks (ASCII or otherwise) and even the potential for non-malicious confusion: either we need smarter, better-educated, and more careful users or we need to cut the problem off at the source. In this case, that source probably means registries who take on the operational responsibilities to which you point.

Could one build an AI that would be trained, not just on equivalents of the ICANN tables and PRECIS rules, but on a very large selection of the type styles available for each of the scripts and languages that could potentially be used in domain name labels? I think probably "yes", but trying to do so would not be my idea of fun. In practice, probably not feasible and certainly not something about which the IETF could reasonably say "go use tool XYZ".

Thanks again and best regards,
   john

--
last-call mailing list -- last-call@xxxxxxxx
To unsubscribe send an email to last-call-leave@xxxxxxxx