----- Original Message -----
From: "Barry Leiba" <barryleiba@xxxxxxxxxxxx>
Sent: Tuesday, June 05, 2018 6:50 AM

On Mon, Jun 4, 2018 at 11:10 PM, Nico Williams <nico@xxxxxxxxxxxxxxxx> wrote:
>> We're in a space where the evaluation of A==B depends on more than
>> the bit strings A and B. Your post about form-insensitive filename
>> comparisons is a case in point, although I don't pretend to understand
>> it. OK, we can argue whether that's a dark art or simply complicated
>
>    form_insensitive_strcmp(a, b) == memcmp(normalize(a), normalize(b))
>
> Except that actually one can greatly optimize this to avoid most of the
> compute and memory cost of normalization.
>
> To see why consider comparing my first name as I usually write it
> (Nicolas) vs. how it should be written (Nicolás). The two strings
> should compare as not equivalent. But the two ways to write the second
> form (with the ´ precomposed vs. decomposed) should compare as
> equivalent (because they are).

But there's one of the things that makes this a complicated topic:

- we say that "nicolas" is not equivalent to "nicolás"
- but we say that "nicolás" *is* equivalent to "nicola´s", and we
  handle this using normalization
- does that mean that it's OK to have "nicolas" and "nicolás" as two
  different usernames assigned to two different users?
- if yes, how do we deal with the human interface issues involved?
  What happens if the human identified as "nicolás" uses an input
  mechanism that doesn't have a way to enter "á"? How can he log in?
- if no, how do we make sure (in an automated way) that we don't make
  that assignment?
- does the answer change if "nicolás" is a domain name instead of a
  username?
- does the answer change if "nicolás" is a *password*?
- and what about "nicolàs"? and "nicolâs"? and "nicoläs"?
- what about "nicolаs" (that's a Cyrillic character in the penultimate
  position)?
- what about "nicolαs" (that's a Greek character in the penultimate
  position)?
- what about other Unicode characters that look like "a", either
  exactly (as with Cyrillic) or closely (as with Greek)?
- what about handling of "ä" vs "ae"? Do we want to avoid assigning
  "käse" and "kaese" as distinct usernames? Does the answer to this
  differ depending upon whether the language is German (where using
  "ae" to represent "ä" is common) or Swedish (where it is not)?

Now extend this to the many other characters that can look similar
(say, "n" vs "ñ" in Spanish). Extend it to other language-related
issues ("i" vs "ı" vs "İ" vs "I" in Turkish; all the character
variants in Arabic).

These are only some of the reasons it's difficult. And the number of
people who stand up and say, "oh, just <do this> and the problem is
solved," demonstrates that too too too many people *think* they
understand... and don't.

<tp>

Barry,

This is a cut-out-and-keep e-mail that I shall still be referring to in
10 years' time because it summarises the problems so beautifully. It is
also the kind of data that led me upthread to assert, contentiously,
that only Europeans were likely to know what it was about, since they
had been living with it for decades in a way that most Americans had
not. I mentioned CJK, but John rightly pointed out that, worldwide, it
was far worse, with right-to-left, vertical and so on, and so most of
the skills may now lie further afield.

Two thoughts. One is that your e-mail displayed superbly (the only
glitch being that my MUA did not differentiate the Cyrillic character),
so I looked at the encoding

    Content-Type: text/plain; charset="UTF-8"
    Content-Transfer-Encoding: quoted-printable

so that is something we got brilliantly right a long time ago. (The bad
news is that I get an ever-growing number of messy e-mails from some
e-mail ids that use complicated Unicode characters instead of ASCII
punctuation, a sort of single quote being the commonest, so the
technology, for me, gets misused.)

My second thought is that much has been done in the IETF on security in
recent times, but have we done enough to at least publicise, if not
eliminate, the scope for evil actors to exploit confusable and suchlike
characters by saying that they SHOULD NOT be used anywhere where it
matters for security - people SHOULD NOT be handed the rope with which
to hang themselves on a plate :-) - I suspect not.

Tom Petch

Barry
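
As an illustration of the normalization point in the quoted exchange
above, here is a minimal Python sketch using only the standard
unicodedata module. It assumes NFC as the normalization form (the
thread does not name one), and it uses U+0301 COMBINING ACUTE ACCENT
for the decomposed spelling. The helper names form_insensitive_equal
and scripts_used are hypothetical. The script check is a crude
heuristic based on character names, not the Unicode Script property or
a confusables table.

```python
import unicodedata


def form_insensitive_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (assumed form)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def scripts_used(s: str) -> set:
    """Guess the scripts in s from the first word of each letter's
    Unicode character name. Heuristic only: this is not the Unicode
    Script property and will miss many confusable cases."""
    scripts = set()
    for ch in s:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


plain = "nicolas"
precomposed = "nicol\u00e1s"   # LATIN SMALL LETTER A WITH ACUTE
decomposed = "nicola\u0301s"   # "a" followed by COMBINING ACUTE ACCENT
cyrillic = "nicol\u0430s"      # CYRILLIC SMALL LETTER A in place of "a"

print(form_insensitive_equal(precomposed, decomposed))  # True
print(form_insensitive_equal(plain, precomposed))       # False
print(form_insensitive_equal(plain, cyrillic))          # False
print(sorted(scripts_used(cyrillic)))                   # ['CYRILLIC', 'LATIN']
```

Normalization settles only the second bullet in the list. The Cyrillic
and Greek look-alikes still compare as different strings, and the
mixed-script check merely flags them. Whether such names may be
assigned at all remains the policy question the list raises.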