On Mon, Jun 4, 2018 at 11:10 PM, Nico Williams <nico@xxxxxxxxxxxxxxxx> wrote:
>> We're in a space where the evaluation of A==B depends on more than
>> the bit strings A and B. Your post about form-insensitive filename
>> comparisons is a case in point, although I don't pretend to understand
>> it. OK, we can argue whether that's a dark art or simply complicated
>
> form_insensitive_strcmp(a, b) == memcmp(normalize(a), normalize(b))
>
> Except that actually one can greatly optimize this to avoid most of the
> compute and memory cost of normalization.
>
> To see why consider comparing my first name as I usually write it
> (Nicolas) vs. how it should be written (Nicolás). The two strings
> should compare as not equivalent. But the two ways to write the second
> form (with the ´ precomposed vs. decomposed) should compare as
> equivalent (because they are).

But there's one of the things that makes this a complicated topic:

- we say that "nicolas" is not equivalent to "nicolás"
- but we say that "nicolás" *is* equivalent to "nicola´s", and we
  handle this using normalization
- does that mean that it's OK to have "nicolas" and "nicolás" as two
  different usernames assigned to two different users?
- if yes, how do we deal with the human interface issues involved?
  What happens if the human identified as "nicolás" uses an input
  mechanism that doesn't have a way to enter "á"? How can he log in?
- if no, how do we make sure (in an automated way) that we don't make
  that assignment?
- does the answer change if "nicolás" is a domain name instead of a
  username?
- does the answer change if "nicolás" is a *password*?
- and what about "nicolàs"? and "nicolâs"? and "nicoläs"?
- what about "nicolаs" (that's a Cyrillic character in the penultimate
  position)?
- what about "nicolαs" (that's a Greek character in the penultimate
  position)?
- what about other Unicode characters that look like "a", either
  exactly (as with Cyrillic) or closely (as with Greek)?
- what about handling of "ä" vs "ae"? Do we want to avoid assigning
  "käse" and "kaese" as distinct usernames? Does the answer to this
  differ depending upon whether the language is German (where using
  "ae" to represent "ä" is common) or Swedish (where it is not)?

Now extend this to the many other characters that can look similar
(say, "n" vs "ñ" in Spanish). Extend it to other language-related
issues ("i" vs "ı" vs "İ" vs "I" in Turkish; all the character
variants in Arabic).

These are only some of the reasons it's difficult. And the number of
people who stand up and say, "oh, just <do this> and the problem is
solved," demonstrates that too too too many people *think* they
understand... and don't.

Barry
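
For illustration only, the first few distinctions in the list above can
be sketched with Python's standard unicodedata module. This is a minimal
sketch, not the optimized comparison mentioned in the quoted text: the
helper name form_insensitive_equal is hypothetical, the choice of NFC is
an assumption (NFD would give the same equality results for these
inputs), and the decomposed form is assumed to use U+0301 COMBINING
ACUTE ACCENT rather than the spacing "´" shown in the message.

```python
import unicodedata


def form_insensitive_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


plain = "nicolas"
precomposed = "nicol\u00e1s"   # "á" as the single code point U+00E1
decomposed = "nicola\u0301s"   # "a" followed by U+0301 COMBINING ACUTE ACCENT
cyrillic = "nicol\u0430s"      # U+0430 CYRILLIC SMALL LETTER A, looks like "a"

# The two spellings of "nicolás" differ as code point sequences
# but compare equal once both are normalized.
print(precomposed == decomposed)                         # False
print(form_insensitive_equal(precomposed, decomposed))   # True

# Normalization does not remove accents, so these stay distinct.
print(form_insensitive_equal(plain, precomposed))        # False

# Normalization does not map look-alike letters from other scripts
# either, so the Cyrillic variant is a different string.
print(form_insensitive_equal(plain, cyrillic))           # False
```

Normalization settles only the precomposed vs. decomposed question. The
policy questions in the list (accented vs. unaccented names, look-alike
characters, "ä" vs. "ae", Turkish casing) are outside what it decides.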