Re: Possible BofF question -- I18n

Barry Leiba <barryleiba@xxxxxxxxxxxx> · Tue, 5 Jun 2018 01:50:11 -0400

On Mon, Jun 4, 2018 at 11:10 PM, Nico Williams <nico@xxxxxxxxxxxxxxxx> wrote:
>> We're in a space where the evaluation of A==B depends on more than
>> the bit strings A and B. Your post about form-insensitive filename
>> comparisons is a case in point, although I don't pretend to understand
>> it. OK, we can argue whether that's a dark art or simply complicated
>
>   form_insensitive_strcmp(a, b) == memcmp(normalize(a), normalize(b))
>
> Except that actually one can greatly optimize this to avoid most of the
> compute and memory cost of normalization.
>
> To see why consider comparing my first name as I usually write it
> (Nicolas) vs.  how it should be written (Nicolás).  The two strings
> should compare as not equivalent.  But the two ways to write the second
> form (with the ´ precomposed vs. decomposed) should compare as
> equivalent (because they are).

But that's one of the things that makes this a complicated topic:

- we say that "nicolas" is not equivalent to "nicolás"
- but we say that "nicolás" *is* equivalent to "nicola´s", and we
handle this using normalization
- does that mean that it's OK to have "nicolas" and "nicolás" as two
different usernames assigned to two different users?
- if yes, how do we deal with the human interface issues involved?
What happens if the human identified as "nicolás" uses an input
mechanism that doesn't have a way to enter "á"?  How can he log in?
- if no, how do we make sure (in an automated way) that we don't make
that assignment?
- does the answer change if "nicolás" is a domain name instead of a username?
- does the answer change if "nicolás" is a *password*?
- and what about "nicolàs"?  and "nicolâs"?  and "nicoläs"?
- what about "nicolаs" (that's a Cyrillic character in the penultimate
position)?
- what about "nicolαs" (that's a Greek character in the penultimate position)?
- what about other Unicode characters that look like "a", either
exactly (as with Cyrillic) or closely (as with Greek)?
- what about handling of "ä" vs "ae"?  Do we want to avoid assigning
"käse" and "kaese" as distinct usernames?  Does the answer to this
differ depending upon whether the language is German (where using "ae"
to represent "ä" is common) or Swedish (where it is not)?
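
To make the first few bullets concrete, here is a minimal sketch in
Python using the standard unicodedata module.  The helper name
form_insensitive_equal is made up for illustration, and the choice of
NFC is an assumption (any single normalization form would show the
same thing).  It shows that normalization settles the precomposed vs.
decomposed case and nothing else: the unaccented, Cyrillic, and Greek
variants all remain distinct strings, so the policy questions above
are still open.

```python
import unicodedata


def form_insensitive_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


plain = "nicolas"
precomposed = "nicol\u00e1s"   # U+00E1 LATIN SMALL LETTER A WITH ACUTE
decomposed = "nicola\u0301s"   # "a" followed by U+0301 COMBINING ACUTE ACCENT
cyrillic = "nicol\u0430s"      # U+0430 CYRILLIC SMALL LETTER A
greek = "nicol\u03b1s"         # U+03B1 GREEK SMALL LETTER ALPHA

# The two spellings of the accented name differ as code point sequences...
print(precomposed == decomposed)                          # False
# ...but compare equal once both are normalized.
print(form_insensitive_equal(precomposed, decomposed))    # True

# Normalization does not merge the unaccented name or the look-alikes.
print(form_insensitive_equal(plain, precomposed))         # False
print(form_insensitive_equal(plain, cyrillic))            # False
print(form_insensitive_equal(plain, greek))               # False

# Even compatibility normalization leaves the Cyrillic look-alike alone.
print(unicodedata.normalize("NFKC", cyrillic) == plain)   # False
```

Nothing in this sketch says whether "nicolas" and "nicolás" should be
allowed as two separate usernames.  That remains a policy decision
layered on top of the comparison.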

Now extend this to the many other characters that can look similar
(say, "n" vs "ñ" in Spanish).  Extend it to other language-related
issues ("i" vs "ı" vs "İ" vs "I" in Turkish; all the character
variants in Arabic).
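
The Turkish case can be illustrated with a short Python sketch.  It
uses only the built-in str methods, which apply Unicode's default
(locale-independent) case mappings.  The variable names are made up
for illustration.  Under Turkish rules "I" pairs with "ı" and "İ"
pairs with "i", and the default mappings do not follow that pairing.

```python
dotted_capital = "\u0130"   # İ  LATIN CAPITAL LETTER I WITH DOT ABOVE
dotless_small = "\u0131"    # ı  LATIN SMALL LETTER DOTLESS I

# Default lowercasing maps "I" to "i"; Turkish expects "ı".
print("I".lower() == "i")                                   # True

# Default uppercasing maps "ı" to "I", which matches Turkish...
print(dotless_small.upper() == "I")                         # True
# ...but default case folding does not treat "I" and "ı" as the same.
print("I".casefold() == dotless_small.casefold())           # False

# Default lowercasing of "İ" is "i" plus U+0307 COMBINING DOT ABOVE,
# so it does not match a plain "i" either.
print(dotted_capital.lower() == "i\u0307")                  # True
print(dotted_capital.lower() == "i")                        # False
```

So a locale-blind "case-insensitive" comparison will treat a Turkish
user's "I" and "ı" as different, and getting the Turkish behavior
requires knowing the language, which the bit strings alone do not
carry.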

These are only some of the reasons it's difficult.  And the number of
people who stand up and say, "oh, just <do this> and the problem is
solved," demonstrates that too too too many people *think* they
understand... and don't.

Barry