Re: Possible BofF question -- I18n

----- Original Message -----
From: "Barry Leiba" <barryleiba@xxxxxxxxxxxx>
Sent: Tuesday, June 05, 2018 6:50 AM

On Mon, Jun 4, 2018 at 11:10 PM, Nico Williams <nico@xxxxxxxxxxxxxxxx>
wrote:
>> We're in a space where the evaluation of A==B depends on more than
>> the bit strings A and B. Your post about form-insensitive filename
>> comparisons is a case in point, although I don't pretend to understand
>> it. OK, we can argue whether that's a dark art or simply complicated
>
>   form_insensitive_strcmp(a, b) == memcmp(normalize(a), normalize(b))
>
> Except that actually one can greatly optimize this to avoid most of the
> compute and memory cost of normalization.
>
> To see why, consider comparing my first name as I usually write it
> (Nicolas) vs.  how it should be written (Nicolás).  The two strings
> should compare as not equivalent.  But the two ways to write the second
> form (with the ´ precomposed vs. decomposed) should compare as
> equivalent (because they are).
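
The comparison quoted above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: it uses NFC as
the normalization form (the thread does not say which form is meant),
the function name form_insensitive_equal is made up for this example,
and it normalizes both strings in full rather than attempting the
optimization mentioned above.

```python
import unicodedata

def form_insensitive_equal(a: str, b: str) -> bool:
    # Assumption: NFC. For a pure equality test NFD would serve equally well.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

plain = "Nicolas"
precomposed = "Nicol\u00e1s"   # U+00E1 LATIN SMALL LETTER A WITH ACUTE
decomposed = "Nicola\u0301s"   # "a" followed by U+0301 COMBINING ACUTE ACCENT

print(precomposed == decomposed)                        # False: raw code points differ
print(form_insensitive_equal(precomposed, decomposed))  # True
print(form_insensitive_equal(plain, precomposed))       # False
```

Normalization only removes the difference between the two encodings of
the same accented letter; it deliberately keeps "Nicolas" and "Nicolás"
distinct.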

But there's one of the things that makes this a complicated topic:

- we say that "nicolas" is not equivalent to "nicolás"
- but we say that "nicolás" *is* equivalent to "nicola´s", and we
handle this using normalization
- does that mean that it's OK to have "nicolas" and "nicolás" as two
different usernames assigned to two different users?
- if yes, how do we deal with the human interface issues involved?
What happens if the human identified as "nicolás" uses an input
mechanism that doesn't have a way to enter "á"?  How can he log in?
- if no, how do we make sure (in an automated way) that we don't make
that assignment?
- does the answer change if "nicolás" is a domain name instead of a
username?
- does the answer change if "nicolás" is a *password*?
- and what about "nicolàs"?  and "nicolâs"?  and "nicoläs"?
- what about "nicolаs" (that's a Cyrillic character in the penultimate
position)?
- what about "nicolαs" (that's a Greek character in the penultimate
position)?
- what about other Unicode characters that look like "a", either
exactly (as with Cyrillic) or closely (as with Greek)?
- what about handling of "ä" vs "ae"?  Do we want to avoid assigning
"käse" and "kaese" as distinct usernames?  Does the answer to this
differ depending upon whether the language is German (where using "ae"
to represent "ä" is common) or Swedish (where it is not)?
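
As a rough illustration of why the look-alike cases in the list above
are not solved by normalization, the following Python sketch shows
that even NFKC leaves the Cyrillic and Greek variants distinct from
the Latin spelling.  The helper scripts_used is hypothetical and uses
the first word of each Unicode character name as a crude stand-in for
the script property, which the standard library does not expose
directly; a real deployment would need proper script and confusable
data.

```python
import unicodedata

def scripts_used(s: str) -> set:
    # Crude proxy (assumption): the first word of the character name,
    # e.g. "LATIN", "CYRILLIC", "GREEK". Unnamed characters map to "UNKNOWN".
    return {unicodedata.name(ch, "UNKNOWN").split()[0] for ch in s}

names = {
    "latin": "nicolas",
    "cyrillic": "nicol\u0430s",  # U+0430 CYRILLIC SMALL LETTER A
    "greek": "nicol\u03b1s",     # U+03B1 GREEK SMALL LETTER ALPHA
}

for label, name in names.items():
    same_after_nfkc = unicodedata.normalize("NFKC", name) == "nicolas"
    print(label, same_after_nfkc, sorted(scripts_used(name)))
```

A mixed-script result such as CYRILLIC plus LATIN is one possible
signal for flagging a name, but it says nothing about whether two
names are acceptable as distinct identifiers; that remains the policy
question raised in the list.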

Now extend this to the many other characters that can look similar
(say, "n" vs "ñ" in Spanish).  Extend it to other language-related
issues ("i" vs "ı" vs "İ" vs "I" in Turkish; all the character
variants in Arabic).
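
The Turkish case can be made concrete with a short Python sketch.  It
relies only on the built-in, locale-independent str.upper() and
str.lower(), so it shows what a language-unaware implementation does,
not what a Turkish-aware one should do.

```python
dotless_i = "\u0131"     # ı LATIN SMALL LETTER DOTLESS I
dotted_cap_i = "\u0130"  # İ LATIN CAPITAL LETTER I WITH DOT ABOVE

# Uppercasing the dotless i gives a plain ASCII "I" ...
print(dotless_i.upper() == "I")                    # True
# ... so an upper/lower round trip does not return the original letter.
print(dotless_i.upper().lower() == dotless_i)      # False
# Lowercasing the dotted capital gives "i" plus U+0307 COMBINING DOT ABOVE.
print(dotted_cap_i.lower() == "i\u0307")           # True
print(len(dotted_cap_i.lower()))                   # 2
```

Any comparison that case-maps without knowing the language therefore
treats "ı" and "i" as the same letter after a round trip, which is
wrong for Turkish text.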

These are only some of the reasons it's difficult.  And the number of
people who stand up and say, "oh, just <do this> and the problem is
solved," demonstrates that too too too many people *think* they
understand... and don't.

<tp>

Barry,

This is a cut-out-and-keep e-mail that I shall still be referring to in
10 years' time because it summarises the problems so beautifully.  It
is also the kind of data that led me upthread to assert, contentiously,
that only Europeans were likely to know what it was about, since they
had been living with it for decades in a way that most Americans had
not.  I mentioned CJK, but John rightly pointed out that, worldwide, it
was far worse, with right-to-left, vertical and so on, and so the most
skills may now lie further afield.

Two thoughts.  One is that your e-mail displayed superbly (the only
glitch being that my MUA did not differentiate the Cyrillic character),
so I looked at the encoding:

Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

So that is something we got brilliantly right a long time ago.  (The bad
news is that I get an ever-growing number of messy e-mails from senders
who use complicated Unicode characters instead of ASCII punctuation, a
sort of single quote being the commonest, so the technology, for me,
gets misused.)

My second thought is that much has been done in the IETF on security in
recent times, but have we done enough to at least publicise, if not
eliminate, the scope for evil actors to exploit confusable and suchlike
characters, by saying that they SHOULD NOT be used anywhere where it
matters for security?  People SHOULD NOT be handed the rope with which
to hang themselves on a plate:-)  I suspect not.

Tom Petch


Barry



