Re: Possible BofF question -- I18n

On Tue, Jun 5, 2018 at 1:50 AM Barry Leiba <barryleiba@xxxxxxxxxxxx> wrote:
On Mon, Jun 4, 2018 at 11:10 PM, Nico Williams <nico@xxxxxxxxxxxxxxxx> wrote:
>> We're in a space where the evaluation of A==B depends on more than
>> the bit strings A and B. Your post about form-insensitive filename
>> comparisons is a case in point, although I don't pretend to understand
>> it. OK, we can argue whether that's a dark art or simply complicated
>
>   form_insensitive_strcmp(a, b) == memcmp(normalize(a), normalize(b))
>
> Except that actually one can greatly optimize this to avoid most of the
> compute and memory cost of normalization.
>
> To see why consider comparing my first name as I usually write it
> (Nicolas) vs.  how it should be written (Nicolás).  The two strings
> should compare as not equivalent.  But the two ways to write the second
> form (with the ´ precomposed vs. decomposed) should compare as
> equivalent (because they are).

But there's one of the things that makes this a complicated topic:

I was describing a specific primitive, but I do like your taking this further:

- we say that "nicolas" is not equivalent to "nicolás"
- but we say that "nicolás" *is* equivalent to "nicola´s", and we
handle this using normalization

Right, that is simple enough.  For some value of simple.  You need normalization code (which isn't trivial), then it is simple.
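
A minimal sketch of that primitive in Python, assuming the standard-library unicodedata module and NFC as the normalization form (both are assumptions for illustration; the thread does not fix either). The function name form_insensitive_equal is hypothetical. The ASCII check is one simple instance of the kind of shortcut mentioned earlier: pure-ASCII strings are already normalized, so they can be compared directly.

```python
import unicodedata


def form_insensitive_equal(a: str, b: str) -> bool:
    """Compare two strings while ignoring Unicode normalization form."""
    # Fast path: ASCII-only strings are already in NFC, so no
    # normalization work is needed.
    if a.isascii() and b.isascii():
        return a == b
    # Slow path: normalize both sides to the same form, then compare.
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


precomposed = "nicol\u00e1s"   # U+00E1 LATIN SMALL LETTER A WITH ACUTE
decomposed = "nicola\u0301s"   # 'a' followed by U+0301 COMBINING ACUTE ACCENT

print(form_insensitive_equal(precomposed, decomposed))  # True
print(form_insensitive_equal("nicolas", precomposed))   # False
```

A production version would more likely walk both strings and normalize only the short runs around non-ASCII characters, rather than normalizing whole strings as this sketch does.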

- does that mean that it's OK to have "nicolas" and "nicolás" as two
different usernames assigned to two different users?

In filesystems there's also the question of whether to be case-sensitive, and that can be a per-filesystem opt-in.

As to usernames, principal names, and so on, well, it's a rather subjective choice.  "Nicolas" is perfectly correct in French, and is distinct from "Nicolás", though it can be confusing, especially if you have software that cannot display accents...

Now, the obvious question is: is this something a protocol should address by making 'á' equivalent to 'a' globally, or should this be policy local to an appropriate administrative domain?  Well, that's a bit of a judgement call, but the best option is to give people the freedom to make that choice where possible.  Thus, not globally treating the accented forms of 'a' as equivalent to each other or to plain 'a' is the better approach.

In terms of DNS, consider a proposal to ban mixing of scripts in any one label...  But in South Korea it is common to mix Hangul with -ing endings, so why should .kr not be allowed to use at least that sort of script mixing?  There are almost certainly other similar cases, and more will arise as culture evolves!

Who is in a better position than the registries to make such a decision?  Certainly NOT the IETF, not any one participant and not the IETF collectively.

There is a big difference between form equivalence (same exact character, two or more ways to represent it at the codepoint level) and confusables.  We can trivially (see above) deal with the former, but the latter is going to need local policy.  I really don't see a better answer re: confusables, and I know that's not a popular opinion, but I don't think it's wrong.

- if yes, how do we deal with the human interface issues involved?
What happens if the human identified as "nicolás" uses an input
mechanism that doesn't have a way to enter "á"?  How can he log in?

Answered above.

- if no, how do we make sure (in an automated way) that we don't make
that assignment?

This one is easy: IF you really want this (I don't think we should want this globally), decompose (normalize to NFD), then drop the combining codepoints.  This answer won't work for cross-script confusables, naturally, which is partly why I wouldn't recommend this approach.
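
A rough sketch of that decompose-and-strip step in Python, assuming the standard-library unicodedata module; the helper name strip_combining_marks is hypothetical, and "combining codepoint" is taken here to mean any codepoint with a nonzero canonical combining class. It also shows the limitation: a Cyrillic look-alike is a different base letter, so stripping combining marks does not unify it with Latin 'a'.

```python
import unicodedata


def strip_combining_marks(s: str) -> str:
    """Normalize to NFD, then drop codepoints with a nonzero combining class."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)


# Accented Latin forms all collapse to the same skeleton.
print(strip_combining_marks("nicol\u00e1s") == "nicolas")  # True  (a with acute)
print(strip_combining_marks("nicol\u00e4s") == "nicolas")  # True  (a with diaeresis)

# U+0430 CYRILLIC SMALL LETTER A has no decomposition, so it survives.
print(strip_combining_marks("nicol\u0430s") == "nicolas")  # False
```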

- does the answer change if "nicolás" is a domain name instead of a username?

Same answer!  The local authority (here: the registry) should decide this, write a policy, and enforce it (by having the registrars implement it).  (See comments below about user-agents as vessels for local policy as well.)

I don't think we can make such a policy globally that doesn't risk angering some local communities.

- does the answer change if "nicolás" is a *password*?

It can.  Losing some entropy in a password might be safe, but this is simpler as a global policy rather than as local policy.  It's even simpler to tell users to only use characters they can reliably input on all devices (this isn't as trivial as it should be, but by and large this approach works).

- and what about "nicolàs"?  and "nicolâs"?  and "nicoläs"?
- what about "nicolаs" (that's a Cyrillic character in the penultimate
position)?
- what about "nicolαs" (that's a Greek character in the penultimate position)?
- what about other Unicode characters that look like "a", either
exactly (as with Cyrillic) or closely (as with Greek)?
- what about handling of "ä" vs "ae"?  Do we want to avoid assigning
"käse" and "kaese" as distinct usernames?

Same answers as above.

Does the answer to this
differ depending upon whether the language is German (where using "ae"
to represent "ä" is common) or Swedish (where it is not)?

Only if the context can let an end-user choose one (or more) language(s).  DNS, for example, cannot.  A filesystem cannot either.  Text documents / word processors can (and might, especially in a search function).

Now extend this to the many other characters that can look similar
(say, "n" vs "ñ" in Spanish).  Extend it to other language-related
issues ("i" vs "ı" vs "İ" vs "I" in Turkish; all the character
variants in Arabic).

Same answers as above.

The protocols should be permissive.  Local policies should be less so - perhaps no more permissive than is absolutely necessary.

Note that a user-agent is also a place where local policy can be applied.  In fact, there exist browser extensions to deal with confusables.

These are only some of the reasons it's difficult.  And the number of
people who stand up and say, "oh, just <do this> and the problem is
solved," demonstrates that too too too many people *think* they
understand... and don't.

It's difficult because our world culture has globalized while at the same time we are not willing to unify confusable characters.  When I say "we" here I mean mankind in all its local polities.  We've tried Han unification, and that failed as a matter of politics.  We (the IETF) can hate this all we like, but we cannot change it and should not even try.  We've talked about human rights and I18N; some might say that getting their characters drawn the way they want, without needing a user context, is a human right.  These are global political issues way beyond the IETF's reach.

Nico