Re: Possible BofF question -- I18n

Christian Huitema <huitema@xxxxxxxxxxx> · Mon, 4 Jun 2018 23:33:46 -0700



    On 6/4/2018 10:50 PM, Barry Leiba wrote:

      
      .. a long list of questions, such as 

    
- we say that "nicolas" is not equivalent to "nicolás"
- but we say that "nicolás" *is* equivalent to "nicola´s", and we
handle this using normalization
- does that mean that it's OK to have "nicolas" and "nicolás" as two
different usernames assigned to two different users?
- if yes, how do we deal with the human interface issues involved?
What happens if the human identified as "nicolás" uses an input
mechanism that doesn't have a way to enter "á"?  How can he log in?
- if no, how do we make sure (in an automated way) that we don't make
that assignment?
- does the answer change if "nicolás" is a domain name instead of a username?
- does the answer change if "nicolás" is a *password*?
- and what about "nicolàs"?  and "nicolâs"?  and "nicoläs"?
- what about "nicolаs" (that's a Cyrillic character in the penultimate
position)?
- what about "nicolαs" (that's a Greek character in the penultimate position)?
- what about other Unicode characters that look like "a", either
exactly (as with Cyrillic) or closely (as with Greek)?
- what about handling of "ä" vs "ae"?  Do we want to avoid assigning
"käse" and "kaese" as distinct usernames?  Does the answer to this
differ depending upon whether the language is German (where using "ae"
to represent "ä" is common) or Swedish (where it is not)?
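
A few of the distinctions in the list above can be checked mechanically. The following is only a minimal sketch, assuming Python's standard `unicodedata` module and NFC as the normalization form; the helper name `same_after_nfc` is hypothetical, and nothing here is a recommendation for any particular protocol. It shows that normalization merges the precomposed and decomposed spellings of "á", but does nothing about the unaccented spelling or about look-alike letters from other scripts.

```python
import unicodedata


def same_after_nfc(a: str, b: str) -> bool:
    """Compare two strings after NFC normalization (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


plain = "nicolas"               # unaccented Latin "a"
precomposed = "nicol\u00e1s"    # "á" as a single code point (U+00E1)
decomposed = "nicola\u0301s"    # "a" followed by combining acute (U+0301)
cyrillic = "nicol\u0430s"       # Cyrillic small a (U+0430) in place of Latin "a"
greek = "nicol\u03b1s"          # Greek small alpha (U+03B1) in place of Latin "a"

# The two spellings of "á" differ as raw code point sequences...
raw_equal = precomposed == decomposed
# ...but normalization makes them compare equal.
normalized_equal = same_after_nfc(precomposed, decomposed)

# Normalization does not merge the accented and unaccented names,
# and it does not merge look-alike letters from other scripts.
accent_merged = same_after_nfc(plain, precomposed)
cyrillic_merged = same_after_nfc(plain, cyrillic)
greek_merged = same_after_nfc(plain, greek)
```

Whether "nicolas" and "nicolás" should be allowed as distinct usernames, or how "ä" relates to "ae", is a policy question that no normalization form answers.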
    
    
    When I look at these questions, I can't help thinking that we are
    trying to deal with human interface issues at the wrong layer. Or
    rather, that there are some layers at which the human interface
    issues are paramount, and some layers at which it is much better to
    deal with binary strings. 

    
    For example, if I were writing a mail UI, I would be very concerned
    with the representation of names and other strings. But then I would
    have tools. I can consult with interaction designers, I can run the
    proposed UI designs through user panels, I can design specific UI
    for specific subsets of users, I can get feedback from beta users, I
    can analyze the telemetry, I can push software updates to fix my
    inevitable mistakes.

    
    On the other hand, if I am writing an SMTP MTA, a DNS recursive
    resolver, or a SIP server, I don't have any of those tools at my
    disposal. My server is supposed to implement the specified protocol
    exactly. I will only get indirect feedback from users who may not
    even be aware of the server's presence. I will get telemetry about
    my server's performance, but I won't be able to measure the level of
    befuddlement of the users whose packets were processed.

    
    Forty years ago, we started down a slippery slope with a basic
    normalization process -- considering lower and upper case letters as
    equivalent. That was probably justified by the hardware of the time,
    when some devices could only produce upper case letters, something
    like the Telex alphabet. But we slid down the slope with
    enthusiasm, embedding case insensitive comparisons in all kinds of
    protocols, and then attempting to extend the concept piecemeal to a
    variety of languages.

    
    In hindsight, that was a bad idea. It leads to an expectation that
    intermediaries not only can "normalize" character strings, but are
    expected to do it. Barry gives some great examples of that silliness
    with variations of European alphabets, but if I understand correctly
    the same games can be played with Arabic/Persian letters or with
    variations of the Chinese characters, and probably with quite a
    number of different scripts.

    
    Text comparison looks fundamentally like a human interaction
    engineering issue, and a very hard one at that. I can't believe for
    a minute that engineers writing code for message passing servers
    will deal with that sort of problem without making a mess of it.
    Besides, it is not obvious at all that there is one single right
    answer to these questions. So my BofF question would not be "how to
    educate the engineers on the fine points of normalizing Unicode
    strings", but rather, can we layer the designs so that "the network"
    handles binary, and only specialized systems handle the mapping from
    binary to "meaning"?
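
One way to picture the layering asked about above is the following sketch. It is an illustration under stated assumptions, not a description of any existing protocol: the function names `to_wire` and `server_lookup` are hypothetical, and the choice of NFC plus UTF-8 at the edge is just one possible policy. The point is only that the "network" side compares bytes exactly, while the mapping from what a human typed to those bytes lives in a separate, replaceable layer.

```python
import unicodedata
from typing import Dict, Optional


def to_wire(user_input: str) -> bytes:
    """Edge/UI layer (hypothetical policy): map human input to wire bytes.

    This is where human-interface decisions would live. NFC + UTF-8 is
    assumed here purely for illustration.
    """
    return unicodedata.normalize("NFC", user_input).encode("utf-8")


def server_lookup(table: Dict[bytes, str], key: bytes) -> Optional[str]:
    """Network layer: exact byte-for-byte match, no interpretation."""
    return table.get(key)


# The server only ever stores and compares opaque byte strings.
mailboxes = {to_wire("nicol\u00e1s"): "mailbox-1"}

# A client that applies the edge policy finds the entry even when the
# user typed the decomposed form ("a" + combining acute).
via_edge = server_lookup(mailboxes, to_wire("nicola\u0301s"))

# Bytes sent without the edge policy simply do not match; the server
# does not try to guess what the sender meant.
raw_decomposed = server_lookup(mailboxes, "nicola\u0301s".encode("utf-8"))
raw_plain = server_lookup(mailboxes, b"nicolas")
```

In this arrangement a change of policy (a different normalization form, confusable detection, language-specific rules) would touch only `to_wire`, not the server.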

    
    -- Christian Huitema