> On 6/4/2018 10:50 PM, Barry Leiba wrote:
> .. a long list of questions, such as
> > - we say that "nicolas" is not equivalent to "nicolás"
> > - but we say that "nicolás" *is* equivalent to "nicola´s", and we handle this using normalization
> > - does that mean that it's OK to have "nicolas" and "nicolás" as two different usernames assigned to two different users?
> > - if yes, how do we deal with the human interface issues involved? What happens if the human identified as "nicolás" uses an input mechanism that doesn't have a way to enter "á"? How can he log in?
> > - if no, how do we make sure (in an automated way) that we don't make that assignment?
> > - does the answer change if "nicolás" is a domain name instead of a username?
> > - does the answer change if "nicolás" is a *password*?
> > - and what about "nicolàs"? and "nicolâs"? and "nicoläs"?
> > - what about "nicolаs" (that's a Cyrillic character in the penultimate position)?
> > - what about "nicolαs" (that's a Greek character in the penultimate position)?
> > - what about other Unicode characters that look like "a", either exactly (as with Cyrillic) or closely (as with Greek)?
> > - what about handling of "ä" vs "ae"? Do we want to avoid assigning "käse" and "kaese" as distinct usernames? Does the answer to this differ depending upon whether the language is German (where using "ae" to represent "ä" is common) or Swedish (where it is not)?

> When I look at these questions, I can't help thinking that we are trying to deal with human interface issues at the wrong layer. Or rather, that there are some layers at which the human interface issues are paramount, and some layers at which it is much better to deal with binary strings.

And some where they get conflated.

> For example, if I were writing a mail UI, I would be very concerned with the representation of names and other strings. But then I would have tools. I can consult with interaction designers, I can run the proposed UI designs through user panels, I can design specific UI for specific subsets of users, I can get feedback from beta users, I can analyze the telemetry, I can push software updates to fix my inevitable mistakes. On the other hand, if I am writing an SMTP MTA, a DNS recursive resolver, or a SIP server, I don't have any of those tools at my disposal.

I can't speak to the DNS resolver or SIP server, but in the case of an MTA, this is incorrect. The fact is that MTAs deal with quite a few i18n issues, especially if you implement standards like EAI.
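To make the first few items in Barry's list concrete, here is a minimal Python sketch (standard library only; the strings are purely illustrative, and NFC is used only as an example policy, not as a claim about what any particular implementation does):

    # Minimal sketch of the equivalences in Barry's list (Python 3).
    # The strings are illustrative, not real identifiers.
    import unicodedata

    precomposed = "nicol\u00e1s"   # "nicolás" using U+00E1 (a with acute)
    decomposed  = "nicola\u0301s"  # "a" followed by U+0301 COMBINING ACUTE ACCENT
    plain       = "nicolas"        # no accent at all
    lookalike   = "nicol\u0430s"   # U+0430 CYRILLIC SMALL LETTER A in the penultimate position

    def nfc(s):
        # Normalization Form C composes base-plus-combining-mark sequences
        # into single code points where possible.
        return unicodedata.normalize("NFC", s)

    # Normalization makes the precomposed and decomposed forms compare equal...
    assert nfc(precomposed) == nfc(decomposed)
    # ...but it does not equate the accented and unaccented spellings...
    assert nfc(plain) != nfc(precomposed)
    # ...and it does nothing about the Cyrillic look-alike, which is a
    # different code point that merely renders the same way.
    assert nfc(lookalike) != nfc(precomposed)

Everything past that first assertion -- the confusables, the "ä" vs "ae" question, and so on -- is policy that normalization by itself does not answer.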
> My server is supposed to exactly implement the specified protocol. I will only get indirect feedback from users who maybe are not even aware of the server's presence. I will get telemetry about my server's performance, but I won't be able to measure the level of befuddlement of the users whose packets were processed.

I can assure you user befuddlement is carefully monitored (support calls being a cost center and all), and if those measurements point at a component like an MTA being the cause, it gets communicated. Loudly.

> Forty years ago, we started a path on a slippery slope with a basic normalization process -- considering lower and upper case letters as equivalent. That was probably justified by the hardware of the time, when some devices could only produce upper case letters, something like the Telex alphabet. But we slipped on the slope with enthusiasm, embedding case-insensitive comparisons in all kinds of protocols, and then attempting to extend the concept piecemeal to a variety of languages.

It's also justified by user expectations. Like it or not, people aren't computers and aren't terribly good at retaining case. As I said in my previous response, it's going to be really interesting to see how this plays out with EAI addresses.

> In hindsight, that was a bad idea. It leads to an expectation that intermediaries not only can "normalize" character strings, but are expected to do it. Barry gives some great examples of that silliness with variations of European alphabets, but if I understand correctly the same games can be played with Arabic/Persian letters or with variations of the Chinese characters, and probably with quite a number of different scripts.

> Text comparison looks fundamentally like a human interaction engineering issue, and a very hard one at that. I can't believe for a minute that engineers writing code for message passing servers will deal with that sort of problem without making a mess of it. Besides, it is not obvious at all that there is one single right answer to these questions.

Actually, it's quite clear that there is no single "right" answer. But it's also clear that there needs to be a single answer, because as I said before, address comparison has to work for all addresses, even ones outside your administrative domain. (There's a sketch of why in the P.S. below.)

> So my BofF question would not be "how to educate the engineers on the fine points of normalizing Unicode strings", but rather, can we layer the designs so that "the network" handles binary, and only specialized systems handle the mapping from binary to "meaning"?

If you include infrastructure components like MTAs in your definition of "network", the answer is no, you can't.

Ned
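P.S. To make the "single answer" point concrete, here is a small sketch of what happens when two implementations pick different comparison policies for the same pair of addresses. The addresses and both policies are hypothetical; the only point is that the two sides disagree:

    # Two spellings of what the user considers the same address: one with a
    # precomposed "á", one with a combining accent and different case.
    import unicodedata

    addr_sent    = "nicol\u00e1s@example.com"
    addr_on_file = "Nicola\u0301s@example.com"

    def policy_a(addr):
        # Hypothetical policy A: NFC-normalize, then case-fold the whole address.
        return unicodedata.normalize("NFC", addr).casefold()

    def policy_b(addr):
        # Hypothetical policy B: compare raw code points, no normalization at all.
        return addr

    print(policy_a(addr_sent) == policy_a(addr_on_file))  # True
    print(policy_b(addr_sent) == policy_b(addr_on_file))  # False

A sender applying policy A and a receiver applying policy B will not agree on whether these are the same mailbox, which is exactly why the comparison rule can't be left as a per-site choice.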