On 6/4/2018 10:50 PM, Barry Leiba wrote:
> .. a long list of questions, such as
>
> - we say that "nicolas" is not equivalent to "nicolás"
> - but we say that "nicolás" *is* equivalent to "nicola´s", and we
>   handle this using normalization
> - does that mean that it's OK to have "nicolas" and "nicolás" as two
>   different usernames assigned to two different users?
> - if yes, how do we deal with the human interface issues involved?
>   What happens if the human identified as "nicolás" uses an input
>   mechanism that doesn't have a way to enter "á"? How can he log in?
> - if no, how do we make sure (in an automated way) that we don't make
>   that assignment?
> - does the answer change if "nicolás" is a domain name instead of a
>   username?
> - does the answer change if "nicolás" is a *password*?
> - and what about "nicolàs"? and "nicolâs"? and "nicoläs"?
> - what about "nicolаs" (that's a Cyrillic character in the
>   penultimate position)?
> - what about "nicolαs" (that's a Greek character in the penultimate
>   position)?
> - what about other Unicode characters that look like "a", either
>   exactly (as with Cyrillic) or closely (as with Greek)?
> - what about handling of "ä" vs "ae"? Do we want to avoid assigning
>   "käse" and "kaese" as distinct usernames? Does the answer to this
>   differ depending upon whether the language is German (where using
>   "ae" to represent "ä" is common) or Swedish (where it is not)?

When I look at these questions, I can't help thinking that we are trying to deal with human interface issues at the wrong layer. Or rather, that there are some layers at which the human interface issues are paramount, and some layers at which it is much better to deal with binary strings.

For example, if I were writing a mail UI, I would be very concerned with the representation of names and other strings. But then I would have tools.
I can consult with interaction designers, I can run the proposed UI designs through user panels, I can design specific UI for specific subsets of users, I can get feedback from beta users, I can analyze the telemetry, I can push software updates to fix my inevitable mistakes.

On the other hand, if I am writing an SMTP MTA, a DNS recursive resolver, or a SIP server, I don't have any of those tools at my disposal. My server is supposed to implement the specified protocol exactly. I will only get indirect feedback from users who may not even be aware of the server's presence. I will get telemetry about my server's performance, but I won't be able to measure the level of befuddlement of the users whose packets were processed.

Forty years ago, we started down a slippery slope with a basic normalization process -- considering lower and upper case letters as equivalent. That was probably justified by the hardware of the time, when some devices could only produce upper case letters, something like the Telex alphabet. But we slid down the slope with enthusiasm, embedding case-insensitive comparisons in all kinds of protocols, and then attempting to extend the concept piecemeal to a variety of languages.

In hindsight, that was a bad idea. It leads to an expectation that intermediaries not only can "normalize" character strings, but are expected to do it. Barry gives some great examples of that silliness with variations of European alphabets, but if I understand correctly the same games can be played with Arabic/Persian letters or with variations of the Chinese characters, and probably with quite a number of different scripts.

Text comparison looks fundamentally like a human interaction engineering issue, and a very hard one at that. I can't believe for a minute that engineers writing code for message passing servers will deal with that sort of problem without making a mess of it. Besides, it is not obvious at all that there is one single right answer to these questions.
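
To make the first few quoted cases concrete, here is a minimal Python sketch using only the standard `unicodedata` module. It assumes that the quoted "nicola´s" stands for "a" followed by the combining acute accent U+0301, and the helper names `nfc_equal` and `octets_equal` are made up for this illustration. It is not a recommendation of any particular comparison rule, only a way to see what normalization does and does not cover.

```python
import unicodedata


def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after Unicode NFC normalization (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def octets_equal(a: str, b: str) -> bool:
    """Compare the raw UTF-8 octets, with no interpretation at all (hypothetical helper)."""
    return a.encode("utf-8") == b.encode("utf-8")


precomposed = "nicol\u00e1s"   # U+00E1 LATIN SMALL LETTER A WITH ACUTE
decomposed = "nicola\u0301s"   # "a" + U+0301 COMBINING ACUTE ACCENT (assumed reading)
plain = "nicolas"              # ASCII only
cyrillic = "nicol\u0430s"      # U+0430 CYRILLIC SMALL LETTER A
greek = "nicol\u03b1s"         # U+03B1 GREEK SMALL LETTER ALPHA

print(nfc_equal(precomposed, decomposed))     # True: canonically equivalent
print(octets_equal(precomposed, decomposed))  # False: different octets on the wire
print(nfc_equal(precomposed, plain))          # False: the accent is significant
print(nfc_equal(plain, cyrillic))             # False: look-alikes are not equivalent
print(nfc_equal(plain, greek))                # False
```

Normalization settles only the precomposed versus combining case. The accent, look-alike, and "ä" versus "ae" questions are untouched by it, and the octet comparison is the only one of the two that needs no knowledge of the text at all.
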

So my BoF question would not be "how to educate the engineers on the fine points of normalizing Unicode strings", but rather, can we layer the designs so that "the network" handles binary, and only specialized systems handle the mapping from binary to "meaning"?

-- Christian Huitema