Kurt,

Just for clarification... I fear that parts of this note are going to be a mini-tutorial on some Unicode subtleties, but one can't understand this issue without them, and I suspect some interested readers don't have that level of understanding.

--On Tuesday, September 15, 2009 15:28 +0100 Kurt Zeilenga <Kurt.Zeilenga@xxxxxxxxx> wrote:

> I strongly oppose such an 'or' as SASLprep and Net-UTF-8 uses
> different Unicode normalization algorithms.

Well, not really.

>...
> RFC 5198 says 'all character sequences SHOULD be normalized
> according to Unicode normalization form "NFC" (see Section 3).'
> RFC 4013 says 'This profile specifies using Unicode
> normalization form KC, as described in Section 4 of
> [StringPrep].'

First, I know that you know this, but to be sure no one is confused by a slight terminology difference: "normalization form KC" and "normalization form NFKC" are exactly the same thing. The latter is a little redundant, but commonly used.

Now, NFKC processing is a proper superset of NFC processing. NFC provides what is called, in Unicode-speak, "canonical composition" -- turning different ways of expressing exactly the same character into a single standard form. For example, applying toNFC to Latin Small Letter U (U+0075) followed by Combining Diaeresis (U+0308) yields Latin Small Letter U with Diaeresis (U+00FC), while applying it to U+00FC yields U+00FC itself. Without NFC (or NFD, but that is another topic), simple string comparisons may fail depending on how a character is entered at the keyboard. That is generally a bad idea.

Unless one permits embedded newline characters in one's "character sequences", the main difference between "just UTF-8" (RFC 3629) and RFC 5198 is that the latter requires NFC-compliant strings; RFC 3629 doesn't require NFC, much less NFKC (see below).

NFKC is a more complex operation, combining canonical composition with "compatibility composition" -- replacement of characters that Unicode has identified as being in the standard for compatibility purposes with their base forms. There is a wide variety of compatibility characters. Some, such as the East Asian width variants, are as surely "the same character" as the U-with-Diaeresis example above. Others are the same (or not) only in context. For example, there is a large number of "mathematical" letter characters that, if used in non-mathematical running text, are simply font variations (consider the relationship between Mathematical Bold Script Small A (U+1D4EA) and Latin Small Letter A (U+0061)) but that, if used in mathematical contexts, are fundamentally different characters, at least according to several mathematical societies and publishers. Applying toNFKC to U+1D4EA yields U+0061, but applying toNFC to the same character yields the character itself. And still others are much more different. However, any string in NFKC form is, by definition, in NFC form.

Now, with the understanding that this is a comment about SASLprep rather than about the current I-D, but that it may be part of Simon's motivation and certainly is part of mine, it is really unclear whether applying the NFKC transformation to things like identifiers and passphrases in security contexts is a good idea.
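For anyone who wants to see those transformations concretely rather than take my word for them, here is a minimal sketch using Python's standard unicodedata module. This is purely my illustration -- none of the RFCs discussed here specify Python or this module -- but the assertions all hold:

    import unicodedata

    # NFC: canonical composition.  U+0075 followed by U+0308
    # composes to the single code point U+00FC.
    decomposed = "\u0075\u0308"   # LATIN SMALL LETTER U + COMBINING DIAERESIS
    composed = "\u00FC"           # LATIN SMALL LETTER U WITH DIAERESIS
    assert unicodedata.normalize("NFC", decomposed) == composed
    assert unicodedata.normalize("NFC", composed) == composed   # already in NFC

    # NFKC adds compatibility composition on top of canonical composition.
    math_a = "\U0001D4EA"         # MATHEMATICAL BOLD SCRIPT SMALL A
    assert unicodedata.normalize("NFC", math_a) == math_a       # NFC leaves it alone
    assert unicodedata.normalize("NFKC", math_a) == "\u0061"    # NFKC folds it to 'a'

    # And any NFKC output is, by definition, already in NFC form.
    s = unicodedata.normalize("NFKC", decomposed + math_a)
    assert unicodedata.normalize("NFC", s) == s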
If I know that I'm going to be in environments in which I know how to type U+1D4EA and know that it can be processed appropriately, it is a nearly ideal element of a string used as a security identifier or passphrase: it, and its many relatives, vastly increase the repertoire of characters available to me, and hence the potential entropy in such a string; an attacker doing shoulder-surfing may not be able to identify it or figure out how to type it; and so on.

But the bottom line is that there is a pretty strict hierarchy in terms of the number of permitted characters and their representational forms:

   UTF-8 / RFC 3629 (any Unicode code point, often with the same
      character able to be represented in different ways)

   Net-UTF-8 / RFC 5198 (NFC-compliant strings; different code
      sequences for exactly the same character are eliminated;
      otherwise the same as UTF-8)

   SASLprep / RFC 4013 (NFKC-compliant strings; all "compatibility
      characters" are eliminated by being mapped into their base
      forms; otherwise the same as Net-UTF-8)

I think that means that...

(1) If you want to maximize interoperability, possibly at the expense of some implementations getting things wrong as I understood Simon to be concerned about, the rule should be

      MUST... SASLprep.

    Period, no exceptions.

(2) If you want to have reasonable odds of implementations that do not support/use SASLprep working, the best answer is

      MUST... Net-UTF-8, SHOULD SASLprep

    or, if you prefer,

      MUST... NFC, SHOULD SASLprep

    which, in a SASL context, will be indistinguishable in practice.

(3) If you think even that strong a constraint is hopeless and want to say something, then what should be said is

      MUST... UTF-8, SHOULD SASLprep but, if not SASLprep, SHOULD NFC

I really don't think (3) is a good idea, but an unqualified

      MUST... UTF-8, SHOULD SASLprep

strikes me as a terrible idea, simply because the same character, coded in different ways through no fault of the user, may not compare equal (the short demonstration after my signature shows exactly that failure).

The difference between (1) and (2) is less significant in practice because, while there are many important exceptions (with those East Asian width variants probably heading the list), the vast majority of compatibility characters are very hard to type in most environments.

And that was really the point I was trying to make.

john
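p.s. To make that comparison failure concrete, a short sketch, again using Python's standard unicodedata module rather than anything the specifications above define. One caveat on the hierarchy: NFKC compliance is only the normalization piece of SASLprep; the full RFC 4013 profile also does mapping and prohibited-character checks that this sketch ignores.

    import unicodedata

    # Two perfectly legal Unicode spellings of the same
    # user-visible u-with-diaeresis:
    a = "\u00FC"      # precomposed
    b = "u\u0308"     # base letter plus combining diaeresis

    print(a == b)                                  # False: raw comparison fails
    print(unicodedata.normalize("NFC", a) ==
          unicodedata.normalize("NFC", b))         # True: NFC repairs it

    # The hierarchy above, expressed as predicates:
    def is_nfc(s):
        return unicodedata.normalize("NFC", s) == s

    def is_nfkc(s):
        return unicodedata.normalize("NFKC", s) == s

    # Every NFKC-compliant string is NFC-compliant, but not vice versa:
    print(is_nfc("\U0001D4EA"), is_nfkc("\U0001D4EA"))   # True False
    print(is_nfc("u\u0308"), is_nfkc("u\u0308"))         # False False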