Kurt,

Just for clarification... I fear that parts of this note are going to be a mini-tutorial on some Unicode subtleties, but one can't understand this issue without them, and I suspect some interested readers don't have that level of understanding.

--On Tuesday, September 15, 2009 15:28 +0100 Kurt Zeilenga <Kurt.Zeilenga@xxxxxxxxx> wrote:

> I strongly oppose such an 'or' as SASLprep and Net-UTF-8 uses
> different Unicode normalization algorithms.

Well, not really.

>...
> RFC 5198 says 'all character sequences SHOULD be normalized
> according to Unicode normalization form "NFC" (see Section 3).'
> RFC 4013 says 'This profile specifies using Unicode
> normalization form KC, as described in Section 4 of
> [StringPrep].'

First, I know that you know this, but to be sure no one is confused by a slight terminology difference: "normalization form KC" and "normalization form NFKC" are exactly the same thing. The latter is a little redundant, but commonly used.

Now, NFKC processing is a proper superset of NFC processing. NFC provides what is called, in Unicode-speak, "canonical composition" -- turning different ways of expressing exactly the same character into a single standard form. For example, applying toNFC to Latin Small Letter U (U+0075) followed by Combining Diaeresis (U+0308) yields Latin Small Letter U with Diaeresis (U+00FC), while applying it to U+00FC yields U+00FC itself. Without NFC (or NFD, but that is another topic), simple string comparisons may fail depending on how a character is entered at the keyboard. That is generally a bad idea.

Unless one permits embedded newline characters in one's "character sequences", the main difference between "just UTF-8" (RFC 3629) and RFC 5198 is that the latter requires NFC-compliant strings; RFC 3629 doesn't require NFC, much less NFKC (see below).

NFKC is a more complex operation, combining canonical composition with "compatibility composition" -- replacement of characters that Unicode has identified as being in the standard for compatibility purposes with their base forms. There is a wide variety of compatibility characters. Some, such as the East Asian width variants, are as surely "the same character" as the U-with-Diaeresis example above. Others are the same (or not) only in context. For example, there is a large number of "mathematical" letter characters that, if used in non-mathematical running text, are simply font variations (consider the relationship between Mathematical Bold Script Small A (U+1D4EA) and Latin Small Letter A (U+0061)) but that, if used in mathematical contexts, are fundamentally different characters, at least according to several mathematical societies and publishers. Applying toNFKC to U+1D4EA yields U+0061, but applying toNFC to the same character yields the character itself. And still others are much more different. However, any string in NFKC form is, by definition, in NFC form.

Now, with the understanding that this is a comment about SASLprep rather than about the current I-D, but that it may be part of Simon's motivation and certainly is part of mine, it is really unclear whether applying the NFKC transformation to things like identifiers and passphrases in security contexts is a good idea.
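For anyone who wants to see those transformations concretely rather than take my word for them, here is a minimal sketch using Python's standard unicodedata module. This is purely my illustration -- none of the RFCs discussed here specify Python or this module -- but the assertions all hold:

    import unicodedata

    # NFC: canonical composition.  U+0075 followed by U+0308
    # composes to the single code point U+00FC.
    decomposed = "\u0075\u0308"   # LATIN SMALL LETTER U + COMBINING DIAERESIS
    composed = "\u00FC"           # LATIN SMALL LETTER U WITH DIAERESIS
    assert unicodedata.normalize("NFC", decomposed) == composed
    assert unicodedata.normalize("NFC", composed) == composed   # already in NFC

    # NFKC adds compatibility composition on top of canonical composition.
    math_a = "\U0001D4EA"         # MATHEMATICAL BOLD SCRIPT SMALL A
    assert unicodedata.normalize("NFC", math_a) == math_a       # NFC leaves it alone
    assert unicodedata.normalize("NFKC", math_a) == "\u0061"    # NFKC folds it to 'a'

    # And any NFKC output is, by definition, already in NFC form.
    s = unicodedata.normalize("NFKC", decomposed + math_a)
    assert unicodedata.normalize("NFC", s) == s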
If I know that I'm going to be in environments in which I know how to type U+1D4EA and know that it can be processed appropriately, it is a nearly ideal element of a string used as a security identifier or passphrase: it, and its many relatives, vastly increase the repertoire of characters available to me, and hence the potential entropy in such a string; an attacker doing shoulder-surfing may not be able to identify it or figure out how to type it; and so on.

But the bottom line is that there is a pretty strict hierarchy in terms of the number of permitted characters and their representational forms:

   UTF-8 / RFC 3629 (any Unicode code point, often with the same
      character able to be represented in different ways)

   Net-UTF-8 / RFC 5198 (NFC-compliant strings; different code
      sequences for exactly the same character are eliminated;
      otherwise the same as UTF-8)

   SASLprep / RFC 4013 (NFKC-compliant strings; all "compatibility
      characters" are eliminated by being mapped into their base
      forms; otherwise the same as Net-UTF-8)

I think that means that...

(1) If you want to maximize interoperability, possibly at the expense of some implementations getting things wrong as I understood Simon to be concerned about, the rule should be

      MUST... SASLprep.

    Period, no exceptions.

(2) If you want to have reasonable odds of implementations that do not support/use SASLprep working, the best answer is

      MUST... Net-UTF-8, SHOULD SASLprep

    or, if you prefer,

      MUST... NFC, SHOULD SASLprep

    which, in a SASL context, will be indistinguishable in practice.

(3) If you think even that strong a constraint is hopeless and want to say something, then what should be said is

      MUST... UTF-8, SHOULD SASLprep but, if not SASLprep, SHOULD NFC

I really don't think (3) is a good idea, but an unqualified

      MUST... UTF-8, SHOULD SASLprep

strikes me as a terrible idea, simply because the same character, coded in different ways through no fault of the user, may not compare equal (the short demonstration after my signature shows exactly that failure).

The difference between (1) and (2) is less significant in practice because, while there are many important exceptions (with those East Asian width variants probably heading the list), the vast majority of compatibility characters are very hard to type in most environments.

And that was really the point I was trying to make.

john
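p.s. To make that comparison failure concrete, a short sketch, again using Python's standard unicodedata module rather than anything the specifications above define. One caveat on the hierarchy: NFKC compliance is only the normalization piece of SASLprep; the full RFC 4013 profile also does mapping and prohibited-character checks that this sketch ignores.

    import unicodedata

    # Two perfectly legal Unicode spellings of the same
    # user-visible u-with-diaeresis:
    a = "\u00FC"      # precomposed
    b = "u\u0308"     # base letter plus combining diaeresis

    print(a == b)                                  # False: raw comparison fails
    print(unicodedata.normalize("NFC", a) ==
          unicodedata.normalize("NFC", b))         # True: NFC repairs it

    # The hierarchy above, expressed as predicates:
    def is_nfc(s):
        return unicodedata.normalize("NFC", s) == s

    def is_nfkc(s):
        return unicodedata.normalize("NFKC", s) == s

    # Every NFKC-compliant string is NFC-compliant, but not vice versa:
    print(is_nfc("\U0001D4EA"), is_nfkc("\U0001D4EA"))   # True False
    print(is_nfc("u\u0308"), is_nfkc("u\u0308"))         # False False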