On Mon, Jun 04, 2018 at 08:56:36AM +1200, Brian E Carpenter wrote:
> On 03/06/2018 18:13, Nico Williams wrote:
> > I disagree.  It's not a black art.  There are some corners where
> > reasonable people can and will disagree (should emoji be allowed in
> > domainnames?), and there will be some cases that require
> > script-specific expertise, and therefore a lot of time to sort out.
> > But I18N is not a dark art at all.  If it were, then how would we
> > get anything done in that space?  The E in IETF stands for
> > Engineering, not Dark Art.
>
> We're in a space where the evaluation of A==B depends on more than
> the bit strings A and B.  Your post about form-insensitive filename
> comparisons is a case in point, although I don't pretend to
> understand it.  OK, we can argue whether that's a dark art or simply
> complicated

Notionally, form-insensitive comparison is just:

  form_insensitive_strcmp(a, b) == memcmp(normalize(a), normalize(b))

Except that in practice one can greatly optimize this to avoid most of
the compute and memory cost of normalization.

To see why, consider comparing my first name as I usually write it
(Nicolas) vs. how it should be written (Nicolás).  The two strings
should compare as not equivalent.  But the two ways to write the
second form (with the ´ precomposed vs. decomposed) should compare as
equivalent (because they are).

One can iterate the codepoints in the two strings and compare them,
with a fast path for the case where pairs of codepoints are byte-wise
equal and a slow path for where they are not.  In most cases all
strings are in the same form (as produced by whatever input methods
were used), in which case equal strings compare as equal without ever
taking the slow path, and non-equivalent strings compare as
non-equivalent with only as many slow-path executions as needed to
reach the point where they first differ.

The slow path basically collects one non-combining codepoint plus
however many combining codepoints follow it, normalizes just that one
character from each string, and memcmp()s the results (see the sketch
below).  The slow path doesn't require allocation either, since there
is a limit to how many codepoints one character can require.  For
non-equivalent, mostly-ASCII strings this is very fast.

Now, this optimization is pretty obvious -- it's the sort of thing
engineers do.  It's not a black art.

Now, of course, deciding what characters to allow or forbid in some
identifier... admits some subjectivity.  E.g., whether to allow emoji
in domainname labels.  I would submit to you that we already permit
emoji in domainname labels: what else are ideographs
(Han/Kanji/whatever) if not pictographs that have been in use for a
very long time?  Is it not snobbish/elitist to say that you can have
any Kanji you want but not a pictograph?  Have you seen how the cool
kids write?  They are really 🆒, sometimes stringing along a sequence
of emojis... much like one might string along Kanji.

> engineering, but really what I need is (a) some generally applicable
> guidelines on protocol design in this area and (b) some people
> willing to review any relevant design work.

I agree.  For example, I've been saying for a long time that
filesystem protocols should not specify a normalization for filenames
and such.  Instead, the filesystems (not the protocol implementations)
should use form-insensitive comparison.  For a protocol like Kerberos,
form-insensitive comparison doesn't work quite as well as just
normalizing as early as possible, so do that.  I mean, normalization
is not really a difficult thing anymore -- the code for it exists now.
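To make the fast path/slow path concrete, here is a minimal sketch in
C.  It is a toy, not a complete implementation: it assumes valid
UTF-8 input, recognizes only one combining codepoint (U+0301), and its
normalize_char() knows only the one composition a + U+0301 -> U+00E1.
A real implementation would get per-character normalization and
combining-class data from a proper Unicode library.

  #include <stdio.h>
  #include <string.h>

  #define MAX_CHAR 32 /* bound on the UTF-8 bytes of one character */

  /* Length of a UTF-8 codepoint from its lead byte (valid UTF-8
   * assumed). */
  static size_t cp_len(unsigned char b)
  {
      if (b < 0x80) return 1;
      if ((b & 0xE0) == 0xC0) return 2;
      if ((b & 0xF0) == 0xE0) return 3;
      return 4;
  }

  /* Toy combining-mark test: recognizes only U+0301 (0xCC 0x81). */
  static int is_combining(const unsigned char *p)
  {
      return p[0] == 0xCC && p[1] == 0x81;
  }

  /* Copy one character -- a base codepoint plus any combining
   * codepoints that follow it -- into buf; return the bytes consumed.
   * Bounded, so no allocation is ever needed. */
  static size_t next_char(const unsigned char *s, unsigned char *buf)
  {
      size_t n = cp_len(s[0]);

      while (n + 2 <= MAX_CHAR && is_combining(s + n))
          n += 2;
      memcpy(buf, s, n);
      return n;
  }

  /* Toy per-character normalization: composes a + U+0301 into the
   * precomposed U+00E1.  Stands in for real single-character NFC. */
  static size_t normalize_char(unsigned char *buf, size_t n)
  {
      if (n == 3 && buf[0] == 'a' && buf[1] == 0xCC && buf[2] == 0x81) {
          buf[0] = 0xC3; /* U+00E1 in UTF-8 */
          buf[1] = 0xA1;
          return 2;
      }
      return n;
  }

  /* Returns 0 if a and b are canonically equivalent, nonzero
   * otherwise.  Fast path: the raw bytes of the current character
   * are identical, so nothing gets normalized.  Slow path: normalize
   * just this one character from each string and memcmp() the
   * results. */
  int form_insensitive_strcmp(const char *sa, const char *sb)
  {
      const unsigned char *a = (const unsigned char *)sa;
      const unsigned char *b = (const unsigned char *)sb;

      while (*a != 0 && *b != 0) {
          unsigned char abuf[MAX_CHAR], bbuf[MAX_CHAR];
          size_t an = next_char(a, abuf);
          size_t bn = next_char(b, bbuf);

          a += an;
          b += bn;
          if (an == bn && memcmp(abuf, bbuf, an) == 0)
              continue;                  /* fast path */
          an = normalize_char(abuf, an); /* slow path */
          bn = normalize_char(bbuf, bn);
          if (an != bn || memcmp(abuf, bbuf, an) != 0)
              return 1;                  /* not equivalent */
      }
      return *a != 0 || *b != 0;         /* proper prefix: not equal */
  }

  int main(void)
  {
      const char *precomposed = "Nicol\xC3\xA1s";  /* á as U+00E1 */
      const char *decomposed  = "Nicola\xCC\x81s"; /* a + U+0301 */

      printf("%d\n", form_insensitive_strcmp(precomposed, decomposed));
      printf("%d\n", form_insensitive_strcmp(precomposed, "Nicolas"));
      return 0;
  }

This prints 0 (equivalent) and then 1 (not equivalent).  For
simplicity the sketch decodes character by character even on the fast
path; comparing raw bytes first and grouping into characters only on a
mismatch, as described above, is a further refinement.  Either way,
normalization runs only where the raw bytes already differ, so
same-form strings never pay for it at all.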
As to what characters to allow/forbid in what contexts, I do think
we're better off getting the Unicode Consortium (who are, arguably,
the real experts in this) to do the heavy lifting there, and then
forbidding as little as possible in our protocols.  Similarly for
mappings.  That's just for starters.

I hope I've illustrated that I18N is not that much of a dark art.

Nico
--