On Mon, Jun 04, 2018 at 08:56:36AM +1200, Brian E Carpenter wrote:
> On 03/06/2018 18:13, Nico Williams wrote:
> > I disagree.  It's not a black art.  There are some corners where
> > reasonable people can and will disagree (should emoji be allowed in
> > domainnames?), and there will be some cases that require
> > script-specific expertise, and therefore a lot of time to sort out.
> > But I18N is not a dark art at all.  If it were, then how would we
> > get anything done in that space?  The E in IETF stands for
> > Engineering, not Dark Art.
>
> We're in a space where the evaluation of A==B depends on more than
> the bit strings A and B.  Your post about form-insensitive filename
> comparisons is a case in point, although I don't pretend to
> understand it.  OK, we can argue whether that's a dark art or simply
> complicated

Notionally, form-insensitive comparison is just:

  form_insensitive_strcmp(a, b) == memcmp(normalize(a), normalize(b))

Except that in practice one can greatly optimize this to avoid most of
the compute and memory cost of normalization.

To see why, consider comparing my first name as I usually write it
(Nicolas) vs. how it should be written (Nicolás).  The two strings
should compare as not equivalent.  But the two ways to write the
second form (with the ´ precomposed vs. decomposed) should compare as
equivalent (because they are).

One can iterate the codepoints in the two strings and compare them,
with a fast path for the case where pairs of codepoints are byte-wise
equal and a slow path for where they are not.  In most cases all
strings are in the same form (as produced by whatever input methods
were used), in which case equal strings compare as equal without ever
taking the slow path, and non-equivalent strings compare as
non-equivalent with only as many slow-path executions as needed to
reach the point where they first differ.

The slow path basically collects one non-combining codepoint plus
however many combining codepoints follow it, normalizes just that one
character from each string, and memcmp()s the results (see the sketch
below).  The slow path doesn't require allocation either, since there
is a limit to how many codepoints one character can require.  For
non-equivalent, mostly-ASCII strings this is very fast.

Now, this optimization is pretty obvious -- it's the sort of thing
engineers do.  It's not a black art.

Now, of course, deciding what characters to allow or forbid in some
identifier... admits some subjectivity.  E.g., whether to allow emoji
in domainname labels.  I would submit to you that we already permit
emoji in domainname labels: what else are ideographs
(Han/Kanji/whatever) if not pictographs that have been in use for a
very long time?  Is it not snobbish/elitist to say that you can have
any Kanji you want but not a pictograph?  Have you seen how the cool
kids write?  They are really 🆒, sometimes stringing along a sequence
of emojis... much like one might string along Kanji.

> engineering, but really what I need is (a) some generally applicable
> guidelines on protocol design in this area and (b) some people
> willing to review any relevant design work.

I agree.  For example, I've been saying for a long time that
filesystem protocols should not specify a normalization for filenames
and such.  Instead, the filesystems (not the protocol implementations)
should use form-insensitive comparison.  For a protocol like Kerberos,
form-insensitive comparison doesn't work quite as well as just
normalizing as early as possible, so do that.  I mean, normalization
is not really a difficult thing anymore -- the code for it exists now.
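To make the fast path/slow path concrete, here is a minimal sketch in
C.  It is a toy, not a complete implementation: it assumes valid
UTF-8 input, recognizes only one combining codepoint (U+0301), and its
normalize_char() knows only the one composition a + U+0301 -> U+00E1.
A real implementation would get per-character normalization and
combining-class data from a proper Unicode library.

  #include <stdio.h>
  #include <string.h>

  #define MAX_CHAR 32 /* bound on the UTF-8 bytes of one character */

  /* Length of a UTF-8 codepoint from its lead byte (valid UTF-8
   * assumed). */
  static size_t cp_len(unsigned char b)
  {
      if (b < 0x80) return 1;
      if ((b & 0xE0) == 0xC0) return 2;
      if ((b & 0xF0) == 0xE0) return 3;
      return 4;
  }

  /* Toy combining-mark test: recognizes only U+0301 (0xCC 0x81). */
  static int is_combining(const unsigned char *p)
  {
      return p[0] == 0xCC && p[1] == 0x81;
  }

  /* Copy one character -- a base codepoint plus any combining
   * codepoints that follow it -- into buf; return the bytes consumed.
   * Bounded, so no allocation is ever needed. */
  static size_t next_char(const unsigned char *s, unsigned char *buf)
  {
      size_t n = cp_len(s[0]);

      while (n + 2 <= MAX_CHAR && is_combining(s + n))
          n += 2;
      memcpy(buf, s, n);
      return n;
  }

  /* Toy per-character normalization: composes a + U+0301 into the
   * precomposed U+00E1.  Stands in for real single-character NFC. */
  static size_t normalize_char(unsigned char *buf, size_t n)
  {
      if (n == 3 && buf[0] == 'a' && buf[1] == 0xCC && buf[2] == 0x81) {
          buf[0] = 0xC3; /* U+00E1 in UTF-8 */
          buf[1] = 0xA1;
          return 2;
      }
      return n;
  }

  /* Returns 0 if a and b are canonically equivalent, nonzero
   * otherwise.  Fast path: the raw bytes of the current character
   * are identical, so nothing gets normalized.  Slow path: normalize
   * just this one character from each string and memcmp() the
   * results. */
  int form_insensitive_strcmp(const char *sa, const char *sb)
  {
      const unsigned char *a = (const unsigned char *)sa;
      const unsigned char *b = (const unsigned char *)sb;

      while (*a != 0 && *b != 0) {
          unsigned char abuf[MAX_CHAR], bbuf[MAX_CHAR];
          size_t an = next_char(a, abuf);
          size_t bn = next_char(b, bbuf);

          a += an;
          b += bn;
          if (an == bn && memcmp(abuf, bbuf, an) == 0)
              continue;                  /* fast path */
          an = normalize_char(abuf, an); /* slow path */
          bn = normalize_char(bbuf, bn);
          if (an != bn || memcmp(abuf, bbuf, an) != 0)
              return 1;                  /* not equivalent */
      }
      return *a != 0 || *b != 0;         /* proper prefix: not equal */
  }

  int main(void)
  {
      const char *precomposed = "Nicol\xC3\xA1s";  /* á as U+00E1 */
      const char *decomposed  = "Nicola\xCC\x81s"; /* a + U+0301 */

      printf("%d\n", form_insensitive_strcmp(precomposed, decomposed));
      printf("%d\n", form_insensitive_strcmp(precomposed, "Nicolas"));
      return 0;
  }

This prints 0 (equivalent) and then 1 (not equivalent).  For
simplicity the sketch decodes character by character even on the fast
path; comparing raw bytes first and grouping into characters only on a
mismatch, as described above, is a further refinement.  Either way,
normalization runs only where the raw bytes already differ, so
same-form strings never pay for it at all.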
As to what characters to allow/forbid in what contexts, I do think
we're better off getting the Unicode Consortium (who are, arguably,
the real experts in this) to do the heavy lifting there, and then
forbidding as little as possible in our protocols.  Similarly for
mappings.  That's just for starters.

I hope I've illustrated that I18N is not that much of a dark art.

Nico
--