--On Monday, 25 September, 2006 11:07 -0700 Lisa Dusseault <lisa@xxxxxxxxxxxxxxxxx> wrote:

> On Sep 23, 2006, at 2:20 AM, Julian Reschke wrote:
>> But as a matter of fact, draft-newman-i18n-comparator-14
>> doesn't define any collations that would actually solve the
>> Unicode NF issue, so it's not really clear how this helps
>> CalDAV (except that it now uses a framework in which the
>> solution may become available in the future).

Please watch for the final version of draft-iab-idn-nextsteps
(probably to be posted as RFC 4690 within the next few days) and
for draft-???-idnabis-issues-00 (soon). Neither "solves" the NF
problem, but they may help make it more clear why the NF problem
is not solvable in any general case. It can be solved for
particular languages or, more specifically, particular
orthographies of particular languages. But, as long as we are
operating at the "Unicode" level, without specific
language-identifying information transmitted in-band every time
we transmit a string, there is no general solution.

Fortunately, I don't believe that issue is in the critical path
of the base comparator document.

>> Maybe the set of initial registrations in
>> <http://tools.ietf.org/html/draft-newman-i18n-comparator-14#section-9>
>> needs to be extended?
>
> Yes, I agree. That's one of the next steps and why a registry
> was created (so we could do it outside the base comparator
> draft).
>
> Last week Ted & I were discussing whether one could define a
> Very Liberal Comparator (VLC) for general use. It would be
> handy to have one which matched e with E, é, è É... and
> matched o with O, ø, ô, and so on. That would help in
> calendar searching use cases, e.g. a user who can't type in
> accents (or doesn't know how) wants to find the invitation
> from André by searching for "andre". It would probably be
> useful in many other cross-language or unknown-language
> situations too.

Arggh.
The difficulty here is that, for some scripts and languages, a
"decorated" version of a base character can be, by convention or
the natural properties of the language, replaced by an
undecorated version. For others, the decorations actually form
different characters, with different phonetics, names, and other
properties (Unicode character names do not consistently reflect
this distinction, partially because it is impossible to do so).
To pull two examples out of idn-nextsteps, a very liberal
comparator should let "ö" and "ø" match for some well-known
Scandinavian languages, and neither should match "o". But, in
German, "ö" should generally match "oe", but not vice-versa.
Perhaps it should match "o" as well, but that would be
controversial. This set of problems actually gets worse as one
moves outside Roman-derived scripts, even though the
Roman-derived scripts probably have the richest collection of
characters whose glyph forms are decorated versions of other
characters.

So, by all means do this if you think it is useful -- and I
agree that it might be -- but please give it a value-neutral
name, not, e.g., "very liberal". Again, not in the critical
path for the comparator document, IMO.

> Such a comparator would be most useful for exact and substring
> matches; I don't know offhand how it would best do ordering so
> it might not be as useful for ordering.

Ordering is even more tightly tied to the "different character"
versus "decorated version of existing character where the
decorations are semi-optional" distinction. I suggest that
trying to do it in a general way will lead either to frustration
or to serious errors.

> I believe Arnt intends to continue working on this general
> problem, for which I'm very grateful, and other contributions
> would be most welcome.

I very much appreciate his efforts in the area, wish him luck,
and hope that the community will be tolerant of efforts that
meet specific needs and are clearly identified with those needs.
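
As a rough illustration of the "decorated" versus "different
character" distinction discussed above, here is a minimal Python
sketch of the naive approach: decompose to NFD, drop combining
marks, and case-fold. It is not taken from the comparator draft
or from idn-nextsteps; the names strip_marks and loose_equal are
hypothetical, and the only library used is Python's standard
unicodedata module.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Hypothetical comparison key: NFD-decompose, drop combining marks, case-fold."""
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns 0 for characters that are not combining marks.
    base_only = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base_only.casefold()


def loose_equal(a: str, b: str) -> bool:
    """Hypothetical language-blind match built on strip_marks()."""
    return strip_marks(a) == strip_marks(b)


if __name__ == "__main__":
    print(loose_equal("André", "andre"))  # True: the calendar-search case
    print(loose_equal("ö", "o"))          # True: the controversial German case
    print(loose_equal("ö", "oe"))         # False: no rule maps the umlaut to "oe"
    print(loose_equal("ö", "ø"))          # False: U+00F8 has no canonical decomposition
    print(loose_equal("ø", "o"))          # False, for the same reason
```

With no language information available, this key finds "André"
from "andre", but it also equates "ö" with "o", never equates
"ö" with "oe", and treats "ø" as unrelated to "ö". Each of those
outcomes is right for some orthography and wrong for another,
which is why a single general rule does not fall out of the
Unicode data alone.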
    john