Re: [idn] Re: 7 bits forever!

Gunnar Lindberg <lindberg@cdg.chalmers.se> · Sat, 6 Apr 2002 11:55:50 +0200 (MET DST)

Case insensitive lookup is likely to be a similar problem regardless
of whether you use "raw" UTF-8 or you encode things in 7-bit ASCII.

To me the real issue, however, seems to be in the applications, i.e.
in the resolvers (libresolv.*/gethostbyname()/gethostbyaddr()) and
all the places where the resulting names are to be processed:
    netdb.h:
    char *h_name;
    char **h_aliases;

It would be, er, hm, unwise to assume that all applications would do
The Right Thing (TM) when the "char *" starts to carry UTF-8 data
(which is 8-bit, unless ASCII equivalent). And, do remember that
these strings can come to you without you asking for them, it's NOT
www URLs only - just think of spam "From nobody@räksmörgås-reklam.se"
(SE 8859-1, although nicely encoded in UTF-8, 8-bit).
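
To make the concern concrete, here is a minimal sketch of the kind of
check an application would need before handing resolver results to
code that assumes 7-bit names. The helper names name_is_ascii() and
hostent_is_ascii() are made up for this illustration; only struct
hostent and its h_name/h_aliases fields come from netdb.h.

```c
#include <netdb.h>   /* struct hostent */
#include <stddef.h>  /* NULL */

/* Hypothetical helper: returns 1 if every octet of name is 7-bit
 * ASCII, 0 as soon as an octet with the high bit set is found. */
int name_is_ascii(const char *name)
{
    const unsigned char *p = (const unsigned char *)name;

    for (; *p != '\0'; p++) {
        if (*p & 0x80)
            return 0;
    }
    return 1;
}

/* Hypothetical helper: applies the same check to the official name
 * and to every alias in a resolver result. */
int hostent_is_ascii(const struct hostent *h)
{
    if (h->h_name != NULL && !name_is_ascii(h->h_name))
        return 0;

    if (h->h_aliases != NULL) {
        for (char **a = h->h_aliases; *a != NULL; a++) {
            if (!name_is_ascii(*a))
                return 0;
        }
    }
    return 1;
}
```

Every program that logs, compares or stores these strings would need
something along these lines, or a real UTF-8 aware replacement, which
is the extra work I am worried about.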

Now, the IT industry has been declining the last few months, but I
doubt it would do good to create such a massive amount of extra work.

MIME QP - agreed it's a kludge, agreed it's ugly - did save us from
a lot of problems (i.e. no Flag Day needed). Don't forget that.

	Gunnar Lindberg

>From listadm@loki.ietf.org  Fri Apr  5 22:21:17 2002
>Date: Fri, 05 Apr 2002 14:41:53 -0500
>From: John C Klensin <klensin@jck.com>
>To: Robert Elz <kre@munnari.OZ.AU>
>cc: ietf@IETF.ORG
>Subject: Re: [idn] Re: 7 bits forever! 
>Message-ID: <9863660.1018017713@localhost>
>In-Reply-To: <2228.1018021996@brandenburg.cs.mu.OZ.AU>
>References:  <2228.1018021996@brandenburg.cs.mu.OZ.AU>

>--On Friday, 05 April, 2002 22:53 +0700 Robert Elz
><kre@munnari.OZ.AU> wrote:

>>     Date:        Thu, 4 Apr 2002 09:50:01 -0800 (PST)
>>     From:        "Gary E. Miller" <gem@rellim.com>
>>     Message-ID:
>> <Pine.LNX.4.44.0204040931110.10828-100000@catbert.rellim.com>
>> 
>>   | Maybe it can, but that does not make it right.
>>   | 
>>   | RFC 1035 "DOMAIN NAMES - IMPLEMENTATION AND SPECIFICATION"
>>   | 
>>   | 2.3.1
>> 
>> If you actually go read that section, carefully, instead of
>> just quoting the part from it that everyone notices first, you
>> will see that it says something quite different from what you
>> think it does.
>> 
>> You need to read the part of the section that appears on the
>> preceding page of the formatted RFC...
>> 
>> Or see (part of) rfc2181 for a longer version of this.

>Actually, having read that section, and several other sections,
>_very_ carefully in recent months, I think 2181 is contradictory
>at best, and possibly seriously wrong, on this point.

>As I read them, what 1034 and 1035 say is that the DNS can
>accommodate any octets, but that [at least then]
>currently-specified RRs are restricted to ASCII.  The LDH rule
>is a good ("best"?) practices one.  It is the LDH rule that RFC
>1123 modified slightly.  And it is quite correct to assert that
>the LDH rule is not a _DNS_ requirement.

>But the ASCII rule is a firm requirement.  For evidence of this,
>temporarily ignore the text (although, personally, I think it is
>clear -- especially in 2.3.3-- if read carefully) and examine
>the requirement that, for the defined RRs, labels and queries be
>compared in a case-insensitive way.  For ASCII, that is a
>well-defined operation, one that can be performed by doing the
>comparison under a bit mask.  For other scripts, as the IDN WG
>discovered, "case insensitive comparison" is typically not
>completely well-defined, often involves complex tables and/or
>knowledge of local context, and is sometimes quite controversial
>as to what is intended.
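
The comparison under a bit mask mentioned above can be sketched as
follows. This is an illustration only, not code from any particular
name server; mask_equal() is a made-up name, and the choice of mask
(clearing bit 0x20, which is what separates 'a' from 'A' in ASCII) is
an assumption about what such an implementation would do.

```c
#include <stddef.h>

/* Hypothetical sketch: compare two labels octet by octet with bit
 * 0x20 cleared on both sides. Returns 1 on a match, 0 otherwise. */
int mask_equal(const char *a, const char *b)
{
    const unsigned char *p = (const unsigned char *)a;
    const unsigned char *q = (const unsigned char *)b;

    while (*p != '\0' && *q != '\0') {
        if ((*p & 0xDF) != (*q & 0xDF))
            return 0;
        p++;
        q++;
    }
    /* A match requires both strings to end at the same position. */
    return *p == *q;
}
```

For letters, digits and hyphen this folds case and nothing else. On
UTF-8 octets the same mask treats U+00F7 (octets C3 B7) and U+00D7
(octets C3 97) as equal, although they are the division and
multiplication signs, and it fails to match U+044F (D1 8F) with its
capital U+042F (D0 AF).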

>So I believe that the "future RRs" language with regard to
>binary labels in 1034 and 1035 must be taken seriously and as
>normative text: if new RRs (or new classes) are defined, they
>can be defined as binary and, hence, as not requiring
>case-insensitive comparisons.  Conversely, within the current
>set (or at least the historical set at the time of 1034/1035),
>case-insensitive comparison is required and hence binary must
>not be permitted.

>Any other reading, I believe, leads immediately either to
>contradictions or to undefined states within the protocol.

>As an aside, it appears to me that this requirement for
>case-insensitive comparison is the real problem with "just put
>UTF-8 in the DNS" approaches.  An existing and conforming
>implementation has no way to do those required case-insensitive
>comparisons outside the ASCII range.  Worse, if it does those
>comparisons by bit-masking (which would be conforming today),
>there is a risk of its getting rather bizarre errors (of either
>matching or not matching) on characters outside the ASCII range.
>One supposes that we could modify the protocol to specify that
>case-insensitive comparisons be made only for octets in the
>ASCII range, but, unless that were done through an EDNS option,
>it would be a potentially fairly significant retroactive change.

>    john
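
For reference, the alternative mentioned in the last quoted paragraph,
folding case only for octets in the ASCII range and comparing all
other octets exactly, might look like the sketch below. The function
names are made up for this illustration, and nothing here is taken
from an existing implementation.

```c
#include <stddef.h>

/* Hypothetical helper: map 'A'..'Z' to 'a'..'z' and leave every
 * other octet, including all octets >= 0x80, unchanged. */
static unsigned char ascii_fold(unsigned char c)
{
    if (c >= 'A' && c <= 'Z')
        return (unsigned char)(c + ('a' - 'A'));
    return c;
}

/* Hypothetical sketch: case-insensitive for ASCII letters only,
 * exact for everything else. Returns 1 on a match, 0 otherwise. */
int ascii_only_equal(const char *a, const char *b)
{
    const unsigned char *p = (const unsigned char *)a;
    const unsigned char *q = (const unsigned char *)b;

    while (*p != '\0' && *q != '\0') {
        if (ascii_fold(*p) != ascii_fold(*q))
            return 0;
        p++;
        q++;
    }
    /* A match requires both strings to end at the same position. */
    return *p == *q;
}
```

With this rule two names that differ only in the case of a non-ASCII
letter are simply different names: the result is predictable, but the
case folding users expect outside ASCII would have to happen somewhere
before the lookup.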