Re: [idn] Re: 7 bits forever!

--On Saturday, 06 April, 2002 18:44 +0700 Robert Elz
<kre@munnari.OZ.AU> wrote:

>     Date:        Fri, 05 Apr 2002 14:41:53 -0500
>     From:        John C Klensin <klensin@jck.com>
>     Message-ID:  <9863660.1018017713@localhost>
> 
> I really hoped to be able to avoid having to do this, yet
> again...

And I apologize for taking a shortcut because I didn't want to
take the time to pull out the text again.  
> 
>   | As I read them, what 1034 and 1035 say is that the DNS can
>   | accommodate any octets, but that [at least then]
>   | currently-specified RRs are restricted to ASCII.
> 
> Sorry John, I can't fathom how you could possibly reach that
> conclusion from what is in 1034 & 1035.

See below.
> 
> Eg: from 1035 (section 3.1) ...
> 
> 	Although labels can contain any 8 bit values in octets that
> 	make up a label, it is strongly recommended that labels
> 	follow the preferred syntax described elsewhere in this memo,
> 	which is compatible with existing host naming conventions.
> 
> How much clearer do you want it?

Yes, that is one of the "LDH rules are a good practice" sections
I was referring to.  And

> 1034 is less clear, but (section 3.5 --- note its title)
> 
>   3.5. Preferred name syntax
> 
>   The DNS specifications attempt to be as general as possible in
>   the rules for constructing domain names.  The idea is that the
>   name of any existing object can be expressed as a domain name
>   with minimal changes.  However, when assigning a domain name
>   for an object, the prudent user will select a name which
>   satisfies both the rules of the domain system and any existing
>   rules for the object, whether these rules are published or
>   implied by existing programs.
> 
> "The prudent user will select" ...  that is, this is a damn
> good idea, but you don't have to do it if you know what you're
> doing.

And that is the other one.  We are in complete agreement about
this.

>   | The LDH rule is a good ("best"?) practices one.
> 
> It is required (as updated) if the domain name is to be used
> in an e-mail header (which back then, was almost the only
> other formalised place that domain names appeared - other than
> that was all OS specific command/arg stuff).

Hmm.  I'd argue that the other existing protocols count[ed] --
FTP and Telnet connection specifications, two-level specs in
Finger and Whois, etc., but I think this isn't important.

>   | But the ASCII rule is a firm requirement.
> 
> No it isn't, there is nothing at all which says that.

See below.
 
>   | For evidence of this, temporarily ignore the text
> 
> Hmm - ignore what is written, and attempt to infer from
> something else...

As I said, trying to take a shortcut and to provide some
additional evidence of intent/logical consistency in case the
sections that I think are important are read as contradicting
those that you cite above.  I don't read the spec as
contradictory, but as specifying three sets of rules: LDH
(recommended), ASCII (required for "existing" RRs), binary
(possible future extensions in new RRs).
 
>   | (although, personally, I think it is clear -- especially in
>   | 2.3.3 -- if read carefully)
> 
> 2.3.3 is about character case, and I agree, that is a very
> messy area indeed.
> 
>   | and examine the requirement that, for the defined RRs, labels
>   | and queries be compared in a case-insensitive way.
> 
> Not quite.   What it says is that ascii labels (ones with the top
> bit clear) must be handled that way, it carefully refrains
> from saying what should be done in other cases - leaving that
> for future definition (which is kind of what this recent work
> has all been about).   However, it clearly allows non-ascii
> labels - it just doesn't specify what they mean, or how to
> interpret them.   That's what needed to be done to allow
> non-ascii names to have some kind of meaning.

More on this below (I'm going to paste in an earlier analysis,
with the text cited, rather than screwing it up by trying to
reconstruct) but I don't see anything in the text that says "if
you see the high bit set, you can assume it is binary and other
rules apply; if the high bit is zero, then it is ASCII and needs
case-independent comparison".  It seems to me that a statement
of that general nature would be needed to justify your assertion
above.  I note with interest that even 2181 doesn't seem to
include such a statement as a clarification of what is an "ascii
label" and what is a "binary label".  What it says instead
(section 11) is

		Those restrictions aside, any binary string whatever can
		be used as the label of any resource record.  Similarly,
		any binary string can serve as the value of any record
		that includes a domain name as some or all of its value
		(SOA, NS, MX, PTR, CNAME, and any others that may be
		added).

and, from the abstract, where "the last two" refer to the
canonical name issue and the valid contents of a label:

		The other two are already adequately specified, however
		the specifications seem to be sometimes ignored.  We
		seek to reinforce the existing specifications.

From which I assume that 2181 did not intend to change anything
about 1034/1035 in this area and that its approval by the IESG
was based on that assumption.

>   | So I believe that the "future RRs" language with regard to
>   | binary labels in 1034 and 1035 must be taken seriously and as
>   | normative text: if new RRs (or new classes) are defined, they
>   | can be defined as binary and,
> 
> Have you actually thought about what you have just said?
> That is, the rules for naming the DNS tree depend upon the
> data that is stored there?
> 
> Do you seriously mean that?

I think what I'm suggesting is that the valid content of a given
label depends on the RR type (and Class) with which it is
associated.  One can question the wisdom of that in retrospect,
but that is what the specification says.

> Classes are a whole other mess, that no-one really seems to
> understand, one of those "this might be a good idea" frills,
> that is completely undefined. It isn't clear whether different
> classes share the same namespace or not (just that they share
> a few RR type definitions).   Classes are essentially extinct.

We could debate that too, but I agree that it does not seem
important at this stage, except, perhaps, to understanding where
binary labels might be used.

>   | hence, as not requiring case-insensitive comparisons.
>   | Conversely, within the current set (or at least the historical
>   | set at the time of 1034/1035), case-insensitive comparison is
>   | required and hence binary must not be permitted.
> 
> Case insensitive comparison of ascii is required, what is done
> with the rest is undefined.   To make it meaningful it needs
> to be defined, that I agree with.
> 
> One easy (though perhaps not desirable, I don't know) solution
> would be to simply restrict the case insensitive part, as far
> as the DNS is concerned, to ascii only, so that A==a but Á!=á.
> Eventually doing away with case insensitive for all labels
> seems like a good idea to me.

Of course, to support "case insensitivity for ASCII only", it
would be nice to have an algorithmic rule for identifying ASCII.
But binary labels can, in principle, have octets with the high
bit clear, or even all octets with the high bit clear.  And one
does not want to apply case-insensitivity matching to binary
labels, not matter how they are structured.  So, I believe, in
logic, that one needs to know, on a per-RR type (or, in
principle, per-query-type or other per-query) basis, whether the
comparison involves character comparison (hence case insensitive
over at least some of the octets) or binary comparison
(comparison of bits, no fussing).
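
To make that concrete, here is a rough sketch (mine, in C, purely
illustrative and not taken from any implementation) of what a
comparison routine would have to look like under that reading: the
caller has to be told whether the label is binary, because nothing
in the octets themselves can tell it.

    #include <stddef.h>
    #include <string.h>

    /*
     * Sketch only.  The point is the is_binary argument: it has to
     * be supplied from outside (per RR type, per query, however the
     * protocol eventually decides), because a "binary" label whose
     * octets all happen to fall below 0x80 is indistinguishable, by
     * inspection, from a character label.
     */
    static int
    labels_equal(const unsigned char *a, const unsigned char *b,
                 size_t len, int is_binary)
    {
        size_t i;

        if (is_binary)
            return memcmp(a, b, len) == 0;  /* bit-for-bit, no folding */

        for (i = 0; i < len; i++) {
            unsigned char ca = a[i], cb = b[i];
            /* fold only ASCII letters; leave every other octet alone */
            if (ca >= 'A' && ca <= 'Z') ca += 'a' - 'A';
            if (cb >= 'A' && cb <= 'Z') cb += 'a' - 'A';
            if (ca != cb)
                return 0;
        }
        return 1;
    }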

I can't comment on whether doing away with case insensitivity is
a good idea, since I can argue either for or against it in new
applications.  But the transition from case insensitive
comparison to case sensitive (or binary) comparison would be a
very interesting exercise.

>   | Any other reading, I believe, leads immediately either to
>   | contradictions or to undefined states within the protocol.
> 
> Undefined, yes.   That's not unusual, lots of protocols have
> undefined states.

See below.

>   | As an aside, it appears to me that this requirement for
>   | case-insensitive comparison is the real problem with "just put
>   | UTF-8 in the DNS" approaches.
> 
> Not really - what causes the problem is putting more than
> ascii there. As soon as you permit that, you have to deal with
> all of the issues. The way the bytes are encoded is
> irrelevant.   One way out of this is to require that the DNS
> always use the "lower" case (whatever that happens to be in
> any particular instance - that is, whenever multiple
> characters are generally assumed to mean the same, pick one as
> the one that must always be used within the DNS) and have the
> resolver enforce it.   Whether the data once chosen is encoded
> in UTF-8 or some other way is irrelevant.

Except that some of those "other ways" may result in octets with
the high bit clear that do not represent ASCII characters
(assuming, as you do, that 1034/1035 require case insensitive
comparison for ASCII only).  In other words, the DNS needs to
know something about the encoding in order to know when to apply
case insensitive comparison (and, potentially, how to do it).
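
A toy example of the failure mode (UTF-16 here is only a stand-in
for "some encoding that reuses octet values below 0x80"; ISO 2022
has the same property):

    #include <stdio.h>

    /* U+0141, LATIN CAPITAL LETTER L WITH STROKE, in UTF-16BE */
    static const unsigned char l_stroke[2] = { 0x01, 0x41 };

    int main(void)
    {
        unsigned char folded[2];
        int i;

        /* blind per-octet "ASCII only" case folding */
        for (i = 0; i < 2; i++) {
            unsigned char c = l_stroke[i];
            folded[i] = (c >= 'A' && c <= 'Z')
                        ? (unsigned char)(c + ('a' - 'A')) : c;
        }

        /*
         * folded is now { 0x01, 0x61 }, i.e. U+0161, LATIN SMALL
         * LETTER S WITH CARON -- an unrelated character, not the
         * lower case of U+0141.  The octet 0x41 looked like ASCII
         * "A" but was not one.
         */
        printf("%02x %02x -> %02x %02x\n",
               l_stroke[0], l_stroke[1], folded[0], folded[1]);
        return 0;
    }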

> The problem with doing this is that it requires every resolver
> to be able to handle every possible case mapping (for any
> domain that it may ever encounter - which is all of them, or
> course).   On the other hand, doing it in the server only
> requires the server to understand the case folding rules for
> the actual domain names it serves, not necessarily anyone
> else's (back end caches have a problem either way of course)

I think this is correct.   While I haven't done the analysis, my
intuition tells me that, if we are going to go down this path on
the server side, we may have big problems with potentially-recursive
RRs like DNAME and NAPTR, but that is a separate problem.  I
hope.
 
> In any case, these are the issues that a WG that was tasked
> with defining how the DNS should treat non ascii labels should
> be dealing with. Currently, there's none of that happening -
> idn simply decided not to bother, and make everything inside
> the DNS remain ascii forever. (Recently I have seen some
> ramblings about long term conversion from ACE to UTF-8 inside
> the DNS - that's a ludicrous prospect that can never happen).

Yes.
 
>   | An existing and conforming implementation has no way to do
>   | those required case-insensitive comparisons outside the ASCII
>   | range.
> 
> No, nor is it required to.

There we probably disagree -- I suggest that the text is at
least ambiguous and might require it.  But, at some level, it
isn't important, because the text clearly prohibits non-ASCII
labels in "existing" RRs.  See below.

>   | One supposes that we could modify the protocol to specify that
>   | case-insensitive comparisons be made only for octets in the
>   | ASCII range, but, unless that were done through an EDNS option,
>   | it would be a potentially fairly significant retroactive change.
> 
> That's not actually a modification, that's what is currently
> required.

Not my reading of sections you didn't cite.  See below.


That earlier analysis (slightly updated) and the text
citations...

[...]

... and that has led me to carefully re-read old text.
That, in turn, leads to a question: it is very clear that
nothing in the DNS spec requires the LDH rule, even though it
appears as "prudent user" guidance in section 2.3.1 of RFC 1035
(and elsewhere). But it appears to me that binary labels are not
permitted on the common RR types, for at least one
technically-rational reason, and that 2181 glosses this over a
bit.

Specifically...

From RFC1034, section 3.1

		By convention, domain names can be stored with arbitrary
		case, but domain name comparisons for all present domain
		functions are done in a case-insensitive manner,
		assuming an ASCII character set, and a high order zero
		bit.  This means that you are free to create a node with
		label "A" or a node with label "a", but not both as
		brothers; you could refer to either using "a" or "A".
		When you receive a domain name or label, you should
		preserve its case.  The rationale for this choice is
		that we may someday need to add full binary domain names
		for new services; existing services would not be
		changed.

That statement is presumably part of your justification for
assuming that all bets are off if the high order bit is on.
Whether that is important depends on what "existing services"
refers to, plus the problem of binary labels that don't happen
to contain octets with the high bit set and how they are to be
recognized and thence compared.


and RFC1035:

		2.3.3. Character Case
		
		For all parts of the DNS that are part of the official
		protocol, all comparisons between character strings
		(e.g., labels, domain names, etc.) are done in a
		case-insensitive manner.  At present, this rule is in
		force throughout the domain system without exception.
		However, future additions beyond current usage may need
		to use the full binary octet capabilities in names, so
		attempts to store domain names in 7-bit ASCII or use of
		special bytes to terminate labels, etc., should be
		avoided.

I'm inclined to read "additions beyond current usage" as
implying new RRs or new Classes; you are inclined to read it as
permitting octets with the high bit set to appear in existing
RRs.  It
seems to me that this is at least a bit ambiguous, rather than
crystal-clear in the latter direction.  More important, it
appears to me to make a clear (and necessary) distinction
between "character strings" and "full binary octet capabilities"
in the DNS, to require case-insensitive comparison only for the
former, and hence to require that one be able to tell the
difference unambiguously.

But the first part of this does say "For all parts of the DNS
that are part of the official protocol, all comparisons between
character strings ...  are done in a case-insensitive manner."
To emphasize, that is "all parts" and "all comparisons", not
"unless you happen to find the high bit turned on".  So, in the
absence of some standards-track document that changes the
comparison rule -- either for new RRs or retroactively for
existing ones -- it seems to me that we are stuck with it.  And
that "However" sentence seems to apply to storage forms in
implementations, not to what is permitted in labels or queries.

[...]
The requirement to do case-mapping is, I think, ultimately a
restriction on the labels.  It makes it hard for me to think
about the interpretation of a binary label unless the label is
specified as "binary" as part of the description of the
associated RR.  Indeed, given the understanding we have gained with
the IDN WG (which PVM probably didn't have when 1034/1035 and
their predecessors were written), it makes it hard for me to
think about anything but ASCII for anything but new RRs (or,
potentially, classes).  Moreover, the text of 1034/1035 appears
to me to require ASCII labels for all RR types specified in
those documents, and maybe even for all new RR types that don't
explicitly specify binary labels.

And, if after going through this, you find that we are still
reading the text differently, I suggest that 2181 probably needs
updating to clarify how one of those "any binary string" labels
is to be interpreted when it appears in queries that require
case-insensitive matching.  Otherwise, we have what appears to
be a very strong statement about what is permitted with no
specification at all about how it is handled if one appears.
That doesn't seem to me to be the path to interoperability.

     john

