--On Saturday, 06 April, 2002 18:44 +0700 Robert Elz <kre@munnari.OZ.AU> wrote:

>     Date:        Fri, 05 Apr 2002 14:41:53 -0500
>     From:        John C Klensin <klensin@jck.com>
>     Message-ID:  <9863660.1018017713@localhost>
>
>   I really hoped to be able to avoid having to do this, yet
>   again...

And I apologize for taking a shortcut because I didn't want to take
the time to pull out the text again.

> | As I read them, what 1034 and 1035 say is that the DNS can
> | accommodate any octets, but that [at least then]
> | currently-specified RRs are restricted to ASCII.
>
> Sorry John, I can't fathom how you could possibly reach that
> conclusion from what is in 1034 & 1035.

See below.

> Eg: from 1035 (section 3.1) ...
>
>    Although labels can contain any 8 bit values in octets that
>    make up a label, it is strongly recommended that labels
>    follow the preferred syntax described elsewhere in this memo,
>    which is compatible with existing host naming conventions.
>
> How much clearer do you want it?

Yes, that is one of the "LDH rules are a good practice" sections I
was referring to.  And

> 1034 is less clear, but (section 3.5 --- note its title)
>
>    3.5. Preferred name syntax
>
>    The DNS specifications attempt to be as general as possible
>    in the rules for constructing domain names.  The idea is
>    that the name of any existing object can be expressed as a
>    domain name with minimal changes.  However, when assigning a
>    domain name for an object, the prudent user will select a
>    name which satisfies both the rules of the domain system and
>    any existing rules for the object, whether these rules are
>    published or implied by existing programs.
>
> "The prudent user will select" ... that is, this is a damn
> good idea, but you don't have to do it if you know what you're
> doing.

And that is the other one.  We are in complete agreement about this.

> | The LDH rule is a good ("best"?) practices one.
>
> It is required (as updated) if the domain name is to be used
> in an e-mail header (which back then, was almost the only
> other formalised place that domain names appeared - other than
> that was all OS specific command/arg stuff).

Hmm.  I'd argue that the other existing protocols count[ed] -- FTP
and Telnet connection specifications, two-level specs in Finger and
Whois, etc. -- but I think this isn't important.

> | But the ASCII rule is a firm requirement.
>
> No it isn't, there is nothing at all which says that.

See below.

> | For evidence of this, temporarily ignore the text
>
> Hmm - ignore what is written, and attempt to infer from
> something else...

As I said, trying to take a shortcut and to provide some additional
evidence of intent/logical consistency in case the sections that I
think are important are read as contradicting those that you cite
above.  I don't read the spec as contradictory, but as specifying
three sets of rules: LDH (recommended), ASCII (required for
"existing" RRs), binary (possible future extensions in new RRs).

> | (although, personally, I think it is clear -- especially in
> | 2.3.3 -- if read carefully)
>
> 2.3.3 is about character case, and I agree, that is a very
> messy area indeed.
>
> | and examine the requirement that, for the defined RRs, labels
> | and queries be compared in a case-insensitive way.
>
> Not quite.  What it says is that ascii labels (ones with the top
> bit clear) must be handled that way; it carefully refrains
> from saying what should be done in other cases - leaving that
> for future definition (which is kind of what this recent work
> has all been about).  However, it clearly allows non-ascii
> labels - it just doesn't specify what they mean, or how to
> interpret them.  That's what needed to be done to allow
> non-ascii names to have some kind of meaning.
More on this below (I'm going to paste in an earlier analysis, with
the text cited, rather than screwing it up by trying to reconstruct),
but I don't see anything in the text that says "if you see the high
bit set, you can assume it is binary and other rules apply; if the
high bit is zero, then it is ASCII and needs case-independent
comparison".  It seems to me that a statement of that general nature
would be needed to justify your assertion above.

I note with interest that even 2181 doesn't seem to include such a
statement as a clarification of what is an "ascii label" and what is
a "binary label".  What it says instead (section 11) is

   Those restrictions aside, any binary string whatever can be used
   as the label of any resource record.  Similarly, any binary string
   can serve as the value of any record that includes a domain name
   as some or all of its value (SOA, NS, MX, PTR, CNAME, and any
   others that may be added).

and, from the abstract, where "the last two" refer to the canonical
name issue and the valid contents of a label:

   The other two are already adequately specified, however the
   specifications seem to be sometimes ignored.  We seek to
   reinforce the existing specifications.

From which I assume that 2181 did not intend to change anything
about 1034/1035 in this area and that its approval by the IESG was
based on that assumption.

> | So I believe that the "future RRs" language with regard to
> | binary labels in 1034 and 1035 must be taken seriously and as
> | normative text: if new RRs (or new classes) are defined, they
> | can be defined as binary and,
>
> Have you actually thought about what you have just said?
> That is, the rules for naming the DNS tree depend upon the
> data that is stored there?
>
> Do you seriously mean that?

I think what I'm suggesting is that the valid content of a given
label depends on the RR type (and Class) with which it is
associated.  One can question the wisdom of that in retrospect, but
that is what the specification says.
> Classes are a whole other mess, that no-one really seems to
> understand, one of those "this might be a good idea" frills,
> that is completely undefined.  It isn't clear whether different
> classes share the same namespace or not (just that they share
> a few RR type definitions).  Classes are essentially extinct.

We could debate that too, but I agree that it does not seem
important at this stage, except, perhaps, to understanding where
binary labels might be used.

> | hence, as not requiring
> | case-insensitive comparisons.  Conversely, within the current
> | set (or at least the historical set at the time of 1034/1035),
> | case-insensitive comparison is required and hence binary must
> | not be permitted.
>
> Case insensitive comparison of ascii is required, what is done
> with the rest is undefined.  To make it meaningful it needs
> to be defined, that I agree with.
>
> One easy (though perhaps not desirable, I don't know) solution
> would be to simply restrict the case insensitive part, as far
> as the DNS is concerned, to ascii only, so that A==a but Á!=á.
> Eventually doing away with case insensitive for all labels
> seems like a good idea to me.

Of course, to support "case insensitivity for ASCII only", it would
be nice to have an algorithmic rule for identifying ASCII.  But
binary labels can, in principle, have octets with the high bit
clear, or even all octets with the high bit clear.  And one does
not want to apply case-insensitive matching to binary labels, no
matter how they are structured.  So, I believe, in logic, that one
needs to know, on a per-RR-type (or, in principle, per-query-type
or other per-query) basis, whether the comparison involves
character comparison (hence case insensitive over at least some of
the octets) or binary comparison (comparison of bits, no fussing).

I can't comment on whether doing away with case insensitivity is a
good idea, since I can argue either for or against it in new
applications.
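To make the "A==a but Á!=á" reading concrete, here is a minimal
sketch (my illustration, not text from any RFC) of ASCII-only
case-insensitive label comparison, and of exactly the failure mode
described above: a "binary" label whose octets all happen to have
the high bit clear still gets case-folded, even though no
case-insensitivity was ever intended for it.

```python
# Hypothetical sketch of the ASCII-only comparison rule: fold only the
# octets 'A'..'Z', compare everything else (including high-bit octets)
# exactly.

def fold_ascii(octet: int) -> int:
    """Lower-case a single octet only if it is an ASCII upper-case letter."""
    if 0x41 <= octet <= 0x5A:  # 'A'..'Z'
        return octet + 0x20
    return octet

def labels_equal(a: bytes, b: bytes) -> bool:
    """ASCII-only case-insensitive comparison of raw label octet strings."""
    if len(a) != len(b):
        return False
    return all(fold_ascii(x) == fold_ascii(y) for x, y in zip(a, b))

# A == a under the ASCII-only rule...
assert labels_equal(b"Example", b"eXAMPLE")
# ...while octets with the high bit set compare exactly (Á != á in Latin-1).
assert not labels_equal(b"\xC1", b"\xE1")
# The problem: a binary label made entirely of low-bit octets is still
# folded, because nothing marks it as binary rather than character data.
assert labels_equal(bytes([0x41, 0x02]), bytes([0x61, 0x02]))
```

Which is the point about needing per-RR-type (or per-query)
knowledge: nothing in the octets themselves tells this function
whether folding is appropriate.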
But the transition from case insensitive comparison to case
sensitive (or binary) comparison would be a very interesting
exercise.

> | Any other reading, I believe, leads immediately either to
> | contradictions or to undefined states within the protocol.
>
> Undefined, yes.  That's not unusual, lots of protocols have
> undefined states.

See below.

> | As an aside, it appears to me that this requirement for
> | case-insensitive comparison is the real problem with "just
> | put UTF-8 in the DNS" approaches.
>
> Not really - what causes the problem is putting more than
> ascii there.  As soon as you permit that, you have to deal with
> all of the issues.  The way the bytes are encoded is
> irrelevant.  One way out of this is to require that the DNS
> always use the "lower" case (whatever that happens to be in
> any particular instance - that is, whenever multiple
> characters are generally assumed to mean the same, pick one as
> the one that must always be used within the DNS) and have the
> resolver enforce it.  Whether the data once chosen is encoded
> in UTF-8 or some other way is irrelevant.

Except that some of those "other ways" may result in octets with
the high bit clear that do not represent ASCII characters
(assuming, as you do, that 1034/1035 require case insensitive
comparison for ASCII only).  In other words, the DNS needs to know
something about the encoding in order to know when to apply case
insensitive comparison (and, potentially, how to do it).

> The problem with doing this is that it requires every resolver
> to be able to handle every possible case mapping (for any
> domain that it may ever encounter - which is all of them, of
> course).  On the other hand, doing it in the server only
> requires the server to understand the case folding rules for
> the actual domain names it serves, not necessarily anyone
> else's (back end caches have a problem either way of course)

I think this is correct.
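The encoding point is easy to demonstrate.  UTF-7 is my example
here, not one from the thread, but it shows an encoding that puts
non-ASCII characters entirely into low-bit octets, so "high bit
clear, therefore ASCII, therefore fold case" both misclassifies the
data and corrupts it:

```python
# Hypothetical illustration: the same string in two encodings.  UTF-8
# reveals its non-ASCII character via a high-bit octet; UTF-7 hides it
# in pure low-bit octets, where naive ASCII folding does real damage.

label = "café"

utf8 = label.encode("utf-8")  # contains a high-bit octet for the é
utf7 = label.encode("utf-7")  # low-bit octets only: b'caf+AOk-'

assert any(o & 0x80 for o in utf8)       # high bit marks UTF-8's é
assert all(o & 0x80 == 0 for o in utf7)  # UTF-7 gives no such signal

# ASCII case-folding treats the base64 run 'AOk' as ordinary letters
# and turns the encoded é into a different code point entirely.
assert utf7.lower() != utf7
```

So a comparison rule keyed only on the high bit cannot be encoding-
agnostic, which is the objection raised above.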
While I haven't done the analysis, my intuition tells me that, if
we are going to go down this path on the server side, we may have
big problems with potentially-recursive RRs like DNAME and NAPTR,
but that is a separate problem.  I hope.

> In any case, these are the issues that a WG that was tasked
> with defining how the DNS should treat non ascii labels should
> be dealing with.  Currently, there's none of that happening -
> idn simply decided not to bother, and make everything inside
> the DNS remain ascii forever.  (Recently I have seen some
> ramblings about long term conversion from ACE to UTF-8 inside
> the DNS - that's a ludicrous prospect that can never happen).

Yes.

> | An existing and conforming
> | implementation has no way to do those required case-insensitive
> | comparisons outside the ASCII range.
>
> No, nor is it required to.

There we probably disagree -- I suggest that the text is at least
ambiguous and might require it.  But, at some level, it isn't
important, because the text clearly prohibits non-ASCII labels in
"existing" RRs.  See below.

> | One supposes that we could modify the protocol to specify that
> | case-insensitive comparisons be made only for octets in the
> | ASCII range, but, unless that were done through an EDNS option,
> | it would be a potentially fairly significant retroactive change.
>
> That's not actually a modification, that's what is currently
> required.

Not my reading of sections you didn't cite.  See below.

That earlier analysis (slightly updated) and the text citations...

[...]

... and that has led me to carefully re-read old text.  That, in
turn, leads to a question: it is very clear that nothing in the DNS
spec requires the LDH rule, even though it appears as "prudent
user" guidance in section 2.3.1 of RFC 1035 (and elsewhere).  But
it appears to me that binary labels are not permitted on the common
RR types, for at least one technically-rational reason, and that
2181 glosses this over a bit.  Specifically...
From RFC1034, section 3.1:

   By convention, domain names can be stored with arbitrary case,
   but domain name comparisons for all present domain functions are
   done in a case-insensitive manner, assuming an ASCII character
   set, and a high order zero bit.  This means that you are free to
   create a node with label "A" or a node with label "a", but not
   both as brothers; you could refer to either using "a" or "A".
   When you receive a domain name or label, you should preserve its
   case.  The rationale for this choice is that we may someday need
   to add full binary domain names for new services; existing
   services would not be changed.

That statement is presumably part of your justification for
assuming that all bets are off if the high order bit is on.
Whether that is important depends on what "existing services"
refers to, plus the problem of binary labels that don't happen to
contain octets with the high bit set and how they are to be
recognized and thence compared.

and RFC1035:

   2.3.3. Character Case

   For all parts of the DNS that are part of the official protocol,
   all comparisons between character strings (e.g., labels, domain
   names, etc.) are done in a case-insensitive manner.  At present,
   this rule is in force throughout the domain system without
   exception.  However, future additions beyond current usage may
   need to use the full binary octet capabilities in names, so
   attempts to store domain names in 7-bit ASCII or use of special
   bytes to terminate labels, etc., should be avoided.

I'm inclined to read "additions beyond current usage" as implying
new RRs or new Classes; you are inclined to read it as having
octets with the high bit on appear in existing RRs.  It seems to me
that this is at least a bit ambiguous, rather than crystal-clear in
the latter direction.
More important, it appears to me to make a clear (and necessary)
distinction between "character strings" and "full binary octet
capabilities" in the DNS, to require case-insensitive comparison
only for the former, and hence to require that one be able to tell
the difference unambiguously.  But the first part of this does say
"For all parts of the DNS that are part of the official protocol,
all comparisons between character strings ... are done in a
case-insensitive manner."  To emphasize, that is "all parts" and
"all comparisons", not "unless you happen to find the high bit
turned on".  So, in the absence of some standards-track document
that changes the comparison rule -- either for new RRs or
retroactively for existing ones -- it seems to me that we are stuck
with it.  And that "However" sentence seems to apply to storage
forms in implementations, not to what is permitted in labels or
queries.

[...]

The requirement to do case-mapping is, I think, ultimately a
restriction on the labels.  It makes it hard for me to think about
the interpretation of a binary label unless the label is specified
as "binary" as part of the description of the associated RR.
Indeed, given the understanding we have gained with the IDN WG
(which PVM probably didn't have when 1034/1035 and their
predecessors were written), it makes it hard for me to think about
anything but ASCII for anything but new RRs (or, potentially,
classes).  Moreover, the text of 1034/1035 appears to me to require
ASCII labels for all RR types specified in those documents, and
maybe even for all new RR types that don't explicitly specify
binary labels.

And, if after going through this, you find that we are still
reading the text differently, I suggest that 2181 probably needs
updating to clarify how one of those "any binary string" labels is
to be interpreted when it appears in queries that require
case-insensitive matching.
Otherwise, we have what appears to be a very strong statement about
what is permitted, with no specification at all about how it is
handled if one appears.  That doesn't seem to me to be the path to
interoperability.

    john