> RE: New Last Call: 'Tags for Identifying Languages' to BCP
> Date: 2004-12-10 20:03
> From: "Peter Constable" <petercon@xxxxxxxxxxxxx>
> To: ietf@xxxxxxxx
> CC: ietf-languages@xxxxxxxxxxxxx
>
> Resuming my comments:
>
> > Specifically, the draft allows, and RFC 3066 disallows:
> >   subtags more than 8 octets in length
> >   hyphens which do not separate subtags
> >   zero-length subtags
> >   primary tags which are not purely alphabetic
> > Curiously, all of those are permitted by the draft ABNF
> > production "grandfathered"...
>
> The "grandfathered" production in the current draft is
>
> grandfathered  = ALPHA *(alphanum / "-")
>
> which does permit the sequences claimed by Bruce (except for
> not-purely-alphabetic primary sub-tags),

No exception. "alphanum" is ALPHA / DIGIT. In plain English,
"grandfathered" as defined in the draft is a letter followed by
any number of letters, digits, and/or hyphens, in any order. And
that includes "a123-xyz" as I initially stated, and clearly 1, 2,
and 3 are digits.

> syntactically; but the set of
> tags available for use is constrained by more than the ABNF syntax
> alone: the acceptable productions for each sub-tag must either be taken
> from one of the source standards or be registered.

So what? The ABNF is an expression of the grammar that describes
the set of all valid tags. If the grammar permits "y-----",
"a123-xyz", etc. (and it does) then a parser claiming to parse
language tags as defined by that ABNF must be able to parse such
tags. That is, the ABNF-specified grammar imposes requirements
on parsers. If one doesn't intend to impose such requirements,
the ABNF specifying the grammar should be changed accordingly.

> This is no different
> from RFC 3066, so it is no more of a problem in this specification than
> it was in RFC 3066.

It is a very different grammar from RFC 3066, imposing very
different requirements on parsers.

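To make the parser point concrete, here is a minimal sketch in Python that transliterates the two grammars into regular expressions. The names DRAFT_GRANDFATHERED, RFC3066_TAG and accepted_by are illustrative only and appear in neither document. The RFC 3066 pattern assumes the 1*8ALPHA primary subtag followed by hyphen-separated 1*8(ALPHA / DIGIT) subtags.

```python
import re

# Draft production: grandfathered = ALPHA *(alphanum / "-")
DRAFT_GRANDFATHERED = re.compile(r"[A-Za-z][A-Za-z0-9-]*")

# RFC 3066: Language-Tag = 1*8ALPHA *("-" 1*8(ALPHA / DIGIT))
RFC3066_TAG = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*")


def accepted_by(pattern, tag):
    """Return True if the whole tag matches the given grammar."""
    return pattern.fullmatch(tag) is not None


if __name__ == "__main__":
    # One registered tag, then three strings only the draft grammar admits.
    samples = ["cel-gaulish", "y-----", "a123-xyz", "abcdefghijkl"]
    for tag in samples:
        print(f"{tag!r}: draft={accepted_by(DRAFT_GRANDFATHERED, tag)} "
              f"rfc3066={accepted_by(RFC3066_TAG, tag)}")
```

Under these assumptions a parser built from the draft production has to accept all four samples, while one built from the RFC 3066 grammar accepts only the first.
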
> It might be that the wording in 2.2 could be tightened up to eliminate
> any possible question regarding the source for "grandfathered"
> productions.

It's not a matter of wording; the problem is with the ABNF.

> Alternately, there's no reason why the "grandfathered" production
> shouldn't be composed exactly to match what was used in RFC 3066:
>
> grandfathered = 1*8ALPHA *("-" 1*8alphanum)

I believe I said as much (though one then needs to look at
reduce/reduce conflicts implied by the revised grammar):

> > I see no reason for the ABNF to permit such content as is
> > forbidden by RFC 3066; the actual ABNF for what RFC 3066
> > permits is contained within 3066, and could have been directly
> > incorporated rather than producing a "grandfathered"
> > production which opens up several cans of worms.
>
> This vastly overstates the problem. There is no can of worms unless it
> exists in tags currently available under RFC 3066.

I referred to the additional requirements imposed on parsers,
as well as the unlimited tag length permitted.

> > One defect related to tag length in RFC 3066 is not remedied
> > by the draft; indeed the problem is greatly exacerbated...
>
> > Unfortunately, a language-tag's length is unlimited by
> > the ABNF in RFC 3066 (due to an unlimited number of subtags)
> > and in the draft...
>
> > In particular, tags other than private-use tags with more than
> > two subtags require registration under RFC 3066 rules, and it
> > is a trivial matter to determine the longest registered tag.
> > The draft, however, encourages use of more subtags as well as
> > removal of the subtag length upper bound; moreover, it permits
> > infinite numbers of subtags without requiring registration of
> > the resulting complete tag.
>
> Bruce states incorrectly that there is no upper bound on the length of
> sub-tags.

Look again at the draft definition of "grandfathered" -- now show
me where there's a limit in that production on subtag length.

> His other concern, on the overall length of complete tags, is
> valid, however: in terms of the ABNF syntax for both RFC 3066 and RFC
> 3066bis, infinite-length productions are possible, but RFC 3066 would
> require registration of complete non-private-use tags while RFC 3066bis
> does not.

Yes, and a quick look at the registry reveals that the longest
tag is 11 octets ("cel-gaulish").

> There are three open doors for infinite-length productions in the ABNF
> of the current draft:
>
> - unlimited extlang sub-tags
> - unlimited variant sub-tags
> - the number of possible extensions is limited to 25

The ABNF indicates no such limit.

> , but the length of
> extensions is unlimited

You have missed several others:
1. "privateuse" length is unlimited (either tacked on after
   "lang" etc., or directly as an alternative in "Language-Tag")
2. "grandfathered", which as already discussed permits unlimited
   length.

> We could impose some upper limits on these things; e.g.
>
> Language-Tag = ... *8("-" extlang) ... *8("-" variant) ... 1*25("-"
> extension)

I think you mean *25("-" extension), not 1*25...

> extension = singleton 1*8("-" 2*8alphanum)

That leaves the extension portions' length at up to
25 * (1 + 1 + 8 * 9) = 1850 octets, not taking any other parts
of a tag into account! That's way too long (the RFC 2047 limit
for an encoded-word is 75 octets, including charset tag, some
text, and some syntactic glue in addition to the language tag).
Heck, 1850 octets won't even fit into a maximum length
RFC [2]821/[2]822 message line (998 octets).

> If we also imposed limits on the length of private-use tags and defined
> the grandfathered production in a way that made clear there was an upper
> limit for those, then we could end up eliminating an issue that had
> existed in RFC 3066.

Perhaps; but you have a long way to go to get from 1850+ down
to <64 octets. Even farther to get to something as reasonable
as the current worst-case of 11 octets.

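The 1850-octet figure can be checked with a short Python sketch. The constant and function names are illustrative only. The limits are the ones proposed in the quoted text (25 extensions, each with up to 8 subtags of up to 8 octets), which the draft itself does not contain.

```python
# Limits proposed in the quoted ABNF (assumed here; not present in the draft):
#   Language-Tag = ... *25("-" extension)
#   extension    = singleton 1*8("-" 2*8alphanum)
MAX_EXTENSIONS = 25
MAX_SUBTAGS_PER_EXTENSION = 8
MAX_SUBTAG_OCTETS = 8

# Protocol limits cited in the reply.
ENCODED_WORD_LIMIT = 75    # RFC 2047 encoded-word
MESSAGE_LINE_LIMIT = 998   # RFC 2822 line, excluding CRLF
LONGEST_REGISTERED_TAG = len("cel-gaulish")


def max_extension_octets():
    """Worst-case octets contributed by the extension portion of a tag."""
    # Each extension: "-" + singleton + up to 8 subtags of ("-" + 8 octets).
    per_extension = 1 + 1 + MAX_SUBTAGS_PER_EXTENSION * (1 + MAX_SUBTAG_OCTETS)
    return MAX_EXTENSIONS * per_extension


if __name__ == "__main__":
    worst = max_extension_octets()
    print("extension portion, worst case:", worst)
    print("fits in an encoded-word:", worst <= ENCODED_WORD_LIMIT)
    print("fits in one message line:", worst <= MESSAGE_LINE_LIMIT)
    print("longest registered RFC 3066 tag:", LONGEST_REGISTERED_TAG)
```

The extension portion alone is counted here; language, script, region, variant and private-use subtags would only add to the total.
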
> So, I think Bruce has identified a valid issue here. I personally would
> not have characterized it as greatly exacerbating, though,

IMO, an increase from 11 octets worst-case, which is tolerable for
constructing RFC 2047/2231 encoded-words, to >> 1850 octets, which
exceeds by a large margin what can be handled in a Content-Language
or Accept-Language message header field, constitutes "greatly
exacerbated". YMMV. [N.B. that ">>1850" takes into account your
proposed restrictions which are not present in the draft]

> as the issue
> was present in RFC 3066: private-use tags did not need to be registered
> in RFC 3066, so there was no way an implementation could be written with
> certain knowledge that tags beyond some given length would not be
> encountered.

True, but:
A. implementation is only one issue; protocol design
   (encoded-words and message header fields, for example) is a
   more important issue
B. private-use tags require end-to-end cooperation as a
   prerequisite; given such cooperation, agreement can be
   reached on tag length
C. Per some readings of BCP 82, not only are implementations not
   required to support experimental/private-use values, they are
   expected to erect barriers to their use, requiring users to
   specifically enable use of experimental/private-use
   functionality.

> > I am absolutely shocked that a draft dealing with language
> > lacks an "Internationalization considerations" section as
> > recommended by RFC 2277 (a.k.a. BCP 18).
>
> No more or less shocking than for RFC 3066, regarding which I'm not
> aware of any complaints.

By deferring to the bilingual ISO lists for language and country
tags, 3066 at least provided a minimal degree of
internationalization. By explicitly limiting description fields
to English and restricting the charset to US-ASCII, the draft
proposal takes a giant leap backwards.

> I don't quite understand what the critique is here: what is there to
> internationalize about language tags?

There should probably be a reference (at least informative)
pointing to BCP 18 and mentioning that the language tags defined
provide a means of labeling the language of text, when combined
with other mechanisms (RFC 2047/2231 encoded-words,
Content-Language fields, etc.), to implement the BCP 18
requirement for language tagging. The draft (if/when approved)
should also indicate that it updates BCP 18, which refers to
RFC 1766.

Given the divergence noted above from RFC 3066's use of
multilingual reference lists, the Internationalization
considerations section should include a synopsis of the approach
chosen (viz. to restrict description to English) and the
rationale for that choice (see BCP 18 section 6). [Conversely
the difficulty in writing a convincing rationale might prompt
some effort into producing a less Anglo-centric design.]

> It's
> true that ALPHA and DIGIT are not defined

Non-sequitur aside, those terms are defined in RFC 2234.

> > Perhaps even more disturbing is the content of the "IANA
> > Considerations" section; the draft predicts that certain things
> > will happen ("IANA will"[...]), but doesn't actually direct
> > (e.g. "IANA shall") IANA to do anything. The placement of that
> > section does not correspond to current RFC-Editor guidelines
> > (it should appear after Security Considerations); also on that
> > point, Appendices should precede References.
>
> There is a process issue here, but I have assumed that the authors have
> dealt with IANA on that. Otherwise, these are editorial issues -- "even
> more disturbing" seems to me to be somewhat overstated.

The words "will" and "shall" have very distinct meanings. If one
expects IANA to take specific action, it would be advisable to
clearly specify that IANA shall do so, rather than merely
expressing the hope that IANA will do so.

> > Many of the references are obsolete (e.g. RFCs 1327,
> > 1521)... and at least one reference ([19])
> > gives a bracketed URI rather than the correctly formatted
> > RFC reference.

The RFC-Editor provides an "rfc-ref.txt" file containing the
preferred citations. That file contains an "Obsoleted By" column
that points authors to the current RFC. This isn't rocket
science...

> In fairness to the authors, page-oriented plain text is not exactly
> conducive to authoring and revising a long document,

There's no requirement to author in final publication form. In
fact the original RFC Editor has provided guidelines and
suggestions in the form of RFC 2223, discussing methods that
have been used successfully in publishing quite long documents
(textbooks!). The current RFC-Editor staff has a draft update.

> >    implications (ISO 8601 date format parsing).
>
> As mentioned above, this really is a non-issue.

It's an issue (esp. in light of the finger pointing regarding
accessibility to ISO 639/3166). Admittedly it can be resolved
without much difficulty (but then that could have been done
earlier, couldn't it?).

> > 2. the clear contradiction between the claims about
> >    ABNF compatibility with RFC 3066 and the factual
> >    incompatibility of certain provisions in the grammar.
>
> The main concern was with the "grandfathered" production, but I've shown
> that that is a non-issue.

Again, it is an issue that imposes requirements on language tag
parsers. What you've shown is that the ABNF is not consistent
with what was desired to be expressed, and that makes it an issue
that needs to be addressed.

> The maximal length issue exists just as much
> in RFC 3066 due to private-use tags; it is a technical concern that
> might be worth reviewing in RFC 3066bis, however; but it is not
> insurmountable, and not a new problem.

Private-use carries its own considerable baggage; aside from
that, the draft proposal increases the length of non-private tags
that affect both protocol design and implementations from a worst
case maximum of 11 octets under RFC 3066 registered tags to an
infinite length, which is unworkable for existing Standards Track
protocols (RFC 2822 at Proposed, RFC 2047 at Draft, and RFC 822
at Full Standard, to name a few).