> Date: 2004-12-13 01:05
> From: "Peter Constable" <petercon@xxxxxxxxxxxxx>
> To: ietf-languages@xxxxxxxxxxxxx, ietf@xxxxxxxx

> RFC 3066 does not impose any restrictions on what its replacements might
> do. This is the case with any specification: a given technical
> specification is not a specification of human behaviour and cannot keep
> us from revising the spec or replacing it in any way we may choose.

It's not clear exactly who is meant by "us", but I'll leave that to a
separate message.

It is considered bad practice for a document which obsoletes another
document to depend on the obsoleted document for definitions or other
interpretation of the meaning of what is contained in the successor
document.

> You have mentioned conflict with RFCs 2047 and 2231. RFC 2047 does not
> make reference to language tags. The ABNF of RFC 2231 does not impose
> any limit on the length of language tags. RFC does contain an implicit
> length issue in that it updates RFC 2047, allowing language tags within
> encoded words, but it does not explicitly identify any upper bound on
> the length of language tags. By reading both RFC 2047 and RFC 2231, one
> finds that they assume that a language tag must be at most 64 characters
> long:

You have missed several important and not-so-subtle points. One of
which is that RFC 2231 explicitly amends RFC 2047; it clearly so states
in the first page heading and in the text, and is also indicated in the
RFC Index. Another is that neither uses ABNF; both use EBNF as defined
in RFC 822. More details on specific missed points below:

> - the shortest charset names are 2 characters long (e.g. "IT")

Not all charsets have 2-character names. Not all two-character names
which might be assigned are suitable for MIME use. Where a preferred
MIME name is indicated, that should be used.

> - the minimum encoded-text length is 1 character long

That is strictly only true for text that meets all of the following
conditions:
a) is representable in a specified subset of ANSI X3.4, and therefore
   requires no encoding
b) does not use any encoding, even if unnecessary
c) does not use a charset and character sequence involving shift
   sequences (e.g. as in ISO 2022-like charsets)

It also misses the point that using 76+ octets to represent a single
octet is rather wasteful. Any use of B encoding will require a multiple
of 4 octets of encoded text. Q encoding has some special cases, but
typically requires 3 octets or more.

> An encoded-word must contain at least 11 characters that are not part of
> the language tag and have a total length of no more than 75 characters.
> Therefore, an upper bound on language tags that can be used in an RFC
> 2047/2231 encoded-word production is 64 characters.

That is a best-case upper bound, for text which requires no encoding at
all, one character per encoded-word.

> In many cases, where
> the charset tag or encoding is longer, the upper bound on the length of
> languages tags will be less, but the RFC gives no estimate or indication
> of how much less.

The worst case appears to be the charset named
Extended_UNIX_Code_Fixed_Width_for_Japanese (43 characters), which in
fact uses ISO 2022-like sequences. That is the primary name for that
charset; there is no preferred MIME alias, and the only other alias is
the one specified for printer MIB use. Shifted characters are
represented by two octets, each of which requires encoding. The shift
sequences are 3 octets each, and RFC 2047 requires that an encoded-word
begin and end in unshifted state.
Therefore the minimum amount of encoded text for a single character in
a shifted subset consists of an encoding of: a 3 octet shift sequence
(one of which requires encoding), 2 octets representing the single
character (both requiring encoding), and 3 octets restoring the
unshifted state (one requiring encoding). Using B encoding results in
12 octets of encoded text as a minimum (Q encoding would require a
minimum of 16 octets). So a single character in a shifted subset of
that particular charset, using B encoding, leaves at most 12 octets for
a language tag. As mentioned, use of an encoded-word plus the necessary
whitespace around it to represent a single character is rather
wasteful, so a brief language tag is indicated; fortunately "ja"
suffices for text likely to be used with that charset.

> This is a constraint on an application of RFC 3066; it is not a
> constraint on RFC 3066 itself. It is possible that other applications of
> RFC 3066 may impose limits that may be longer or shorter than that
> imposed by RFC 2047/2231.

Yes, and it is sometimes desirable to transfer text and tag from one
application to another. For example, text in the body of a message can
have its language indicated by a Content-Language header field, where
up to 997 octets are available for a language tag. However, a response
regarding some portion of that message might well indicate the topic of
the response in the response message's Subject field, where
encoded-word limits apply.

> I see no reason why limits must be added as a
> constraint in a revision of RFC 3066.

The primary reason for specifying limits is the proposed removal of the
review/registration process which currently limits the length of
non-private-use tags.

> It would be a good idea, however,
> to point out in section 2.1 of the draft that some applications of this
> specification may impose limits on the length of accepted language tags,
> and perhaps to cite RFC 2231 as an example.

As a general principle, that's fine. However, I would point out that
even experts have been unable to state the limits accurately and
quickly (I neglected the shift sequence constraints in an earlier
analysis, and Peter missed several points about encoded text etc.), so
I do not think it is sufficient merely to state the fact that there are
limits, with or without a pointer to RFC 2231 as an example.

Some indication of the magnitude of worst-case restrictions is at least
advisable. It is also necessary to point out that generous limits
imposed by a particular portion of a protocol, coupled with reuse of
the text and tag in a different portion of that protocol or in a
different protocol, may impose shorter limits that are not readily
apparent from consideration of only a subset of any single protocol.
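P.S. Since the arithmetic above proved easy to get wrong, here is a
rough Python sketch of it. It assumes the RFC 2231 form
"=?charset*language?encoding?encoded-text?=" and the 75-character
encoded-word limit of RFC 2047; the function and constant names are my
own for illustration and come from neither RFC. It is only a check on
the figures quoted in this message, not a general encoder.

```python
import math

MAX_ENCODED_WORD = 75  # RFC 2047 limit on the length of one encoded-word

# Fixed delimiters in "=?charset*language?encoding?encoded-text?=":
# "=?" (2) + "*" (1) + "?" (1) + encoding letter (1) + "?" (1) + "?=" (2)
FIXED_OVERHEAD = 8


def b_encoded_length(raw_octets: int) -> int:
    """Length of B (base64) encoded text: 4 characters per 3 raw octets."""
    return 4 * math.ceil(raw_octets / 3)


def q_encoded_length(literal_octets: int, escaped_octets: int) -> int:
    """Length of Q encoded text: each escaped octet becomes "=XX"."""
    return literal_octets + 3 * escaped_octets


def max_language_tag_length(charset: str, encoded_text_length: int) -> int:
    """Characters left for the language tag in a single encoded-word."""
    return (MAX_ENCODED_WORD - FIXED_OVERHEAD
            - len(charset) - encoded_text_length)


# Best case: 2-character charset name, 1 character of unencoded text.
best = max_language_tag_length("IT", 1)

# Worst case discussed above: 3-octet shift + 2-octet character
# + 3-octet return to the unshifted state = 8 raw octets.
WORST_CHARSET = "Extended_UNIX_Code_Fixed_Width_for_Japanese"
worst_b = max_language_tag_length(WORST_CHARSET, b_encoded_length(3 + 2 + 3))

# Q encoding: 4 octets pass literally, 4 need "=XX" escapes.
worst_q = max_language_tag_length(WORST_CHARSET, q_encoded_length(4, 4))

print(best, worst_b, worst_q)
```

The three values correspond to the 64-character best case, the
12-octet B-encoding worst case, and what remains if Q encoding (16
octets of encoded text) is used instead.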