> Date: 2005-01-02 19:47
> From: "Addison Phillips [wM]" <aphillips@xxxxxxxxxxxxxx>
>
> > It would be entirely possible for "en-Latn-US-boont" to be
> > registered under the terms of RFC 3066.
> >
> > But it hasn't been. No RFC 3066 parser will therefore find
> > that complete tag in its list of IANA registered tags, nor
> > will it be able to interpret "Latn" as an ISO 3166 2-letter
> > country code.
>
> RFC 3066 parsers already should not interpret "Latn" as an
> ISO 3166 region code. It isn't two letters long.

Correct. The point is that "en-Latn-US-boont" is neither a registered
IANA tag nor a tag whose first two subtags conform to RFC 3066, and it
is therefore not a *valid* language tag. If I use such a string in
place of a language-tag in an RFC 2047 encoded-word-like construct and
feed it to a validating parser, I am informed that:

1. there is no valid language-tag;
2. a language tag having 3-8 characters in the second subtag would
   have to be registered;
3. RFC 1958 section 3.12 specifies use of registered names;
4. RFC 2047 requires that a sequence beginning with =? and ending
   with ?= be a valid encoded-word (which is not the case here,
   because an invalid language-tag-like string appears where a valid
   language-tag is supposed to appear).

Now you might say that that is a pedantic interpretation of the
respective RFCs, and you'd be right -- that's the point of a
validating parser. You might then ask "what about a non-validating
parser?" My brief answer is that the results would in general be
unpredictable (which is to say that there is a high risk of failure
to interoperate). More specifically, there are a number of things
that an individual implementation might or might not do. It might or
might not try to decode the alleged encoded-word for presentation
(bear in mind, in this and the following discussion, that
"presentation" might include a screen reader (text-to-speech) for the
visually impaired).
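For concreteness, the checks a validating RFC 3066 parser applies can
be sketched roughly as follows. This is an illustrative sketch only:
the registry and code-list sets below are small stand-ins for the real
IANA registry and the ISO 639 / ISO 3166 lists, not actual snapshots.

```python
import re

# Sketch of a validating RFC 3066 parser's checks. The sets below are
# illustrative stand-ins, NOT the real IANA/ISO lists.
WELL_FORMED = re.compile(r"^[A-Za-z]{1,8}(-[A-Za-z0-9]{1,8})*$")
IANA_REGISTERED = {"en-boont", "sl-rozaj"}   # assumed partial snapshot
ISO_639 = {"en", "sl", "sr"}                 # assumed partial snapshot
ISO_3166 = {"US", "SI", "CS"}                # assumed partial snapshot

def is_valid(tag: str) -> bool:
    if not WELL_FORMED.match(tag):
        return False                 # not even well-formed per the ABNF
    if tag.lower() in IANA_REGISTERED:
        return True                  # the complete tag is registered
    subtags = tag.split("-")
    if subtags[0].lower() not in ISO_639:
        return False                 # primary subtag must be an ISO 639 code
    if len(subtags) == 1:
        return True                  # bare language code
    # Only language plus 2-letter country code is valid without
    # registration; anything longer, or a 3-8 character second subtag
    # such as "Latn", requires IANA registration.
    return len(subtags) == 2 and subtags[1].upper() in ISO_3166
```

Under this sketch, "en-Latn-US-boont" is well-formed but neither
registered nor of the language-country form, hence invalid, which is
exactly the complaint described above.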
If it does not, the raw characters comprising the string will be
presented; that is not necessarily intelligible, particularly to a
layman who lacks detailed knowledge of RFC 2047 and of language-tags
(which Peter has told us are not meant to be seen by mere mortals).

If it elects to attempt presentation, it may need to decide what
language to use (particularly, as noted, for screen readers). It
might in that case use the longest left-most portion which is
recognizable as a comprehensible (i.e. having a defined meaning)
language tag, which in this case is simply "en" (remember, we're
talking about RFC 3066 parsers, and "en-Latn" is neither registered
nor composed of a language code plus a country code).

I will leave to your judgment whether or not something in
en-[Latn-]US-boont is likely to be intelligible to a listener when
presented as if it were generic en, noting that we have already had a
discussion about directionality of specificity of language tags --
and in this case, if the listener has any indication of the specified
language, it will be what the parser can determine, viz. (plain)
English.

> As for RFC 3066 parsers being unable to interpret the tag, what do
> you think happens now? New tags are registered all the time and
> these don't appear in the putative list of tags inside extant
> RFC 3066 parsers. The parsers don't know what the tag means, but
> that doesn't invalidate its use for content in that language or by
> end users, now does it?
>
> For a concrete example, think about "sl-rozaj", just over a year
> old. None of the browsers in my browser collection, not even
> Firefox, knows what that tag means, but all of them accept it and
> emit it in my Accept-Language header and no web sites have
> complained about it. Okay, I'm not getting any Resian content back
> (but then it isn't first in my A-L list either).

[...]

> An RFC 3066 parser has no way of recognizing a tag registered after
> the parser's list of tags was created.
> Therefore RFC 3066 parsers do not, as a rule, reject unknown tags.
> Making sense of a tag is subjective in the case of generative tags
> today in any event. The level of sense required of an RFC 3066
> parser is generally that it be able to use the remove-from-right
> matching rule on ranges and tags until it finds a value it "knows".

[...]

> No, a strict RFC 3066 parser has to have an up-to-the-second list
> of registered tags. Unless you've just written an implementation
> that foolishly does it, no implementations reject unknown tags as
> long as the tags fit the ABNF requirements of RFC 3066.
> Draft-langtags utilizes this fact to its advantage and actually
> tidies things up a bit. Tags are registered relatively
> infrequently; none have been added in the last 6 months.

You are quite correct that there is an issue regarding updates to a
registry; however:

1. that applies also to your proposal to register subtags; new
   entries won't be known to validating parsers until said parsers
   are updated;
2. it is not unique to language-tags; for example, MIME application
   subtypes seem to be added at a fast and furious pace, certainly
   much more frequently than language-tags;
3. as use (at least theoretically) does not begin until after
   registration, the issue under the current arrangement isn't so
   bad (and wouldn't be so bad under the proposal but for the
   existence of an installed base that has no built-in knowledge of
   the "4 characters means script" and similar rules, which are not
   present in RFC 3066 and its predecessor);
4. because there is a need to be able to validate and use registered
   tags when off-line, there seems to be no general solution to the
   problem, particularly as the number of items in the registry
   would vastly increase under the proposed draft.

> > Language is not exclusively associated with text. It is also a
> > characteristic of spoken (sung, etc.) material (but script is
> > not).
>
> Yes, I agree. Script is important to textual applications of
> language tags, though.
> The fact that it is not applicable to aural or otherwise signed
> representations of language has nothing to do with whether scripts
> might need to be indicated on content that is written.
>
> > Note my use of "or" not "and". I certainly did not state that the
> > information could be obtained from charset alone in all cases.
>
> Groping the text is a very poor mechanism for determining the
> writing system used. Your suggestion is that we *should* be
> *forced* to grope the text. It also appears to be your position
> that we should *not* be given a mechanism whereby users can
> indicate a script preference when selecting language content.

Let me be clear: I am not in any way suggesting that indication of
script or other characteristics peculiar to written material should
be prohibited, nor am I suggesting that applications "*should* be
*forced* to grope the text" for such indication. Rather, I am
suggesting that an orthogonal mechanism might be used, such that it
can be applied to written text without interfering with non-textual
media. I have unofficially suggested a hypothetical Content-Script
field as one possible approach, and have noted that the existing
mechanisms for indicating charset may be adequate in many cases for
determining script (e.g. text with a charset of ANSI_X3.4-1968 is
certainly Latin script, not Cyrillic, and the reverse is true of text
with a charset of INIS-cyrillic).

While "groping the text" would certainly be a poor choice for large
texts (e.g. a message body), it might be appropriate in circumstances
where the amount of text is strictly limited to a small chunk and the
real estate for a language-tag is also strictly limited; the poster
child would seem to be an IDN label, which (prefix, tag, plus
Cuisinart-processed utf-8 name) has to fit into 63 octets.

> > The analogous way to handle that in Internet protocols would be
> > via Content-Script and Accept-Script where relevant (which they
> > would not be for audio media).
> I think that's an awful idea. Why should users have to set two
> headers to get one result?

Users typically don't set header fields in an HTTP transaction (for
example) any more than they set bits in an IP packet header; that is
handled by user agents or other protocol-handling entities as a
matter of communicating information between agents according to a
protocol. Well-designed protocols transfer pieces of orthogonal
information by mechanisms which provide for handling those pieces of
information individually, and which therefore do not burden entities
with having to process unnecessary data.

As an example of an application where having the information separate
is important, consider a web search where one is willing to accept
results in Serbian as used in Serbia and Montenegro in any medium
(video+audio, audio only, text), but where one wishes to restrict
text results to Latin script only. Specifying language and script
separately permits return of audio and video content matching "sr-CS"
as well as text which matches both language and script as specified.
Specifying instead that results must match both language and script
according to the proposed draft syntax and matching rules (viz.
"sr-Latn-CS") would necessarily exclude non-textual media unless such
media were inappropriately labeled with script (and we are agreed
that such labeling would be inappropriate).

_______________________________________________
Ietf@xxxxxxxx
https://www1.ietf.org/mailman/listinfo/ietf
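P.S. The Serbian search example can be sketched as follows. This is a
hypothetical illustration only: the record format and the separate
text-script preference are invented for the sketch, and the matching
function is ordinary RFC 3066-style prefix matching.

```python
# Hypothetical sketch: language and script preferences applied
# orthogonally, so audio/video need only match the language range
# while text must also match the script. Record format is invented.

def matches_range(tag: str, rng: str) -> bool:
    """RFC 3066-style matching: the range equals the tag, or is a
    prefix of it immediately followed by '-'."""
    tag, rng = tag.lower(), rng.lower()
    return tag == rng or tag.startswith(rng + "-")

def acceptable(item: dict, lang_range: str, text_script: str) -> bool:
    if not matches_range(item["language"], lang_range):
        return False
    if item["medium"] != "text":
        return True               # script is irrelevant to audio/video
    return item.get("script") == text_script

results = [
    {"medium": "audio", "language": "sr-CS"},
    {"medium": "text",  "language": "sr-CS", "script": "Latn"},
    {"medium": "text",  "language": "sr-CS", "script": "Cyrl"},
]
kept = [r for r in results if acceptable(r, "sr-CS", "Latn")]
```

Here the audio item and the Latin-script text survive while the
Cyrillic text is filtered out; matching everything against a single
combined range like "sr-Latn-CS" would instead have excluded the
(correctly) script-less audio item as well.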