> From: John C Klensin [mailto:john-ietf@xxxxxxx]

> Ignoring whether "that very nearly happened in RFC 3066",
> because some of us would have taken exception to inserting a
> script mechanism then, let's assume that 3066 can be
> characterized as a language-locale standard (with some funny
> exceptions and edge cases) and that the new proposal could
> similarly be characterized as a language-locale-script standard

I can see we might run into some terminological hurdles here. I would decidedly *not* describe RFC 3066 as a "locale" standard just because it allows for tags that include country identifiers. I would strongly contend that a "language" tag and a "locale" ID are different things serving quite different purposes. But I'll read the rest of your comments assuming that by "language-locale(-script) standard" you simply mean a standard for language tags that can include subtags for region and script.

> If one makes that assumption, then
> the (or a) framework for the answer to the question of what
> problem this solves that 3066 does not becomes clear: it meets
> the needs of when a language-locale-script specification is
> needed.
>
> But that takes us immediately to the comments Ned and I seem to
> be making, characterized especially by Ned's "sweet spot"
> remark. It has not been demonstrated that Internet
> interoperability generally, and the settings in which 3066 are
> now used in particular, require a language-local-script set of
> distinctions.

I disagree. There are many cases in which script distinctions in language tags have been recognized as being needed; several such tags have been registered for that reason already under the terms of RFC 3066, and there are more that would already have been registered except for the fact that people have been anticipating acceptance of this proposed revision.
(For instance, in response to recent discussions, a representative of Reuters has indicated that he was holding off registering various language tags that include ISO 15924 script IDs on that basis, and that he plans to register them if this proposed revision is delayed much longer.)

> The document does not address that issue.

That is probably because those of us who have been participants in the IETF-language list, where this draft originated, have become so familiar with the need that it seems obvious -- evidently, it's not as obvious to people who have not been as focused on IT-globalization issues as we have.

> Equally important, but just as one example, in the MIME context
> (just one use of 3066, but a significant one), we've got a
> "charset" parameter as well as a "language" one. There are
> some odd new error cases if script is incorporated into
> "language" as an explicit component but is not supported in the
> relevant "charset". On the one hand, the document does not
> address those issues and that is, IMO, a problem. But, on the
> other, no matter how they are addressed, the level of complexity
> goes up significantly.

I don't see how such error cases are significantly different from current possibilities, such as having a language tag of "hi" and a charset of ISO 8859-1 (where the content actually uses some non-standard encoding for Devanagari).

> One can also raise questions as to whether, if script
> specifications are really needed, those should reasonably be
> qualifiers or parameters associated with "charset" or "language"
> (and which one) rather than incorporated into the latter. I
> don't have any idea what the answer to those questions ought to
> be.

Having worked on these particular issues for several years, I and many others feel we *do* have an idea what the answer to those questions ought to be -- that script should be incorporated as a permitted subtag within a language tag.

> But they are fairly subtle, the document doesn't address
> them (at least as far as I can tell), and I see no way to get to
> answers to them without a lot more specificity about what real
> internetworking or interoperability problem you are trying to
> solve.

Some days ago, I made reference to a white paper I wrote a few years ago that explores the kinds of distinctions that need to be made in metadata elements declaring linguistic attributes of information objects. It's long, and there are some details I'd change, but that may provide a starting point. The people who have contributed to this draft are all familiar with these ideas. You can find this paper at http://www.sil.org/silewp/abstract.asp?ref=2002-003. Granted, this paper does not go into details regarding specific implementations.

> Similarly, as we know, painfully, from other
> internationalization efforts, the only comparisons that are easy
> involve bit-string identity. Working out, at an application
> level, when two "languages" under the 3066 system are close
> enough that the differences can be ignored for practical
> purposes is quite uncomfortable. Attempting similar logic for
> this new proposal is mind-boggling, especially if one begins to
> contemplate comparison of a language-locale specification with a
> language-script one -- a situation that I believe from reading
> the spec is easily possible.

RFC 3066 makes reference to a fairly simplistic matching algorithm using the notion of language-range. The proposed draft would continue to support that same algorithm with an expectation that implementations of language-range matching as defined in RFC 3066 would continue to operate using exactly the same algorithm on new tags permitted by the proposed revision -- and with generally desirable results. There may be implementations that use a more complex approach to matching involving inspection of the tagged content itself, or inspecting the particular subtags of a language tag.
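
A minimal sketch of the language-range matching referred to above, assuming the rule given in RFC 3066: a range matches a tag if the two are equal, or if the range is a prefix of the tag and the next character of the tag is "-"; the range "*" matches any tag, and comparison is case-insensitive. The function name matches_language_range is a hypothetical name chosen for this example and does not come from RFC 3066 or the draft.

```python
def matches_language_range(language_range: str, language_tag: str) -> bool:
    """Prefix matching on language-ranges, following the rule described in RFC 3066.

    Hypothetical helper for illustration only.
    """
    rng = language_range.lower()
    tag = language_tag.lower()
    if rng == "*":
        # The wildcard range matches every tag.
        return True
    # Either an exact match, or the range is a prefix that ends on a subtag boundary.
    return tag == rng or tag.startswith(rng + "-")


# The comparison never looks inside the subtags, so tags that carry a
# script subtag go through exactly the same code path as older tags.
for rng, tag in [("zh", "zh-Hans"), ("zh-Hans", "zh-Hans-CN"), ("zh-CN", "zh-Hans-CN")]:
    print(rng, tag, matches_language_range(rng, tag))
```

Because the test is a pure prefix comparison, a range of "zh" selects "zh-Hans" without any change to the implementation. It also shows the limit of the simple approach: assuming a tag shape in which script precedes region, the range "zh-CN" does not select "zh-Hans-CN", so relating such tags requires a matcher that interprets individual subtags.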

Certainly an existing RFC 3066 implementation that inspects particular subtags will not be aware of the specific syntax of the proposed revision, though it also cannot be aware of registered RFC 3066 tags defined after the implementation was created -- there is no categorical difference here.

As for how difficult it would be to update such an implementation to use a sophisticated matching algorithm based on interpretation of individual subtags permitted by this draft, I grant that there is greater complexity, but the draft specifically imposes syntactic constraints that allow different types of sub-elements to be identified quite readily.

As for how the different sub-elements would be used for matching, for instance in recognizing a relationship between a language-region tag and a language-script tag, those are issues that already exist with valid RFC 3066 tags such as zh-CN and zh-Hans. I agree that it is not a trivial matter to decide exactly how such tags relate. That does not, however, change the fact that language tags that incorporate script IDs are useful and appropriate. In this particular example, for instance, all that was available for tagging Chinese content for some time were tags like zh-CN and zh-TW, and this was causing very significant problems for implementations and users. That is precisely why zh-Hans and zh-Hant have been registered, and why many of us are eager to see a revision of RFC 3066 that incorporates script IDs. (Granted, that does not speak to other changes proposed by the draft.)

> That situation almost invites
> profiling of how this specification should be used in different
> circumstances...

I have no particular counter to the opinions you expressed in your remaining comments.

Peter Constable