--On Monday, 03 January, 2005 12:29 -0800 Peter Constable <petercon@xxxxxxxxxxxxx> wrote:

>> From: John C Klensin [mailto:john-ietf@xxxxxxx]
>
>> Ignoring whether "that very nearly happened in RFC 3066", because some of us would have taken exception to inserting a script mechanism then, let's assume that 3066 can be characterized as a language-locale standard (with some funny exceptions and edge cases) and that the new proposal could similarly be characterized as a language-locale-script standard
>
> I can see we might run into some terminological hurdles here. I would decidedly *not* describe RFC 3066 as a "locale" standard just because it allows for tags that include country identifiers. I would strongly contend that a "language" tag and a "locale" ID are different things serving quite different purposes. But I'll read the rest of your comments assuming that by "language-locale(-script) standard" you simply mean a standard for language tags that can include subtags for region and script.

That is more than close enough for discussion purposes.

>> If one makes that assumption, then the (or a) framework for the answer to the question of what problem this solves that 3066 does not becomes clear: it meets the needs of when a language-locale-script specification is needed.
>>
>> But that takes us immediately to the comments Ned and I seem to be making, characterized especially by Ned's "sweet spot" remark. It has not been demonstrated that Internet interoperability generally, and the settings in which 3066 is now used in particular, require a language-locale-script set of distinctions.
>
> I disagree.
> There are many cases in which script distinctions in language tags have been recognized as being needed; several such tags have been registered for that reason already under the terms of RFC 3066, and there are more that would already have been registered except for the fact that people have been anticipating acceptance of this proposed revision. (For instance, in response to recent discussions, a representative of Reuters has indicated that he was holding off registering various language tags that include ISO 15924 script IDs on that basis, and that he plans to do so if this proposed revision is delayed much longer.)

It would be very helpful, to me at least, if you or he could identify the specific context in which such tags would be used and are required. The examples should ideally be of IETF-standard software, not proprietary products.

>> The document does not address that issue.
>
> That is probably because those of us who have been participants in the IETF-languages list, where this draft originated, have become so familiar with the need that it seems obvious -- evidently, it's not as obvious to people who have not been as focused on IT-globalization issues as we have.

How nice. In 2004, I discovered that I had no operational experience and then that I knew nothing about standardization processes outside the IETF. It is now only three days into 2005, and already I've learned that I haven't been focused on "IT globalization". I anxiously await the opportunity to find out what comes next in this sequence :-)

>> Equally important, but just as one example, in the MIME context (just one use of 3066, but a significant one), we've got a "charset" parameter as well as a "language" one. There are some odd new error cases if script is incorporated into "language" as an explicit component but is not supported in the relevant "charset". On the one hand, the document does not address those issues and that is, IMO, a problem.
>> But, on the other, no matter how they are addressed, the level of complexity goes up significantly.
>
> I don't see how such error cases are significantly different from current possibilities, such as having a language tag of "hi" and a charset of ISO 8859-1 (where the content actually uses some non-standard encoding for Devanagari).

Since I haven't paid attention to IT globalization and internationalization issues for the last 20 or 30 years, I obviously don't know enough about alphabetic equivalency relationships, the collection of TC 46 transliteration standards (including, in this case, the possibility that IS 15919 is in use), and related work to be able to address this question.

>> One can also raise questions as to whether, if script specifications are really needed, those should reasonably be qualifiers or parameters associated with "charset" or "language" (and which one) rather than incorporated into the latter. I don't have any idea what the answer to those questions ought to be.
>
> Having worked on these particular issues for several years, I and many others feel we *do* have an idea what the answer to those questions ought to be -- that script should be incorporated as a permitted subtag within a language tag.

Good. See the request for explanation and examples above. Things that you and your colleagues know, but that aren't in the draft or some supplemental and equally accessible document, are really not helpful.

>> But they are fairly subtle, the document doesn't address them (at least as far as I can tell), and I see no way to get to answers to them without a lot more specificity about what real internetworking or interoperability problem you are trying to solve.
>
> Some days ago, I made reference to a white paper I wrote a few years ago that explores the kinds of distinctions that need to be made in metadata elements declaring linguistic attributes of information objects.
> It's long, and there are some details I'd change, but that may provide a starting point. The people who have contributed to this draft are all familiar with these ideas. You can find this paper at http://www.sil.org/silewp/abstract.asp?ref=2002-003. Granted, this paper does not go into details regarding specific implementations.

I've just now skimmed parts of this paper. It is very interesting and I look forward to carefully reading the rest of it. We are in agreement about your category model. The only place where there is a difference is whether, for the purposes of the IETF and the actual demands on RFC 3066, something else --and something as complex as I perceive this proposal as being-- is really needed.

I can, for the record, believe that this proposal is unnecessary and too complex while also believing that it is possible to make registrations under the rules of 3066 that would make quite a mess of things. We have tag review processes to prevent just that eventuality. I can also believe that 3066 represents a compromise, rather than a perfect solution to the issues you outline in your paper, without believing that that translates into either a problem that needs to be solved or a problem that needs to be solved with this particular proposal. I've got a fairly open mind on those subjects; I just believe that the burden of demonstrating that a major change is needed in a system that appears to be working is, and should be, fairly high.

>> Similarly, as we know, painfully, from other internationalization efforts, the only comparisons that are easy involve bit-string identity. Working out, at an application level, when two "languages" under the 3066 system are close enough that the differences can be ignored for practical purposes is quite uncomfortable.
>> Attempting similar logic for this new proposal is mind-boggling, especially if one begins to contemplate comparison of a language-locale specification with a language-script one -- a situation that I believe, from reading the spec, is easily possible.
>
> RFC 3066 makes reference to a fairly simplistic matching algorithm using the notion of language-range. The proposed draft would continue to support that same algorithm, with an expectation that implementations of language-range matching as defined in RFC 3066 would continue to operate using exactly the same algorithm on new tags permitted by the proposed revision -- and with generally desirable results.
>
> There may be implementations that use a more complex approach to matching involving inspection of the tagged content itself, or inspecting the particular subtags of a language tag.
>...

Peter, you are talking, I think, about different applications doing different things, given the greater range of options and flexibility that the new specification provides. From my point of view and experience, every time someone says "well, some applications may do something else" or "some implementations may use a more complex approach", what I hear is more potential for ways in which things won't interoperate, more areas in which profiles are needed to assure interoperability, and so on. Whether the interoperability issues show up at a protocol level or to the user as a violation of the law of least astonishment makes little difference: such things make the Internet work less well and should be avoided unless there is a _really_ strong reason for them. What I'm trying to probe here are those reasons.

>...

Let me also comment on the ISO 3166 issues here, rather than starting another note. For me, there is no question that 3166/MA has made quite a mess of things with a few of their reuse decisions, most notably the recent assignment of CS to Serbia and Montenegro.
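For concreteness, the "fairly simplistic" language-range matching that RFC 3066 defines is case-insensitive prefix matching on subtag boundaries: a range matches a tag if it equals the tag or is a prefix of it ending where a "-" begins, with "*" matching everything. A minimal sketch (the function name and test tags are mine, not from the draft):

```python
def matches(language_range: str, tag: str) -> bool:
    """RFC 3066 matching: a language-range matches a language-tag if,
    compared case-insensitively, it exactly equals the tag, or exactly
    equals a prefix of the tag such that the first tag character
    following the prefix is "-".  The range "*" matches every tag."""
    r, t = language_range.lower(), tag.lower()
    return r == "*" or t == r or t.startswith(r + "-")

# The same prefix rule, applied unchanged to tags carrying
# region or ISO 15924 script subtags:
assert matches("en", "en-GB")        # region subtag
assert matches("sr", "sr-Latn")      # script subtag
assert not matches("en-GB", "en")    # range longer than tag: no match
assert not matches("zh", "zho")      # prefix must end on a subtag boundary
```

Note that this rule treats a script subtag and a region subtag identically: "sr" matches both "sr-Latn" and "sr-CS", which is what makes comparing a language-script tag against a language-locale tag the open question raised above.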
In the pre-ICANN period, IANA had fairly well-considered procedures for dealing with code changes, and I have been appalled that ICANN has sometimes felt a need to ignore those precedents in favor of believing that it needs to consider ccTLD changes any time 3166/MA makes a change. But the solution to the problem of various ISO TCs not having an adequate understanding of the impact on the Internet and IT communities (and, in the case of TC 46, even the library/information-sciences community that is one of their historical main constituencies) is, IMO, to get that message across via liaison statements and, if necessary and appropriate, by encouraging national member bodies to cast "no" votes on standards and registration procedures that are insufficiently stable. After the "CS" decision, the statements from the British Library advocating a much longer time-to-reuse and from the IAB suggesting that a century might be adequate were, again IMO, just the right sort of approach.

In particular, I presume that TC 37 has an adequate liaison mechanism in place with TC 46 to insist that a much more conservative position be adopted with regard to changes. If TC 37 isn't able or inclined to do that job effectively, I'm not persuaded that shifting the task to the IETF is an appropriate solution or one that is likely to be effective.

As I have noted in other contexts, an attitude in the Internet community that extreme stability in external standards is critical is not a new development, as evidenced by our continued use of ANSI/X3.4-1968 as the base reference for "US-ASCII", just as our response to some incompatible changes in Unicode between 3.2 and 4.0 has been to freeze some things at 3.2. Our solution has not been to try to create IETF standards to work around the stability issues in ISO (or other) standards. Down that path generally lies madness.
If it is really necessary --i.e., there are no other practical alternatives and we have the needed expertise-- then I think we should consider it, but that case has, IMO, not yet been made here.

My apologies but, since the Last Call is closing and there is supposed to be a -09 coming, I don't believe that it is useful to continue this discussion much further until the IESG has made some decisions about what should be done next and told the community about them.

    john

_______________________________________________
Ietf@xxxxxxxx
https://www1.ietf.org/mailman/listinfo/ietf