RE: Last Call on Language Tags (RE: draft-phillips-langtags-08)

John:

> How nice.  In 2004, I discovered that I had no operational
> experience and then that I knew nothing about standardization
> processes outside the IETF.  It is now only three days into 2005
> and already I've learned that I haven't been focused on "IT
> globalization".  I anxiously await the opportunity to find out
> what comes next in this sequence :-)

I did not mean to imply that you have no particular involvement in IT
globalization, though I can see now that that is a likely way for my
comment to have been read, and for that I apologize. 

For the past several years the majority of my work has been related to
standards pertaining to IT globalization in one way or another, and I
have encountered several groups of people interested in metadata
elements for describing linguistic properties of content. A number of
those people have congregated (metaphorically) on the IETF-languages
list, and a number of those have provided input on this draft. In each
of these contexts, I have encountered general agreement that it is
appropriate to include writing-system distinctions as part of language
tags. Only in the past couple of weeks have I encountered people who
question the decision to incorporate script IDs, and all of them are
people who have not been subscribed to the IETF-languages list, or at
least have not been active contributors to discussion on that list.


> It would be very helpful, to me at least, if you or he could
> identify the specific context in which such tags would be used
> and are required.  The examples should ideally be of
> IETF-standard software, not proprietary products.

Chinese can be used as a good example for writing-system distinctions
that cannot be captured in RFC 3066 using pre-defined values. A good
example of an IETF protocol where such distinctions are needed is the
accept-language field of HTTP page requests. For several years, Web site
administrators encountered difficulties in providing appropriate choices
to users using control mechanisms involving tags like zh-CN and zh-TW --
the tags simply did not correspond well with the localized content that
they needed to provide to users.

Certainly outside of IETF protocols there are lots of scenarios not
involving proprietary products in which RFC 3066 language tags are used
and in which script distinctions like the Chinese example are a
significant issue. This is a big issue for the localization industry,
for example, and its various data-representation standards based on XML.
More generally, there are a growing number of XML-based specifications
for language resources and content, in many of which text is a major
form of language data, and in all of these cases writing-system
distinctions like those for Chinese are critically important.

I've used Chinese as one example, but there are many other cases, some
familiar to many and some less well known. Also, in relation to IETF
protocols, I mentioned only HTTP, but the same problems likely exist for
other protocols involving textual linguistic content where RFC 3066 is
used. For example, in searching for items in an LDAP directory, it may
be appropriate for an AttributeDescription to specify Traditional Chinese
rather than Simplified Chinese, or Serbian using the Latin-script
orthography vs. Serbian using the Cyrillic-based orthography.
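
To make these distinctions concrete, here is a rough sketch in Python of
how tags carrying a script subtag might be split by subtag shape. It
assumes the language-script-region ordering, four-letter ISO 15924 script
codes, and regions that are either two-letter ISO 3166 codes or
three-digit UN numeric codes. The helper name split_tag and the example
tags are illustrative only, and the pattern covers just these three
subtag kinds, not the full grammar of the draft.

```python
import re

# Hypothetical helper for illustration; not taken from any specification.
# Shapes assumed: 2-3 letter language, optional 4-letter script,
# optional region (2 letters or 3 digits).
_TAG = re.compile(
    r"(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
)


def split_tag(tag):
    """Return the subtags present in `tag` as a dict, or None if no match."""
    m = _TAG.fullmatch(tag)
    if m is None:
        return None
    # Drop the optional groups that did not participate in the match.
    return {name: value for name, value in m.groupdict().items()
            if value is not None}


if __name__ == "__main__":
    for tag in ("zh-TW", "zh-Hant", "zh-Hans-CN", "sr-Latn", "sr-Cyrl"):
        print(tag, split_tag(tag))
```

With a split like this, a tag such as zh-TW exposes only a region, while
zh-Hant states the writing system directly.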



> I've just now skimmed parts of this paper.  It is very
> interesting and I look forward to carefully reading the rest of
> it.  We are in agreement about your category model.   The only
> place where there is a difference is whether, for the purposes
> of the IETF and the actual demands on RFC 3066, something else
> --and something as complex as I perceive this proposal as
> being-- is really needed.

In ideal terms, I do not think that all of the complexity of the
proposed draft is needed. On the other hand, I think that some people's
characterization of the complexity as excessive has been overstated.
Some of the complexity I consider superfluous but not particularly
harmful (notably the extensions). Some of it I think is an unfortunate
result of existing implementations and past practice: in particular,
the steps taken to guard against instability in ISO 3166, and the use of
both UN numeric IDs and ISO 3166, which follows from prior usage of
ISO 3166-1 combined with the need for region identifiers beyond those
that ISO 3166-1 provides.


> I can, for the record, believe that
> this proposal is unnecessary and too complex

Strictly speaking, any tag it proposes could be registered using the RFC
3066 registration process, so it could in some sense be claimed to be
unnecessary. But there is no reason not to allow generative
combinations involving script IDs where such tags are needed, since
there's no need to state the semantics of the whole explicitly in such
cases. And there *is* a need to avoid the problem you alluded to...

> while also
> believing that it is possible to make registrations under the
> rules of 3066 that would make quite a mess of things.

Part of my reluctance to have script IDs included in RFC 3066 was due to
the fact that a set of tags had just been registered (some of which I
now wish didn't exist) which used various subtags in combination, and I
sensed that there was a lack of collective understanding of what the
internal structure of tags and relationships between subtags should be
(which is what directly led me to write the paper I referred to
earlier). Not long after RFC 3066 was approved, there were several
further tags registered that used various subtags in combinations that
concerned me then (I voiced my reservations at the time) and still do.
RFC 3066 *is* too flexible to use without some kind of constraints.
While the proposed draft is not what I would have drafted had I gotten
there first, I have been willing to support it because I feel it
provides helpful constraints on the internal structure of RFC 3066
language tags.



> We have
> tag review processes to prevent just that eventuality.

I have been party to the review process for the past five or so years,
and can say that the review process did not, IMO, always succeed in
avoiding regrettable tags (I do not consider those that include script
IDs to be among them). The reason is that there was no model of what
ontology needed to be described, of which elements were appropriate
within a tag, or of how those elements should relate to one another.
This draft doesn't describe such a model, but it does impose one, which
I think is moving in a good direction.


> > There may be implementations that use a more complex approach
> > to matching involving inspection of the tagged content itself,
> > or inspecting the particular subtags of a language tag.
> >...
> 
> Peter, you are talking, I think, about different applications
> doing different things given the greater range of options and
> flexibility that the new specification provides.

Actually, no; I was trying to guess at existing applications that might
have particular problems with complexity, as you mentioned. Certainly
language-range matching is no more complex in the proposed draft than it
is today. I personally suspect that the language-range matching
algorithm is too simplistic, but I haven't gone beyond that myself to
start suggesting it needs to be replaced with something more complex.
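
For reference, the language-range matching rule in RFC 3066 amounts to a
case-insensitive prefix test on subtag boundaries, with "*" matching any
tag. A minimal sketch in Python, where the function name range_matches
is made up for illustration:

```python
def range_matches(language_range, tag):
    """Prefix matching in the style of RFC 3066 language ranges.

    A range matches a tag if the two are equal, or if the range is a
    prefix of the tag and the next character in the tag is "-".
    Comparison is case-insensitive; "*" matches every tag.
    """
    r = language_range.lower()
    t = tag.lower()
    if r == "*":
        return True
    return t == r or t.startswith(r + "-")


if __name__ == "__main__":
    for rng, tag in (("zh", "zh-Hant"),
                     ("zh-Hant", "zh-Hant-TW"),
                     ("zh-TW", "zh-Hant-TW"),
                     ("zh", "zho")):
        print(rng, tag, range_matches(rng, tag))
```

Under this rule a range of zh-TW does not match a tag of zh-Hant-TW,
since the test only ever compares leading subtags.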



> Let me also comment on the ISO 3166 issues here...   But
> the solution to the problem of various ISO TCs not having an
> adequate understanding of the impact on the Internet and IT
> communities (and, in the case of TC46, even the
> library/information sciences community that are one of their
> historical main constituencies) is, IMO, to get that message
> across via liaison statements and, if necessary and appropriate,
> encouraging national member bodies to cast "no" votes on
> standards and registration procedures that are insufficiently
> stable.  After the "CS" decision, the statements from the
> British Library advocating a much longer time-to-reuse and from
> the IAB suggesting that a century might be adequate were, again,
> IMO, just the right sort of approach.   In particular, I presume
> that TC 37 has an adequate liaison mechanism in place with TC 46
> to insist that a much more conservative position be adopted with
> regard to changes.  If TC 37 isn't able or inclined to do that
> job effectively, I'm not persuaded that shifting the task to the
> IETF is an appropriate solution or one that is likely to be
> effective.

For my part, I made a point of informing TC 37 members of the
re-assignment of CS, and that led to a resolution at our Paris meeting
last August expressing strong concern over this. I never heard any
response from either TC 46 or the ISO 3166 MA on this matter, however. I
don't know that I would have devised the approach to the handling of
this issue used in this draft had I been its author. I am deeply
concerned that stability be ensured in language tags, however, and if
this is the only way to ensure it I can accept it. 

Of course, your point is that it probably is neither the only nor the
best way to ensure this. I have no comments to counter that opinion.

Regards,
Peter Constable

