Re: [Last-Call] [art] Artart last call review of draft-ietf-core-problem-details-05

Dear Core and I18N experts,

Some comments on the I18N aspects of Tag 38 below.

[Sorry this answer took so long, and got so long. The two 'long's influenced each other :-).]

On 2022-06-16 01:23, Carsten Bormann wrote:

Hi Harald,

thank you for this thoughtful review.

The “Tag 38 internationalized string”
This document adds an appendix defining an “internationalized string” format
that adds a BCP 47 language tag and a Unicode-based direction indicator to a
UTF-8 string. This is laudable; RFC 2277 section 4 pointed out the need for
this ability 24 years ago.

I think that Language-Tagged Strings (CBOR Tag 38, https://datatracker.ietf.org/doc/html/draft-ietf-core-problem-details-06#appendix-A) are a very good step ahead. At least for CBOR, in many cases from now on, the answer might just be "use Tag 38" (assuming we get the details right).


Unfortunately neither definition is problem-free.

First of all, this tag, if useful at all, is of far greater utility than the
error format. Burying it in an appendix of a document whose stated purpose is
something else makes it far more difficult to refer to than it needs to be.

That is usually not a problem.  The focal point for finding a CBOR tag for a specific application is the CBOR tag registry; this then points to the places where the specifications for the tags can be found (which in this case is easily expressed as “Appendix A of RFC XXXX”).

Separate Draft or Not
=====================

I agree with Harald that it should be a separate draft; it would definitely help with visibility of I18N in general and the issue of strings with language and directionality information inside and outside the IETF (not only the visibility within the CBOR community, which may be covered by the tag registry). Being able to say "look at RFC XXXX for a good example" is way better than being able to say "look at appendix X of RFC YYYY for a good example".

I understand Francesca's arguments, too, but I think the investment in a separate draft would be well worth the effort. I'm willing to contribute although I guess that Carsten would do the necessary work in less time than it takes him to get anybody else up to speed.


Second, the “detailed semantics” has chosen to include the quite complex BNF of
RFC 5646 translated into CDDL; this may have some use, but BCP 47 is a moving
target;

We intend tag38 to be useful for the current form of BCP 47, so it is hard to plan for the future.  If BCP 47 needs to be considered unstable, we could of course define a “bcp47-extension” alternative with a CDDL feature control operator.

(NOT!) Copying BCP 47 Grammar
=============================

I also agree with Harald that the definition of 'Language-Tagged Strings' has room for improvement. First, as Harald said, it repeats the BCP 47 grammar when we know very well that repeating grammars is usually a bad idea. I'm really not sure why CBOR wants to check each and every detail of the current language tag syntax. My understanding was that CBOR was (among other things, if not primarily) for constrained devices. I just cannot see the motivation for embedding a list of legacy tags into a constrained device.

I also don't know of any other technology on a similar level as CBOR that does so. As an example, XML had productions 33-38 (see https://www.w3.org/TR/1998/REC-xml-19980210#sec-lang-tag), but they were removed as early as 2000 (see https://www.w3.org/TR/2000/REC-xml-20001006#sec-lang-tag), for very good reasons. I really have difficulty imagining why CBOR would want to make the same mistake that XML fixed more than 20 years ago.

Similarly, XML Schema Datatypes only gives a very simple regular expression ([a-zA-Z]{1,8}(-[a-zA-Z0-9]{1,8})*) and notes
(see https://www.w3.org/TR/xmlschema11-2/#language):

[[[[
Note: The regular expression above provides the only normative constraint on the lexical and value spaces of this type. The additional constraints imposed on language identifiers by [BCP 47] and its successor(s), and in particular their requirement that language codes be registered with IANA or ISO if not given in ISO 639, are not part of this datatype as defined here.
]]]]
Again, XML Schema would have done something more precise if anybody had been convinced that such precision made sense.
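
To make the XML Schema approach concrete, here is a minimal sketch in Python of a plausibility check built on the regular expression quoted above. The function name is hypothetical and chosen only for illustration; the pattern itself is the one from XML Schema Datatypes. The check deliberately knows nothing about the registry or about grandfathered tags.

```python
import re

# The xs:language pattern quoted above from XML Schema Datatypes.
LANGUAGE_PATTERN = re.compile(r"[a-zA-Z]{1,8}(-[a-zA-Z0-9]{1,8})*")


def looks_like_language_tag(tag: str) -> bool:
    """Hypothetical helper: minimal plausibility test only.

    It accepts anything shaped like a sequence of subtags of at most
    8 characters each. It does NOT consult the IANA registry, so
    registry-invalid tags such as "en-UK" or "zh-ZZ" still pass.
    """
    return LANGUAGE_PATTERN.fullmatch(tag) is not None


if __name__ == "__main__":
    for candidate in ["fr", "en-US", "en-UK", "i-klingon", "Hello, world"]:
        print(candidate, looks_like_language_tag(candidate))
```

Such a check would catch a text string that ended up in the language tag slot, while leaving everything that is (or may one day become) a language tag alone.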


Another way to see this is that in general, when imposing restrictive syntactic rules, there's the question of "bang for the buck". The complexity of the language tag syntax rules, down to the legacy (grandfathered) stuff, means that the cost ("buck") is quite high. This includes not only implementation and memory footprint, but also testing and everything else.

On the other hand, the "bang" is quite low, for two reasons:
First, without a check against the registry, a lot of garbage can still get through. Think e.g. of "en-UK", which looks reasonable and fits the grammar, but is not allowed (UK is not a country code, "en-GB" is correct). Second, most actual language tags, in particular for constrained devices, are more on the level of "fr" or "en-US", which means that on most actual data, the full syntax isn't really exercised. This in turn means that software with bugs in the syntax-checking part doesn't get weeded out.

The main mechanisms (if any) that will help to make sure these language tags are correct are the following: 1) On the 'sender' side, texts will be translated, by "hand" or using some localization tools, and the correct language tags will be set there (because somebody translating to Ukrainian, or their tool, knows the correct tag is "uk", and not something else). 2) On the 'receiver' side, user preferences will be expressed as language tags (or prefixes,...), which should ensure that correctly tagged data gets shown and incorrectly tagged data gets ignored.
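
As an illustration of the receiver-side mechanism, here is a rough sketch in Python of prefix matching in the style of RFC 4647 basic filtering. The function names and the sample data are assumptions made up for this example; a real receiver may well use a different matching scheme.

```python
def basic_filter_match(language_range: str, tag: str) -> bool:
    """Basic filtering in the style of RFC 4647, Section 3.3.1.

    A range matches a tag if, compared case-insensitively, it equals
    the tag or equals a prefix of the tag that is followed by "-".
    The wildcard range "*" matches every tag.
    """
    r = language_range.lower()
    t = tag.lower()
    if r == "*":
        return True
    return t == r or t.startswith(r + "-")


def select_texts(preferences, tagged_texts):
    """Hypothetical helper: keep the (tag, text) pairs that match any
    preferred range, ordered by the user's preferences."""
    selected = []
    for pref in preferences:
        for item in tagged_texts:
            if basic_filter_match(pref, item[0]) and item not in selected:
                selected.append(item)
    return selected


if __name__ == "__main__":
    # Made-up sample data: the third entry is mis-tagged as "ua".
    texts = [("en-US", "Error"), ("uk", "Помилка"), ("ua", "Помилка")]
    print(select_texts(["uk"], texts))
```

With a user preference of "uk", only the correctly tagged Ukrainian text is selected; the entry mis-tagged as "ua" is simply never shown, without any grammar check being involved.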

To summarize, copying the grammar from BCP 47 brings extremely little bang for rather high cost. Get rid of it in the same way that other standards which have thought this through have gotten rid of a detailed grammar. If you want something that gives you a minimal plausibility test (to catch cases where e.g. the text and the language tag got swapped by some accident,...), do what XML Schema did.

This will also be future proof. There are many changes to BCP 47 that have been discussed in the past (although none of these got traction, or are expected to get traction in the near future), but changing the basic syntax constraint expressed by XML Schema was never considered an option. On the other hand, it was always clear to the people involved that users of language tags shouldn't create artificial barriers to future changes. It would really be a pity if CBOR created such a barrier just because they could. Things such as "CDDL feature control operators" are great where they actually serve a purpose; here I don't think they would.


Directionality Information
==========================

Regarding language tags, in addition, there is the following note:
[[[[
NOTE: The Unicode Standard [Unicode-14.0.0] includes a set of
   characters designed for tagging text (including language tagging), in
   the range U+E0000 to U+E007F.  Although many applications, including
   RDF, do not disallow these characters in text strings, the Unicode
   Consortium has deprecated these characters and recommends annotating
   language via a higher-level protocol instead.  See the section
   "Deprecated Tag Characters" in Section 23.9 of [Unicode-14.0.0].
]]]]
It's weird for the IETF to refer (only) to the Unicode standard here even though the IETF has deprecated this kind of language tagging in RFC 6082 (see https://www.rfc-editor.org/rfc/rfc6082.html). So please cite that RFC.


having CDDL parsers try to validate tags according to this grammar is
not going to be useful. If included at all, this needs to be clearly marked
with text saying that BCP 47 is normative for this grammar, and that language
tag parsers should NOT try to reject tags based on this grammar; instead, they
should be treated as strings, and looked up against relevant language handling
APIs. (“zh-ZZ” is perfectly valid according to the grammar, but is semantically
invalid according to BCP 47).

Here again, it is hard to capture semantics in a structural definition.
Our document is going to reference RFC 5646 (including its ABNF), as that is the current definition; if BCP 47 is updated, the effect of that update on this document will need new consideration.

No, please. I understand that in some areas, you don't want to allow gratuitous changes to your network and software based on changes to technology that you use. But for language tags, such a mindset is really counterproductive. Some of the changes to BCP 47 that have been discussed are to include some subtags for dialects. Now if such a change happened, there are two questions relevant for CBOR: 1) How many cases would there be in the CBOR landscape where people would want to use such subtags? The answer would probably be: Very few, so a change (using a "CDDL feature control operator" or whatever) would have very low priority. But why should people be prohibited from using such subtags if they want to use them? 2) What's the problem in letting such subtags through the current infrastructure? My guess is that there's no problem at all. When there are parallel texts, one tagged with "en-US" and the other with one of these dialect subtags, the chance is very high that a recipient will be displaying the former. Would that be a problem?


Note also that the sentence “Data items with tag
38 that do not meet the criteria above are invalid (see Section 5.3.2 of
[STD94]).” is really hard to parse semantically, given that section 5.3.2 of
RFC 8949 doesn’t use the word “invalid”, it uses “inadmissible value”. I do not
recommend rejecting unknown language tags.

They may not be rejected, they are just not “valid” in the RFC 8949 sense (they are still well-formed).  I would expect language tags to evolve within the grammar defined by RFC 5646 (which does have an extension point); if that is a mistaken assumption, please let us know.

In the short term (my average guess at "short term" would be 10 years or so), evolution *within* RFC 5646 is definitely the main focus. In the really long term, I guess anything that fits the XML Schema production is fair game. That restriction has been there since the original RFC 1766, and provides some actual "bang for the buck". It is also baked into technologies such as XML Schema, which would provide a very strong argument not to give up on it. In all the work on revising RFC 1766 (which I co-chaired, and which was quite long-winded), the rule that each subtag had to be 8 characters or less was never seriously challenged.


Thirdly, the definition of the tri-state direction attribute can be made
clearer; in particular, the Unicode Bidirectional Algorithm (UAX#9) should be
referenced, with particular reference to
https://www.unicode.org/reports/tr9/tr9-44.html#Markup_And_Formatting - the
important property here is that the desired semantic is isolation - the markup
is intended to have zero influence on strings outside the embedded string - the
semantics of embedding in RLI…PDI is the desired effect.

Tag38 does not provide a way to handle embedding, so we are not trying to boil that ocean yet.

Again, I agree with Harald here. But first, please be careful. "embedding" has a very narrow technical meaning in the Bidi Algorithm (UAX #9). Tag 38 doesn't need a way to handle embeddings in this sense. When Harald used the term "embedded string", he didn't use "embedded" in this very narrow technical sense, but in a more general sense, namely that the string from Tag 38 is expected to be put into some (surrounding) context. That might mean that it shows up by itself somewhere, or that it gets included in a larger text of some sort.

In the draft, you have the following text:
[[[[
   The optional third element, if present, is a Boolean value that
   indicates a direction: false for "ltr" direction, true for "rtl"
   direction.  If the third element is absent, no indication is made
   about the direction; it can be explicitly given as null to express
   the same while overriding any context that might be considered
   applying to this element.  Note that the proper processing of
   Language and Direction Metadata is an active area of investigation;
   the reader is advised to consult ongoing standardization activities
   such as [STRING-META] when processing the information represented in
   this tag.
]]]]

[override is also a technical term in the Bidi Algorithm]

I think this text is very important, so I'll go into some detail. First (minor nit), it says "If the third element is absent ...". Because this is in a paragraph that starts with "The optional third element ...", I think it would better say "If this element is absent ...".

Next, let me make sure that I get this right: This is a Boolean value, but it can in effect have four different states, yes? That would be:
- True (rtl)
- False (ltr)
- null (no indication about direction, but overriding any context)
- absent (no indication about direction, but context may apply)
If that's true, then it might be good to put that into a more structured form (something like the above list).
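
To spell out the four states, here is a small sketch in Python. It assumes (purely for illustration) that the content of Tag 38 has already been decoded into a Python list of the form [language, text] or [language, text, direction], with CBOR null mapped to None; the enum and function names are hypothetical and not taken from the draft.

```python
from enum import Enum


class Direction(Enum):
    LTR = "ltr"        # third element is false
    RTL = "rtl"        # third element is true
    NULL = "null"      # explicit null: no indication, context overridden
    ABSENT = "absent"  # no third element: no indication, context may apply


def direction_of(tag38_content: list) -> Direction:
    """Hypothetical helper mapping a decoded Tag 38 array to its state."""
    if len(tag38_content) == 2:
        return Direction.ABSENT
    if len(tag38_content) != 3:
        raise ValueError("expected [language, text] or [language, text, direction]")
    third = tag38_content[2]
    if third is None:
        return Direction.NULL
    if third is True:
        return Direction.RTL
    if third is False:
        return Direction.LTR
    raise ValueError("direction must be true, false, or null")


if __name__ == "__main__":
    print(direction_of(["en", "Hello"]))        # absent
    print(direction_of(["en", "Hello", None]))  # explicit null
    print(direction_of(["he", "שלום", True]))   # rtl
```

The point of the sketch is that "absent" and "null" are distinguishable on the wire, so the specification has to say what each of them means.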

[very major point] The main problem is with the last sentence. There's not much of a point in defining a field for directionality if it's not clear what that is supposed to be used for. I'm also not sure where the claim "the proper processing of Language and Direction Metadata is an active area of investigation" came from, and why it is here.

It is true that some areas of bidi processing (e.g. the best consistent way to display IRIs that contain pieces of text from both directionalities) are not solved yet, or even (as in the example just given) are not being actively investigated because the general agreement is that the problem is too difficult to have a solution. It is also true that "Strings on the Web: Language and Direction Metadata" (https://www.w3.org/TR/string-meta/) is still in Draft status.

But neither of these facts should have to influence the specification of Tag 38. [StringMeta] (3.4 What consumers need to do to support direction, https://www.w3.org/TR/string-meta/#what_consumers_do), Harald and I all agree about what the right thing to do is: Use Bidi isolation (in the technical sense of https://www.unicode.org/reports/tr9/#Explicit_Directional_Isolates).

So given all the above considerations, what about rewriting the paragraph under consideration along the following lines:

[[[[
   The optional third element, if present, is a Boolean value that
   indicates a direction, as follows:
   - false: LTR direction. The text is expected to be displayed
     with LTR base direction if standalone, and isolated with LTR
     direction (enclosed in LRI ... PDI or equivalent, see [1]) in
     the context of a longer string or text.
   - true: RTL direction. The text is expected to be displayed
     with RTL base direction if standalone, and isolated with RTL
     direction (enclosed in RLI ... PDI or equivalent, see [1]) in
     the context of a longer string or text.
   - absent: no indication is made about the direction
   - (explicit) null: no indication is made about the direction,
     but any directionality context applying to this element (e.g.,
     base directionality information for an entire CBOR message or
     part thereof) is ignored.
]]]]
[1] Unicode® Standard Annex #9, Unicode Bidirectional Algorithm, Section 2.7 Markup and Formatting Characters, https://www.unicode.org/reports/tr9/#Markup_And_Formatting

I'm not really sure yet about the 'absent' and 'null' entries: neither whether they are really distinct, nor whether the specification is good enough (we might want to specify FIRST STRONG ISOLATE semantics).
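
For what isolation means in practice when a receiver inserts the string into a longer text, here is a minimal sketch in Python using the isolate formatting characters from UAX #9 (LRI U+2066, RLI U+2067, FSI U+2068, PDI U+2069). The function name is hypothetical, and using FIRST STRONG ISOLATE when no direction is given is only the possibility floated above, not something the draft specifies.

```python
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate(text: str, rtl=None) -> str:
    """Hypothetical helper: wrap text so it cannot affect its surroundings.

    rtl=False -> LRI ... PDI, rtl=True -> RLI ... PDI.
    rtl=None  -> FSI ... PDI (an assumption, see the lead-in).
    """
    if rtl is True:
        opener = RLI
    elif rtl is False:
        opener = LRI
    else:
        opener = FSI
    return opener + text + PDI


if __name__ == "__main__":
    # Embed a tagged RTL string in a surrounding LTR message.
    message = "Device reported: " + isolate("שגיאה 42", rtl=True) + " (code 7)"
    print(message)
```

Markup-based environments would use their own equivalent (the "or equivalent" in the proposed text) rather than these control characters.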


Hope this helps. Let's make sure together that we get this right.

Regards,    Martin.
