Hi Martin,
thank you for these comments. Late changes are always risky, but we think these comments do lead to a very desirable improvement of the document.
We have prepared a pull request at:
https://github.com/core-wg/core-problem-details/pull/40
I think that Language-Tagged Strings (CBOR Tag 38, https://datatracker.ietf.org/doc/html/draft-ietf-core-problem-details-06#appendix-A) are a very good step ahead. At least for CBOR, in many cases from now on, the answer might just be "use Tag 38" (assuming we get the details right).
Indeed. The original step was taken by Peter Occil in 2014 IIRC; this specification just goes ahead and adds writing direction.
Unfortunately neither definition is problem-free.
First of all, this tag, if useful at all, is of far greater utility than the error format. Burying it in an appendix of a document whose stated purpose is something else makes it far more difficult to refer to than it needs to be.
That is usually not a problem. The focal point for finding a CBOR tag for a specific application is the CBOR tag registry; this then points to the places where the specifications for the tags can be found (which in this case is easily expressed as “Appendix A of RFC XXXX”).
Separate Draft or Not
=====================
I agree with Harald that it should be a separate draft; it would definitely help with visibility of I18N in general and the issue of strings with language and directionality information inside and outside the IETF (not only the visibility within the CBOR community, which may be covered by the tag registry). Being able to say "look at RFC XXXX for a good example" is way better than being able to say "look at appendix X of RFC YYYY for a good example".
Actually, “look at RFC XXXX for a good example” is going to be the outcome of the combined document, because the document not only defines tag 38 (in Appendix A), but also shows a couple examples that use it (in the main body) and even an instance where we decided to unravel it (SPDe -6/-7). So I’m in favor of keeping this document together.
I also really want to push back against doing this kind of surgery at this stage of the document. (On the other hand, a split might increase the number of RFCs I’m a co-author on… Nooo, just kidding.)
Second, the “detailed semantics” has chosen to include the quite complex BNF of RFC 5646 translated into CDDL; this may have some use, but BCP 47 is a moving target;
We intend tag38 to be useful for the current form of BCP 47, so it is hard to plan for the future. If BCP 47 needs to be considered unstable, we could of course define a “bcp47-extension” alternative with a CDDL feature control operator.
(NOT!) Copying BCP 47 Grammar
=============================
I also agree with Harald that the definition of 'Language-Tagged Strings' has room for improvement. First, as Harald said, it repeats the BCP 47 grammar when we know very well that repeating grammars is usually a bad idea. I'm really not sure why CBOR wants to check each and every detail of the current language tag syntax. My understanding was that CBOR was (among other things, if not primarily) for constrained devices. I just cannot see the motivation for embedding a list of legacy tags into a constrained device.
I also don't know of any other technology on a similar level as CBOR that would do so. As an example, XML had productions 33-38 (see https://www.w3.org/TR/1998/REC-xml-19980210#sec-lang-tag), but they were removed as early as 2000 (see https://www.w3.org/TR/2000/REC-xml-20001006#sec-lang-tag), for very good reasons. I really have difficulty imagining why CBOR would want to make the same mistake that XML fixed more than 20 years ago.
Similarly, XML Schema Datatypes only gives a very simple regular expression ([a-zA-Z]{1,8}(-[a-zA-Z0-9]{1,8})*) and notes (see https://www.w3.org/TR/xmlschema11-2/#language):
[[[[
Note: The regular expression above provides the only normative constraint on the lexical and value spaces of this type. The additional constraints imposed on language identifiers by [BCP 47] and its successor(s), and in particular their requirement that language codes be registered with IANA or ISO if not given in ISO 639, are not part of this datatype as defined here.
]]]]
Again, XML Schema would have done something more precise if anybody had been convinced that such precision made sense.
We tend towards not reading ABNF in RFCs as “The code is more what you'd call 'guidelines' than actual rules” [1].
But if that is indeed the correct view of BCP47, simplifying the grammar to [a-zA-Z]{1,8}(-[a-zA-Z0-9]{1,8})* certainly is one way of adding flexibility.
➔ https://github.com/core-wg/core-problem-details/pull/40/commits/bbe72e2
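For illustration only, here is a minimal Python sketch of what checking against the simplified pattern amounts to. The helper name is hypothetical and this is not text from the PR; it performs a purely syntactic check, with no lookup in the IANA registry.

```python
import re

# The XML Schema "language" pattern quoted above; fullmatch() anchors it
# to the entire string.
LANGTAG_RE = re.compile(r"[a-zA-Z]{1,8}(-[a-zA-Z0-9]{1,8})*")


def looks_like_language_tag(tag: str) -> bool:
    """Hypothetical helper: syntactic check only, no registry lookup."""
    return LANGTAG_RE.fullmatch(tag) is not None
```

With this pattern, legacy tags such as "i-klingon" and private-use tags such as "x-private" pass without any special-casing, while a subtag longer than 8 characters is still rejected.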
Another way to see this is that in general, when giving restricting syntactic rules, there's the question of "bang for the buck". The complexity of the language tag syntax rules, down to the legacy (grandfathered) stuff, means that the cost ("buck") is quite high. This not only includes implementation and memory footprint, but also testing and everything else. […]
Most of the cost for this grammar was paid when RFC 5646 was written. Nobody is forced to validate against this grammar. But that is maybe water under the bridge with the above PR.
Regarding language tags, in addition, there is the following note:
[[[[
NOTE: The Unicode Standard [Unicode-14.0.0] includes a set of characters designed for tagging text (including language tagging), in the range U+E0000 to U+E007F. Although many applications, including RDF, do not disallow these characters in text strings, the Unicode Consortium has deprecated these characters and recommends annotating language via a higher-level protocol instead. See the section "Deprecated Tag Characters" in Section 23.9 of [Unicode-14.0.0].
]]]]
It's weird for the IETF to refer (only) to the Unicode standard here even though the IETF has deprecated this kind of language tagging in RFC 6082 (see https://www.rfc-editor.org/rfc/rfc6082.html). So please cite that RFC.
Good point.
Added to PR: https://github.com/core-wg/core-problem-details/pull/40/commits/a5d900d
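As a small illustration of the range mentioned in the note, a Python sketch that detects characters from the tag-character block U+E0000 to U+E007F might look like this. The function name is hypothetical, and what an application does on finding such characters is left open here.

```python
def contains_tag_characters(text: str) -> bool:
    """Hypothetical helper: True if any code point is in U+E0000..U+E007F."""
    return any(0xE0000 <= ord(ch) <= 0xE007F for ch in text)
```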
[…]
No, please. I understand that in some areas, you don't want to allow gratuitous changes to your network and software based on changes to technology that you use. But for language tags, such a mindset is really counterproductive. Some of the changes to BCP 47 that have been discussed are to include some subtags for dialects.
Got it, “guidelines not rules” :-) See PR above.
Note also that the sentence “Data items with tag 38 that do not meet the criteria above are invalid (see Section 5.3.2 of [STD94]).” is really hard to parse semantically, given that section 5.3.2 of RFC 8949 doesn’t use the word “invalid”, it uses “inadmissible value”. I do not recommend rejecting unknown language tags.
They may not be rejected, they are just not “valid” in the RFC 8949 sense (they are still well-formed). I would expect language tags to evolve within the grammar defined by RFC 5646 (which does have an extension point); if that is a mistaken assumption, please let us know.
In the short term (my average guess at "short term" would be 10 years or so), evolution *within* RFC 5646 is definitely the main focus. In the really long term, I guess anything that fits the XML Schema production is fair game. That restriction has been there since the original RFC 1766, and provides some actual "bang for the buck". It is also baked into technologies such as XML Schema, which would provide a very strong argument to not give up on it. In all the work on revising RFC 1766 (which I co-chaired, and which was quite long-winded), the rule that each subtag had to be 8 characters or less was never strongly disputed at all.
OK, see above.
Directionality Information
==========================
(Moved as suggested)
Thirdly, the definition of the tri-state direction attribute can be made clearer; in particular, the Unicode Bidirectional Algorithm (UAX#9) should be referenced, with particular reference to https://www.unicode.org/reports/tr9/tr9-44.html#Markup_And_Formatting - the important property here is that the desired semantic is isolation - the markup is intended to have zero influence on strings outside the embedded string - the semantics of embedding in RLI…PDI is the desired effect.
Tag38 does not provide a way to handle embedding, so we are not trying to boil that ocean yet.
Again, I agree with Harald here. But first, please be careful. "embedding" has a very narrow technical meaning in the Bidi Algorithm (UAX #9). Tag 38 doesn't need a way to handle embeddings in this sense. When Harald used the term "embedded string", he didn't use "embedded" in this very narrow technical sense, but in a more general sense, namely that the string from Tag 38 is expected to be put into some (surrounding) context. That might mean that it shows up by itself somewhere, or that it gets included in a larger text of some sorts.
Thank you for that clarification.
In the draft, you have the following text:
[[[[
The optional third element, if present, is a Boolean value that indicates a direction: false for "ltr" direction, true for "rtl" direction. If the third element is absent, no indication is made about the direction; it can be explicitly given as null to express the same while overriding any context that might be considered applying to this element. Note that the proper processing of Language and Direction Metadata is an active area of investigation; the reader is advised to consult ongoing standardization activities such as [STRING-META] when processing the information represented in this tag.
]]]]
[override is also a technical term in the Bidi Algorithm]
I think this text is very important, so I'll go into some detail. First (minor nit), it says "If the third element is absent ...". Because this is in a paragraph that starts with "The optional third element ...", I think it would be better to say "If this element is absent ...".
Replaced by (a form of) your text…
➔ https://github.com/core-wg/core-problem-details/pull/40/commits/bd588b9
Next, let me make sure that I get this right: This is a Boolean value, but it can in effect have four different states, yes? That would be:
- True (rtl)
- False (ltr)
- null (no indication about direction, but overriding any context)
- absent (no indication about direction, but context may apply)
If that's true, then it might be good to put that into a more structured form (something like the above list).
Thanks, see below. (A value that is absent is not a value; its representation by a null value may be needed to ~~override~~ reset any context available.)
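To make the four cases concrete, here is a rough Python sketch. It assumes the content of tag 38 has already been decoded into a Python list, with CBOR null mapped to None and CBOR booleans mapped to bool, as generic decoders typically do. The names Direction and direction_of are hypothetical, not taken from the draft.

```python
from enum import Enum


class Direction(Enum):
    LTR = "ltr"        # third element is false
    RTL = "rtl"        # third element is true
    RESET = "null"     # explicit null: no indication, context is reset
    ABSENT = "absent"  # no third element: no indication, context may apply


def direction_of(tag38_content: list) -> Direction:
    """Classify a decoded [language, text] or [language, text, dir] array."""
    if len(tag38_content) == 2:
        return Direction.ABSENT
    if len(tag38_content) != 3:
        raise ValueError("expected an array of two or three elements")
    third = tag38_content[2]
    if third is None:
        return Direction.RESET
    if third is True:
        return Direction.RTL
    if third is False:
        return Direction.LTR
    raise ValueError("third element must be a Boolean or null")
```

The point of the sketch is that "absent" is a property of the array length, not a value, whereas null is a value that has to be carried explicitly.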
[very major point] The main problem is with the last sentence. There's not much of a point in defining a field for directionality if it's not clear what that is supposed to be used for. I'm also not sure where the claim "the proper processing of Language and Direction Metadata is an active area of investigation" came from, and why it is here.
I believe this statement is rather important, as it does spell out the requirement to stay abreast with the developments in this space. The tag 38 information provides an input to the algorithm that we just need to assume will survive revisions to that algorithm; but the algorithm may be revised.
It is true that there are some areas of bidi processing (e.g. the best consistent way to display IRIs that contain pieces of text from both directionalities) that are not solved yet, or (as in the example just given) are not even actively being investigated because the general agreement is that the problem is too difficult to have a solution. It is also true that "Strings on the Web: Language and Direction Metadata" (https://www.w3.org/TR/string-meta/) is still in Draft status.
Hence the informative reference.
But neither of these facts should have to influence the specification of Tag 38. [StringMeta] (3.4 What consumers need to do to support direction, https://www.w3.org/TR/string-meta/#what_consumers_do), Harald and I all agree about what the right thing to do is: Use Bidi isolation (in the technical sense of https://www.unicode.org/reports/tr9/#Explicit_Directional_Isolates).
So given all the above considerations, what about rewriting the paragraph under consideration along the following lines:
[[[[
The optional third element, if present, is a Boolean value that indicates a direction, as follows:
- false: LTR direction. The text is expected to be displayed with LTR base direction if standalone, and isolated with LTR direction (enclosed in LRI ... PDI or equivalent, see [1]) in the context of a longer string or text.
- true: RTL direction. The text is expected to be displayed with RTL base direction if standalone, and isolated with RTL direction (enclosed in RLI ... PDI or equivalent, see [1]) in the context of a longer string or text.
- absent: no indication is made about the direction
- (explicit) null: no indication is made about the direction, but any directionality context applying to this element (e.g., base directionality information for an entire CBOR message or part thereof) is ignored.
]]]]
[1] Unicode® Standard Annex #9, Unicode Bidirectional Algorithm, Section 2.7 Markup and Formatting Characters, https://www.unicode.org/reports/tr9/#Markup_And_Formatting
Thank you; I massaged the text slightly in the above-mentioned PR, i.e.:
➔ https://github.com/core-wg/core-problem-details/pull/40/commits/bd588b9
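As a sketch of what "isolated with LTR/RTL direction" could mean for a consumer that inserts the string into a longer plain-text string: UAX #9 defines LRI (U+2066) for left-to-right isolates, RLI (U+2067) for right-to-left isolates, and PDI (U+2069) to close either. The function name is hypothetical, and returning the text unchanged when no direction is indicated is an assumption of this sketch, not something the draft prescribes.

```python
from typing import Optional

LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate_for_embedding(text: str, rtl: Optional[bool]) -> str:
    """Wrap text in a directional isolate for use inside a longer string.

    rtl is True for RTL, False for LTR, None for "no indication"
    (assumption of this sketch: text is then passed through unchanged).
    """
    if rtl is None:
        return text
    return (RLI if rtl else LRI) + text + PDI
```

Environments with their own markup (e.g., HTML) would use the equivalent markup-level isolation instead of these control characters.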
I'm not really sure yet about the 'absent' and 'null' entries: neither whether they are really distinct, nor whether the specification is good enough (we might want to specify FIRST STRONG ISOLATE semantics).
We could, but I’m not sure that part of “auto” semantics is as stable as the rest. The first character with strong directionality is often rather random and therefore can lead to surprising results. I would expect implementations to develop stronger heuristics here.
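To illustrate why the first strong character can be a fragile signal, here is a simplified Python sketch of a first-strong heuristic. It only looks at the Bidi_Class of each character via the standard unicodedata module and ignores the refinements in UAX #9 (such as skipping over characters inside isolates); the function name is hypothetical.

```python
import unicodedata
from typing import Optional


def first_strong_direction(text: str) -> Optional[str]:
    """Return "ltr" or "rtl" from the first strong character, else None."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return None
```

A mostly Hebrew or Arabic string that happens to start with a Latin acronym is classified as "ltr" by this approach, which is the kind of surprising result meant above.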
Grüße, Carsten
[1]: a line from “Pirates of the Caribbean”, spoken by a role whose name always reminds me of Bar BOFs :-)