[just a very minor correction to my comments below:
The heading "Directionality Information" should be moved down, just
below the text "boil that ocean".]
Regards, Martin.
On 2022-06-23 15:47, Martin J. Dürst wrote:
Dear Core and I18N experts,
Some comments on the I18N aspects of Tag 38 below.
[Sorry this answer took so long, and got so long. The two 'long's
influenced each other :-).]
On 2022-06-16 01:23, Carsten Bormann wrote:
Hi Harald,
thank you for this thoughtful review.
The “Tag 38 internationalized string”
This document adds an appendix defining an “internationalized string”
format that adds a BCP 47 language tag and a Unicode-based direction
indicator to a UTF-8 string. This is laudable; RFC 2277 section 4
pointed out the need for this ability 24 years ago.
I think that Language-Tagged Strings (CBOR Tag 38,
https://datatracker.ietf.org/doc/html/draft-ietf-core-problem-details-06#appendix-A)
are a very good step ahead. At least for CBOR, in many cases from now
on, the answer might just be "use Tag 38" (assuming we get the details
right).
Unfortunately neither definition is problem-free.
First of all, this tag, if useful at all, is of far greater utility
than the error format. Burying it in an appendix of a document whose
stated purpose is something else makes it far more difficult to refer
to than it needs to be.
That is usually not a problem. The focal point for finding a CBOR tag
for a specific application is the CBOR tag registry; this then points
to the places where the specifications for the tags can be found
(which in this case is easily expressed as “Appendix A of RFC XXXX”).
Separate Draft or Not
=====================
I agree with Harald that it should be a separate draft; it would
definitely help with visibility of I18N in general and the issue of
strings with language and directionality information inside and outside
the IETF (not only the visibility within the CBOR community, which may
be covered by the tag registry). Being able to say "look at RFC XXXX for
a good example" is way better than being able to say "look at appendix X
of RFC YYYY for a good example".
I understand Francesca's arguments, too, but I think the investment in a
separate draft would be well worth the effort. I'm willing to contribute
although I guess that Carsten would do the necessary work in less time
than it takes him to get anybody else up to speed.
Second, the “detailed semantics” has chosen to include the quite
complex BNF of RFC 5646 translated into CDDL; this may have some use,
but BCP 47 is a moving target;
We intend tag38 to be useful for the current form of BCP 47, so it is
hard to plan for the future. If BCP 47 needs to be considered
unstable, we could of course define a “bcp47-extension” alternative
with a CDDL feature control operator.
(NOT!) Copying BCP 47 Grammar
=============================
I also agree with Harald that the definition of 'Language-Tagged
Strings' has room for improvement. First, as Harald said, it repeats the
BCP 47 grammar when we very well know that repeating grammars is usually
a bad idea. I'm really not sure why CBOR wants to check each and every
detail of the current language tag syntax. My understanding was that
CBOR was (among other things, if not primarily) for constrained devices.
I just cannot see the motivation for embedding a list of legacy tags
into a constrained device.
I also don't know of any other technology at a similar level to CBOR
that does so. As an example, XML had productions 33-38 (see
https://www.w3.org/TR/1998/REC-xml-19980210#sec-lang-tag), but they were
removed as early as 2000 (see
https://www.w3.org/TR/2000/REC-xml-20001006#sec-lang-tag), for very good
reasons. I really have difficulty imagining why CBOR would want to
make the same mistake that XML fixed more than 20 years ago.
Similarly, XML Schema Datatypes only gives a very simple regular
expression ([a-zA-Z]{1,8}(-[a-zA-Z0-9]{1,8})*) and notes
(see https://www.w3.org/TR/xmlschema11-2/#language):
[[[[
Note: The regular expression above provides the only normative
constraint on the lexical and value spaces of this type. The additional
constraints imposed on language identifiers by [BCP 47] and its
successor(s), and in particular their requirement that language codes be
registered with IANA or ISO if not given in ISO 639, are not part of
this datatype as defined here.
]]]]
Again, XML Schema would have done something more precise if anybody had
been convinced that such precision made sense.
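To make the point concrete, here is a minimal Python sketch of such a
check, using only the regular expression quoted above. The function
name looks_like_language_tag is made up for illustration; nothing like
it appears in XML Schema or in the draft.

```python
import re

# The pattern quoted above from XML Schema Datatypes (xs:language).
XSD_LANGUAGE = re.compile(r"[a-zA-Z]{1,8}(-[a-zA-Z0-9]{1,8})*")


def looks_like_language_tag(value: str) -> bool:
    """Plausibility check only; it says nothing about registry validity."""
    return XSD_LANGUAGE.fullmatch(value) is not None


if __name__ == "__main__":
    for candidate in ["fr", "en-US", "en-UK", "Hello, world", "abcdefghi"]:
        print(candidate, looks_like_language_tag(candidate))
```

Such a check accepts "en-UK" just as the full grammar would, but it
does reject ordinary text that ended up in the language tag slot by
accident.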
Another way to see this is that in general, when imposing restrictive
syntactic rules, there's the question of "bang for the buck". The
complexity of the language tag syntax rules, down to the legacy
(grandfathered) stuff, means that the cost ("buck") is quite high. This
includes not only implementation and memory footprint, but also testing
and everything else.
On the other hand, the "bang" is quite low, for two reasons:
First, without a check against the registry, a lot of garbage can still
get through. Think e.g. of "en-UK", which looks reasonable and fits the
grammar, but is not allowed (UK is not a country code; "en-GB" is
correct). Second, most actual language tags, in particular for
constrained devices, are more on the level of "fr" or "en-US", which
means that on most actual data, the full syntax isn't really exercised.
This in turn means that software with implementation bugs in the
syntax-checking part doesn't get weeded out.
The main mechanisms (if any) that will help to make sure these language
tags are correct are the following:
1) On the 'sender' side, texts will be translated, by "hand" or using
some localization tools, and the correct language tags will be set there
(because somebody translating to Ukrainian, or their tool, knows the
correct tag is "uk", and not something else).
2) On the 'receiver' side, user preferences will be expressed as
language tags (or prefixes, ...), which should ensure that correctly
tagged data gets shown and incorrectly tagged data gets ignored.
To summarize, copying the grammar from BCP 47 brings extremely little
bang for rather high costs. Get rid of it in the same way that other
standards which have thought this through have gotten rid of a detailed
grammar. If you want something that gives you a minimal plausibility
test (to catch cases where e.g. the text and the language tag got swapped
by some accident, ...), do what XML Schema did.
This will also be future proof. There are many changes to BCP 47 that
have been discussed in the past (although none of these got traction, or
are expected to get traction in the near future), but changing the basic
syntax constraint expressed by XML Schema was never considered an
option. On the other hand, it was always clear to the people involved
that users of language tags shouldn't create artificial barriers to
future changes. It would really be a pity if CBOR created such a barrier
just because they could. Things such as "CDDL feature control operators"
are great where they actually serve a purpose; here I don't think they
would.
Directionality Information
==========================
Regarding language tags, in addition, there is the following note:
[[[[
NOTE: The Unicode Standard [Unicode-14.0.0] includes a set of
characters designed for tagging text (including language tagging), in
the range U+E0000 to U+E007F. Although many applications, including
RDF, do not disallow these characters in text strings, the Unicode
Consortium has deprecated these characters and recommends annotating
language via a higher-level protocol instead. See the section
"Deprecated Tag Characters" in Section 23.9 of [Unicode-14.0.0].
]]]]
It's weird for the IETF to refer (only) to the Unicode standard here
even though the IETF has deprecated this kind of language tagging in RFC
6082 (see https://www.rfc-editor.org/rfc/rfc6082.html). So please cite
that RFC.
having CDDL parsers try to validate tags according to this grammar is
not going to be useful. If included at all, this needs to be clearly
marked with text saying that BCP 47 is normative for this grammar, and
that language tag parsers should NOT try to reject tags based on this
grammar; instead, they should be treated as strings, and looked up
against relevant language handling APIs. (“zh-ZZ” is perfectly valid
according to the grammar, but is semantically invalid according to
BCP 47).
Here again, it is hard to capture semantics in a structural definition.
Our document is going to reference RFC 5646 (including its ABNF), as
that is the current definition; if BCP 47 is updated, the effect of
that update on this document will need new consideration.
No, please. I understand that in some areas, you don't want to allow
gratuitous changes to your network and software based on changes to
technology that you use. But for language tags, such a mindset is really
counterproductive. Some of the changes to BCP 47 that have been
discussed are to include some subtags for dialects. Now if such a change
happened, there are two questions relevant for CBOR:
1) How many cases would there be in the CBOR landscape where people
would want to use such subtags? The answer would probably be: Very few,
so a change (using a "CDDL feature control operator" or whatever) would
have very low priority. But why should people be prohibited from using
such subtags if they want to use them?
2) What's the problem in letting such subtags through the current
infrastructure? My guess is that there's no problem at all. When there
are parallel texts, one tagged with "en-US" and the other with one of
these dialect subtags, the chance is very high that a recipient will be
displaying the former. Would that be a problem?
Note also that the sentence “Data items with tag
38 that do not meet the criteria above are invalid (see Section 5.3.2 of
[STD94]).” is really hard to parse semantically, given that section
5.3.2 of RFC 8949 doesn’t use the word “invalid”, it uses “inadmissible
value”. I do not recommend rejecting unknown language tags.
They may not be rejected, they are just not “valid” in the RFC 8949 sense
(they are still well-formed). I would expect language tags to evolve
within the grammar defined by RFC 5646 (which does have an extension
point); if that is a mistaken assumption, please let us know.
In the short term (my average guess at "short term" would be 10 years or
so), evolution *within* RFC 5646 is definitely the main focus. In the
really long term, I guess anything that fits the XML Schema production
is fair game. That restriction has been there since the original RFC
1766, and provides some actual "bang for the buck". It is also baked
into technologies such as XML Schema, which would provide a very strong
argument to not give up on it. In all the work on revising RFC 1766
(which I co-chaired, and which was quite long-winded), the rule
that each subtag had to be 8 characters or less was never seriously
disputed.
Thirdly, the definition of the tri-state direction attribute can be made
clearer; in particular, the Unicode Bidirectional Algorithm (UAX#9)
should be referenced, with particular reference to
https://www.unicode.org/reports/tr9/tr9-44.html#Markup_And_Formatting -
the important property here is that the desired semantic is isolation -
the markup is intended to have zero influence on strings outside the
embedded string - the semantics of embedding in RLI…PDI is the desired
effect.
Tag38 does not provide a way to handle embedding, so we are not trying
to boil that ocean yet.
Again, I agree with Harald here. But first, please be careful.
"embedding" has a very narrow technical meaning in the Bidi Algorithm
(UAX #9). Tag 38 doesn't need a way to handle embeddings in this sense.
When Harald used the term "embedded string", he didn't use "embedded" in
this very narrow technical sense, but in a more general sense, namely
that the string from Tag 38 is expected to be put into some
(surrounding) context. That might mean that it shows up by itself
somewhere, or that it gets included in a larger text of some sort.
In the draft, you have the following text:
[[[[
The optional third element, if present, is a Boolean value that
indicates a direction: false for "ltr" direction, true for "rtl"
direction. If the third element is absent, no indication is made
about the direction; it can be explicitly given as null to express
the same while overriding any context that might be considered
applying to this element. Note that the proper processing of
Language and Direction Metadata is an active area of investigation;
the reader is advised to consult ongoing standardization activities
such as [STRING-META] when processing the information represented in
this tag.
]]]]
[override is also a technical term in the Bidi Algorithm]
I think this text is very important, so I'll go into some detail.
First (minor nit), it says "If the third element is absent ...". Because
this is in a paragraph that starts with "The optional third element
...", I think it would better say "If this element is absent ...".
Next, let me make sure that I get this right: This is a Boolean value,
but it can in effect have four different states, yes? That would be:
- True (rtl)
- False (ltr)
- null (no indication about direction, but overriding any context)
- absent (no indication about direction, but context may apply)
If that's true, then it might be good to put that into a more structured
form (something like the above list).
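To illustrate the four states, here is a minimal Python sketch. It
assumes that the content of Tag 38 has already been decoded into a
Python list of the form [language tag, text, direction?], with CBOR
null mapped to None; that layout, and the names Direction and
direction_state, are assumptions made for illustration only.

```python
from enum import Enum


class Direction(Enum):
    LTR = "ltr"        # third element is false
    RTL = "rtl"        # third element is true
    NULL = "null"      # third element is an explicit null
    ABSENT = "absent"  # there is no third element


def direction_state(content: list) -> Direction:
    """Classify the optional third element of a decoded Tag 38 array."""
    if len(content) < 3:
        return Direction.ABSENT
    third = content[2]
    if third is None:
        return Direction.NULL
    if third is True:
        return Direction.RTL
    if third is False:
        return Direction.LTR
    raise ValueError("direction must be true, false, or null")


if __name__ == "__main__":
    print(direction_state(["en", "Hello"]))
    print(direction_state(["he", "שלום", True]))
```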
[very major point] The main problem is with the last sentence. There's
not much of a point in defining a field for directionality if it's not
clear what that is supposed to be used for. I'm also not sure where the
claim "the proper processing of Language and Direction Metadata is an
active area of investigation" came from, and why it is here.
It is true that there are some areas of bidi processing (e.g. the best
consistent way to display IRIs that contain pieces of text from both
directionalities) that are not solved yet, or (as in the example just
given) are not even being actively investigated, because the general
agreement is that the problem is too difficult to have a solution.
It is also true that "Strings on the Web: Language and Direction
Metadata" (https://www.w3.org/TR/string-meta/) is still in Draft status.
But neither of these facts should have to influence the specification of
Tag 38. [StringMeta] (3.4 What consumers need to do to support
direction, https://www.w3.org/TR/string-meta/#what_consumers_do), Harald
and I all agree about what the right thing to do is: Use Bidi isolation
(in the technical sense of
https://www.unicode.org/reports/tr9/#Explicit_Directional_Isolates).
So given all the above considerations, what about rewriting the
paragraph under consideration along the following lines:
[[[[
The optional third element, if present, is a Boolean value that
indicates a direction, as follows:
- false: LTR direction. The text is expected to be displayed
with LTR base direction if standalone, and isolated with LTR
direction (enclosed in LRI ... PDI or equivalent, see [1]) in
the context of a longer string or text.
- true: RTL direction. The text is expected to be displayed
with RTL base direction if standalone, and isolated with RTL
direction (enclosed in RLI ... PDI or equivalent, see [1]) in
the context of a longer string or text.
- absent: no indication is made about the direction
- (explicit) null: no indication is made about the direction,
but any directionality context applying to this element (e.g.,
base directionality information for an entire CBOR message or
part thereof) is ignored.
]]]]
[1] Unicode® Standard Annex #9, Unicode Bidirectional Algorithm, Section
2.7 Markup and Formatting Characters,
https://www.unicode.org/reports/tr9/#Markup_And_Formatting
I'm not really sure yet about the 'absent' and 'null' entries, neither
whether they are really distinct nor whether the specification is good
enough (we might want to specify FIRST STRONG ISOLATE semantics).
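For readers less familiar with the isolate characters, here is a small
Python sketch of what isolation could look like when a Tag 38 string is
inserted into a longer plain-text string. The code points are those
defined in UAX #9; the function name isolate is made up, and using
FIRST STRONG ISOLATE when no direction is given is only the possibility
mentioned above, not something the draft specifies.

```python
from typing import Optional

LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate(text: str, rtl: Optional[bool]) -> str:
    """Wrap text in a directional isolate.

    rtl is True for RTL, False for LTR, and None when no direction is
    indicated (assumption: fall back to FSI in that case).
    """
    if rtl is True:
        opener = RLI
    elif rtl is False:
        opener = LRI
    else:
        opener = FSI
    return opener + text + PDI


if __name__ == "__main__":
    message = "Title: " + isolate("שלום עולם", True) + " (draft)"
    print(message.encode("unicode_escape").decode("ascii"))
```

In markup environments, the equivalent would be an isolating element
or attribute rather than these control characters.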
Hope this helps. Let's make sure together that we get this right.
Regards, Martin.
--
Prof. Dr.sc. Martin J. Dürst
Department of Intelligent Information Technology
College of Science and Engineering
Aoyama Gakuin University
Fuchinobe 5-1-10, Chuo-ku, Sagamihara
252-5258 Japan