Hello Carsten,
On 2022-06-24 05:35, Carsten Bormann wrote:
Hi Martin,
thank you for these comments.
Thanks for your very quick reply and action.
Late changes are always risky,
Well, the review was requested on May 27, and it's now exactly 4 weeks
since then. If I didn't feel I had to explain some of the main points in
Harald's review in great detail, it might have been just 2.5 weeks. If
that's late, let's try to make sure I18N reviews happen earlier.
(see https://datatracker.ietf.org/group/i18ndir/reviews/)
but we think these comments do lead to a very desirable improvement of the document.
Great!
We have prepared a pull request at:
https://github.com/core-wg/core-problem-details/pull/40
I have looked at the pull request. It mostly looks good. See below for
more details.
[snip]
Separate Draft or Not
=====================
I agree with Harald that it should be a separate draft; it would definitely help with visibility of I18N in general and the issue of strings with language and directionality information inside and outside the IETF (not only the visibility within the CBOR community, which may be covered by the tag registry). Being able to say "look at RFC XXXX for a good example" is way better than being able to say "look at appendix X of RFC YYYY for a good example".
Actually, “look at RFC XXXX for a good example” is going to be the outcome of the combined document, because the document not only defines tag 38 (in Appendix A), but also shows a couple of examples that use it (in the main body) and even an instance where we decided to unravel it (SPDe -6/-7). So I’m in favor of keeping this document together.
(NOT!) Copying BCP 47 Grammar
=============================
Similarly, XML Schema Datatypes only gives a very simple regular expression ([a-zA-Z]{1,8}(-[a-zA-Z0-9]{1,8})*) and notes
(see https://www.w3.org/TR/xmlschema11-2/#language):
[[[[
Note: The regular expression above provides the only normative constraint on the lexical and value spaces of this type. The additional constraints imposed on language identifiers by [BCP 47] and its successor(s), and in particular their requirement that language codes be registered with IANA or ISO if not given in ISO 639, are not part of this datatype as defined here.
]]]]
Again, XML Schema would have done something more precise if anybody had been convinced that such precision made sense.
We tend towards not reading ABNF in RFCs as “The code is more what you'd call 'guidelines' than actual rules” [1].
It's not about 'guidelines' vs. actual rules. It's about preserving the
possibility of future changes and keeping specifications as independent
as possible if that can be done at no or low cost (or as in this
example, actually by reducing potential implementation costs as a side
effect).
But if that is indeed the correct view of BCP47, simplifying the grammar to [a-zA-Z]{1,8}(-[a-zA-Z0-9]{1,8})* certainly is one way of adding flexibility.
➔
https://github.com/core-wg/core-problem-details/pull/40/commits/bbe72e2
Great!
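For illustration only, here is a minimal Python sketch of what checking against the simplified pattern could look like. The function name is made up for this example; the pattern is the one quoted above from XML Schema Datatypes. It checks shape only and says nothing about whether the subtags are registered.

```python
import re

# The simple pattern quoted above; fullmatch() anchors it at both ends.
SIMPLE_LANGUAGE_TAG = re.compile(r"[a-zA-Z]{1,8}(-[a-zA-Z0-9]{1,8})*")


def looks_like_language_tag(tag: str) -> bool:
    """Hypothetical helper: True if `tag` has the overall shape of a language tag.

    This is a purely syntactic check; it does not consult the IANA registry.
    """
    return SIMPLE_LANGUAGE_TAG.fullmatch(tag) is not None
```

With this, "de", "de-CH-1996" and the legacy tag "i-klingon" all pass, while "en-" and the empty string do not.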
Another way to see this is that in general, when giving restricting syntactic rules, there's the question of "bang for the buck". The complexity of the language tag syntax rules, down to the legacy (grandfathered) stuff, mean that the cost ("buck") is quite high. This not only includes implementation and memory footprint, but also testing and everything else.
[…]
Most of the cost for this grammar was paid when RFC 5646 was written.
No, sorry, writing a spec is never the main cost.
Nobody is forced to validate against this grammar.
Yes, but at least some people tend to do so when they see grammar rules
in front of them.
But that is maybe water under the bridge with the above PR.
Yes indeed.
It's weird for the IETF to refer (only) to the Unicode standard here even though the IETF has deprecated this kind of language tagging in RFC 6082. (see https://www.rfc-editor.org/rfc/rfc6082.html) So please cite that RFC.
Good point.
Added to PR:
https://github.com/core-wg/core-problem-details/pull/40/commits/a5d900d
Great.
Directionality Information
==========================
is also a technical term in the Bidi Algorithm]
I think this text is very important, so I'll go into some details. First (minor nit), it says "If the third element is absent ...". Because this is in a paragraph that starts with "The optional third element ...", I think it would be better to say "If this element is absent ...".
Replaced by (a form of) your text…
Great progress, but I think we need a bit more progress, or at least
some more careful checking.
➔
https://github.com/core-wg/core-problem-details/pull/40/commits/bd588b9
Next, let me make sure that I get this right: This is a Boolean value, but it can in effect have four different states, yes? That would be:
- True (rtl)
- False (ltr)
- null (no indication about direction, but overriding any context)
- absent (no indication about direction, but context may apply)
If that's true, then it might be good to put that into a more structured form (something like the above list).
Thanks, see below. (A value that is absent is not a value; its representation by a null value may be needed to ~~override~~ reset any context available.)
In one of the patches, you collapsed my four-point list to three points.
I'm still not sure I really get this thing with absent and null. Let's
say we have the following two problem details (very sketchy, obviously
not the right syntax):
First variant
-------------
- problem-details
- title
- lang: de
- text: Das ist ein Titel
[dir absent]
- base-rtl: true
Second variant
--------------
- problem-details
- title
- lang: de
- text: Das ist ein Titel
- dir: null
- base-rtl: true
Here are my questions: Is there a difference between the first and the
second variant? Saying "its representation by a null value may be needed
to reset any context available." seems to suggest that there is a
difference; inheritance ("Das ist ein Titel" being RTL) when absent, and
no inheritance ("Das ist ein Titel" being of undefined directionality)
when null. If that's true, why did you reduce the four choices to three?
If it's not true, why not?
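To make the question concrete, here is a small Python sketch of the reading in which the two variants differ. It is only my interpretation for the sake of discussion, not what the draft says; the names ABSENT, Dir and effective_direction are invented for this example, and context_rtl stands in for the sketchy "base-rtl" above.

```python
from enum import Enum


class Dir(Enum):
    LTR = "ltr"
    RTL = "rtl"
    UNSPECIFIED = "unspecified"


# Hypothetical sentinel meaning "the third element is not present at all".
ABSENT = object()


def effective_direction(dir_element, context_rtl=None):
    """Resolve a direction under the 'absent inherits, null resets' reading.

    dir_element: True (rtl), False (ltr), None (explicit null) or ABSENT.
    context_rtl: True/False if an enclosing context gives a base direction,
                 None if there is no such context (assumed parameter).
    """
    if dir_element is ABSENT:
        # First variant: nothing said locally, so the context applies.
        if context_rtl is None:
            return Dir.UNSPECIFIED
        return Dir.RTL if context_rtl else Dir.LTR
    if dir_element is None:
        # Second variant: explicit null resets whatever the context says.
        return Dir.UNSPECIFIED
    return Dir.RTL if dir_element else Dir.LTR
```

Under this reading, the first variant gives effective_direction(ABSENT, True), which is RTL, and the second gives effective_direction(None, True), which is unspecified. If the two are meant to behave the same, the ABSENT branch and the None branch would collapse into one.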
[very major point] The main problem is with the last sentence. There's not much of a point in defining a field for directionality if it's not clear what that is supposed to be used for. I'm also not sure where the claim "the proper processing of Language and Direction Metadata is an active area of investigation" came from, and why it is here.
I believe this statement is rather important, as it does spell out the requirement to stay abreast with the developments in this space. The tag 38 information provides an input to the algorithm that we just need to assume will survive revisions to that algorithm; but the algorithm may be revised.
Do you mean the Unicode Bidirectional Algorithm? It indeed gets reissued
with every new Unicode version, which means roughly once every year.
That's just how the Unicode consortium works, something between "living
standard" and RFCs that are stable as long as nobody has the time to
write an update. But if you look at the substance (going back from
https://www.unicode.org/reports/tr9/tr9-45.html version by version by
changing '45' to lower and lower values), you'll see that there is
exactly one major change, at Unicode Version 6.3 in 2013
(https://www.unicode.org/reports/tr9/tr9-29.html), where isolates (LRI
and RLI) were introduced. And that change was years in the making, with
several talks at Internationalization and Unicode Conferences about the
problems posed by embeddings (LRE and RLE). There's no such change in
sight that I'm aware of currently.
So the sentence "Note that the proper processing of Language and
Direction Metadata is an active area of investigation; the reader is
advised to consult ongoing standardization activities such as
[STRING-META] when processing the information represented in this tag."
will produce one of two effects, both highly undesired:
1) An implementer who seriously wants to do the right thing will get
lost in the woods.
2) An implementer inclined to cut corners will just ignore the whole
directionality stuff.
Of course, the right thing for an implementer who just wants to make
sure the text pieces get displayed so that they are easy for a user to
read is
3) to just rely on a bidi library (usually just by sending the right
pieces of text and bidi control characters or markup or whatsoever to
the display engine in the underlying OS or so).
We should make sure that the text of the RFC-to-be encourages 3), not 1)
or 2).
It is true that there are some areas of bidi processing (e.g. the best consistent way to display IRIs that contain pieces of text from both directionalities) that are not solved yet, or even (as in the example a line ago) are not even actively being investigated because the general agreement is that the problem is too difficult to have a solution.
It is also true that "Strings on the Web: Language and Direction Metadata" (https://www.w3.org/TR/string-meta/) is still in Draft status.
But that's not relevant. string-meta isn't a technical spec written for
implementers, it's a meta-spec written for spec writers and similar
folks. It's also describing a very wide range of approaches, while you
have already decided on an approach, because you need an approach, but
not several. You don't want and don't need your implementers to go back
and see what other approaches you may have taken, because they have to
implement the approach that you choose, and no other approach.
Please also note that a meta-spec such as string-meta is usually behind
in the development cycle when compared with real specs. Ideally, it
would be the other way round, but it often takes time and several
examples to figure out what needs to go into a meta-spec. In addition,
language in a meta-spec is more abstract and therefore more difficult to
write. Also, in particular in I18N, the day-to-day business of reviewing
actual specifications always takes precedence, and comes with deadlines
and time pressure (see the example at hand), whereas there
isn't really any deadline for meta-specs.
[Just to avoid any potential confusion: The meta in string-meta is there
because string-meta is discussing information about strings; the meta in
meta-spec is there because a meta-spec discusses stuff about other specs.]
Hence the informative reference.
But neither of these facts should have to influence the specification of Tag 38. [StringMeta] (3.4 What consumers need to do to support direction, https://www.w3.org/TR/string-meta/#what_consumers_do), Harald and I all agree about what the right thing to do is: Use Bidi isolation (in the technical sense of https://www.unicode.org/reports/tr9/#Explicit_Directional_Isolates).
So given all the above considerations, what about rewriting the paragraph under consideration along the following lines:
[[[[
The optional third element, if present, is a Boolean value that
indicates a direction, as follows:
- false: LTR direction. The text is expected to be displayed
with LTR base direction if standalone, and isolated with LTR
direction (enclosed in LRI ... PDI or equivalent, see [1]) in
the context of a longer string or text.
- true: RTL direction. The text is expected to be displayed
with RTL base direction if standalone, and isolated with RTL
direction (enclosed in RLI ... PDI or equivalent, see [1]) in
the context of a longer string or text.
- absent: no indication is made about the direction
- (explicit) null: no indication is made about the direction,
but any directionality context applying to this element (e.g.,
base directionality information for an entire CBOR message or
part thereof) is ignored.
]]]]
[1] Unicode® Standard Annex #9, Unicode Bidirectional Algorithm, Section 2.7 Markup and Formatting Characters, https://www.unicode.org/reports/tr9/#Markup_And_Formatting
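As a rough sketch of what "isolated ... in the context of a longer string" amounts to for an implementer working with plain text, the following Python wraps a text in the isolate characters of UAX #9: LRI (U+2066) for LTR, RLI (U+2067) for RTL, closed by PDI (U+2069). The function name is made up, and using FSI (U+2068) when no direction is given is purely an assumption for this example; whether to specify that is still open.

```python
# Isolate formatting characters from UAX #9.
LRI = "\u2066"  # LEFT-TO-RIGHT ISOLATE
RLI = "\u2067"  # RIGHT-TO-LEFT ISOLATE
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate(text, rtl=None):
    """Hypothetical helper: wrap `text` for insertion into a longer string.

    rtl=True  -> RLI ... PDI
    rtl=False -> LRI ... PDI
    rtl=None  -> FSI ... PDI (assumption, not something the draft specifies)
    """
    if rtl is True:
        opener = RLI
    elif rtl is False:
        opener = LRI
    else:
        opener = FSI
    return opener + text + PDI


# Example: embedding a German title in a longer message.
message = "Title: " + isolate("Das ist ein Titel", rtl=False)
```

In an HTML context the equivalent would be markup rather than control characters, and in most cases the actual reordering is left to the display engine.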
Thank you; I massaged the text slightly in the above-mentioned PR, i.e.:
➔
https://github.com/core-wg/core-problem-details/pull/40/commits/bd588b9
I'm not really sure yet about the 'absent' and 'null' entries, neither whether they are really distinct nor whether the specification is good enough (we might want to specify FIRST STRONG ISOLATE semantics).
We could, but I’m not sure that part of “auto” semantics is as stable as the rest.
In TR #9, the auto semantics is as stable as the others. FSI (first
strong isolate) was introduced in Unicode 6.3 together with the other
isolates. And the "first strong" rule was already present from the start
of the Bidi Algorithm and continues to be there until today for the
overall paragraph direction (see
https://www.unicode.org/reports/tr9/#P2). That also means that you get
exactly these semantics if you just put every Tag 38 text on its own
line (paragraph) e.g. in Windows Notepad. That also means that the average
user of an RTL script is familiar with this behavior and what to do if
it doesn't do the right thing.
The first character with strong directionality is often rather random and therefore can lead to surprising results. I would expect implementations to develop stronger heuristics here.
Stronger heuristics might be marginally better, but they are not widely
used, and I don't expect TR #9 to introduce new ones. One clear
advantage of the "first strong" rule is that wrong results are easy to
fix by adding an LRM or RLM character at the start. A fully correct
decision about what base directionality to use to display a given text
requires human understanding.
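To show how little is involved, here is a simplified Python sketch of the "first strong" rule and of the LRM/RLM fix. It uses the standard unicodedata module; the function name is invented, and the sketch deliberately leaves out the part of rule P2 that skips over characters inside isolates, so it is an approximation and not a conforming implementation.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK, bidi class L
RLM = "\u200f"  # RIGHT-TO-LEFT MARK, bidi class R


def first_strong_is_rtl(text):
    """Hypothetical helper: approximate the 'first strong' rule.

    Returns False if the first strong character is L, True if it is R or AL,
    and None if the text contains no strong character at all.
    Simplification: isolate contents are not skipped as rule P2 requires.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return False
        if bidi_class in ("R", "AL"):
            return True
    return None


# A Hebrew text that happens to start with a Latin word comes out as LTR;
# prefixing an RLM fixes that.
hebrew_with_latin_start = "HTML \u05e9\u05dc\u05d5\u05dd"
assert first_strong_is_rtl(hebrew_with_latin_start) is False
assert first_strong_is_rtl(RLM + hebrew_with_latin_start) is True
```

The second assertion is the whole fix: one invisible character at the start, with no change to the text itself.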
Regards, Martin.
Grüße, Carsten
[1]: a line from “Pirates of the Caribbean”, spoken by a role whose name always reminds me of Bar BOFs :-)
--
last-call mailing list
last-call@xxxxxxxx
https://www.ietf.org/mailman/listinfo/last-call