[Last-Call] Re: [art] Artart telechat review of draft-ietf-jmap-contacts-09

Steffen Nurpmeso <steffen@xxxxxxxxxx> · Mon, 20 May 2024 22:12:10 +0200

Hello.

Dale R. Worley wrote in
 <87ikz9e0jq.fsf@xxxxxxxxxxxxxxxxxxxxx>:
 |Steffen Nurpmeso <steffen@xxxxxxxxxx> writes:
 |
 |I don't know what the larger problems might be with
 |draft-ietf-jmap-contacts-09, but I think there is less trouble with this
 |particular point than first appears:
 |
 |> I myself wonder whether that innocent RFC 9553 sentence
 |>
 |>   any valid sequence of Unicode characters encoded as a JSON string
 |>
 |> excludes surrogates?
 |
 |It definitely does, because within the Unicode lexicon, a "surrogate" is
 |a code point, but not a code point that is assigned to a "character".
 |Thus surrogates are not "characters" and cannot be members of a "valid
 |sequence of Unicode characters".  I haven't found a really definite
 |statement of this, but that is clear from both
 |https://en.wikipedia.org/wiki/Unicode#Architecture_and_terminology and
 |https://www.unicode.org/versions/Unicode15.1.0/ch02.pdf

But wait!
A surrogate pair is valid Unicode once it is "unfolded" to the plain
Unicode code point it was before being split into surrogates.
Nothing in the words of this RFC prevents anyone from taking a
valid Unicode string and stuffing it, surrogatized, into a JSON
string, whose syntax allows exactly these surrogate escapes.
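
A minimal sketch of that point, assuming Python's standard json module as the parser (other parsers may reject the lone case); the helper name is_well_formed is only illustrative:

```python
import json

# A surrogate *pair* written as JSON escapes decodes to one ordinary code point.
paired = json.loads('"\\ud83d\\ude00"')

# A *lone* surrogate escape is accepted by this particular parser as well.
lone = json.loads('"\\ud83d"')

def is_well_formed(s: str) -> bool:
    """True if s holds no unpaired surrogate code points.

    Relies on the fact that the UTF-8 codec refuses to encode surrogates.
    """
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False
```

So the JSON syntax carries both cases, and only a separate check tells the "unfolded" pair from a stray surrogate.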

 |Note that Unicode's "character" can be a bit messy.  E.g. "lower case a
 |with umlaut" can be either a single "character" U+00E4 or two
 |"characters", U+0061 followed by the combining dieresis U+0308.  Or for
 |a particularly hairy ligature in one of the Brahmic scripts, see figure
 |2-3 in the Unicode document I linked to above, which combines no less
 |than 6 "characters" into one rendered glyph.

Decomposing, normalizing, etc.
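
For the a-umlaut case quoted above, a small sketch with Python's unicodedata module; comparing after NFC is only one possible choice, and same_text is a hypothetical helper name:

```python
import unicodedata

composed = "\u00e4"      # LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308"   # "a" followed by COMBINING DIAERESIS

def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

The two strings differ code point by code point, yet compare equal once both are normalized.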

 |> It should, but it then actively changes the
 |> meaning of "JSON string" to be a dedicated "sub-profile" of what
 |> "JSON string" normally means, and then to me the sentence is not
 |> clear enough.
 |
 |In principle, you don't need to *define* a profile (sub-specification) of
 |JSON to say e.g. "the thing must be a JSON string encoding of a sequence
 |of ASCII letters", though of course in that case the set of "things"
 |*will be* only a subset of JSON string encodings.
 |
 |But in this case, looking at RFC 4627 sec. 2.5, "Strings", it's clear
 |(though not directly stated) that a JSON string representation will be a
 |sequence of ASCII characters that represent a sequence of Unicode
 |characters.  So the limitation in this draft to "Unicode characters"
 |matches what the definition of JSON allows, and as such there is no
 |subsetting.
 |
 |> This seems not to mean entire grapheme clusters.  And this seems
 ...
 |> does not make sense at all.
 |
 |I think that's incorrect because there's no requirement that a Unicode
 |character passes an "isprint" test.  And the Unicode "general category"
 |attribute for characters/code points has values like "other, control"
 |and "other, format" that are specified as "characters" but they're not
 |"printable" in the ordinary sense.  See
 |https://en.wikipedia.org/wiki/Unicode#General_Category_property

To be very honest, i will now tell you what has happened.
I have no idea.
But i tell you what has happened.
It is only a fiction, mind you.

So back when this RFC was being developed, there suddenly appeared
that BIDI (bi-directional text) security advisory all over the
software world, in compilers for example, but also text editors --
everywhere!  (To recall: via directional Unicode controls a user
would see a visual sentence "A", but the software would actually
work on a sentence "B", the one that comes "bytewise first".)
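
A rough sketch of how such controls can be spotted, using only the bidi class that Python's unicodedata module reports per code point; the function name and the sample string are made up for illustration:

```python
import unicodedata

# Bidi classes of the explicit directional formatting characters
# (embeddings, overrides, isolates and their terminators).
BIDI_FORMAT_CLASSES = {
    "LRE", "RLE", "LRO", "RLO", "PDF",
    "LRI", "RLI", "FSI", "PDI",
}

def find_bidi_controls(text: str):
    """Return (index, 'U+XXXX', bidi class) for each formatting character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.bidirectional(ch))
        for i, ch in enumerate(text)
        if unicodedata.bidirectional(ch) in BIDI_FORMAT_CLASSES
    ]

# Hypothetical sample: "def" is stored after "abc", but a renderer that
# honours the right-to-left override displays it reversed.
sample = "abc\u202edef\u202c"
```

This only flags the controls; it says nothing about what a renderer will actually display.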

So now the IETF started squealing!, lost its towel!, and then
started running -- nude as it was!! -- to the Unicode consortium,
which, even though in practice commercial (unlike the IETF, to
which money is the rust on its noble sentiment), has the
necessary competence, because it has character set experts who
designed this over thirty years.  (And most of the elders are still
in there...)

I.e., i think of it as either [1] or [2], or .. even both!!
(Both!  Both!!)

  [1] https://www.youtube.com/watch?v=jVWDNq558AM
  [2] https://www.youtube.com/watch?v=5U319VzSqEU

This resulted in the following text:

  1.6.1.  Free-Form Text

     Properties having free-form text values MAY contain any valid
     sequence of Unicode characters encoded as a JSON string.  Such values
==start of BIDI
     can contain unidirectional left-to-right and right-to-left text, as
     well as bidirectional text using Unicode Directional Formatting
     Characters as described in Section 2 of [UBiDi].  Implementations
     setting bidirectional text MUST make sure that each property value
     complies with the requirements of the Unicode Bidirectional
     Algorithm.  Implementations MUST NOT assume that text values of
==end of BIDI
     adjacent properties are processed or displayed as a combined string;
     for example, the values of a given name component and a surname
     component may or may not be rendered together.
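
One assumed, deliberately strict reading of the "each property value" requirement is that no value leaves a directional embedding or isolate open, so that nothing leaks into an adjacent property. The sketch below checks only that; it is not the Unicode Bidirectional Algorithm, which itself tolerates unmatched terminators, and the function name is hypothetical:

```python
import unicodedata

EMBED_OPEN = {"LRE", "RLE", "LRO", "RLO"}   # each is closed by PDF
ISOLATE_OPEN = {"LRI", "RLI", "FSI"}        # each is closed by PDI

def bidi_controls_balanced(value: str) -> bool:
    """True if every directional initiator is closed, in order, within value."""
    stack = []
    for ch in value:
        cls = unicodedata.bidirectional(ch)
        if cls in EMBED_OPEN:
            stack.append("PDF")
        elif cls in ISOLATE_OPEN:
            stack.append("PDI")
        elif cls in ("PDF", "PDI"):
            # A terminator must match the most recent open initiator.
            if not stack or stack.pop() != cls:
                return False
    return not stack
```

A value that passes this check can be concatenated with its neighbours, or not, without its directional state spilling over.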

As can be seen, even though there is such a wide area of
problematic fields (say, for example, control characters and their
misinterpretation -- the devilish C0 control characters! -- but,
well, there are many more), the JSContact RFC definition indeed
covers only a small fraction of it.

My very personal view on all this is plain: the IETF should keep
its hands off Unicode.  This started with the IDNA that i "hate",
and, eh, goes on.  If you mean "it can be Unicode text", then just
refer to Unicode.  Find some definition that is "complete",
meaning grapheme or word boundary, make that an RFC maybe, and
then only point to that.

And if you do not want control characters, not even the visual
representation that Unicode has for control characters (just add
U+2400 for C0 controls), then define some "printable" meaning, and
use that.
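
Both ideas fit in a few lines of Python. The "printable" definition here (no code point whose general category starts with "C") is merely one assumed candidate, and both function names are made up:

```python
import unicodedata

def show_controls(text: str) -> str:
    """Replace C0 controls and DEL by their Control Pictures symbols."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x20:
            out.append(chr(0x2400 + cp))   # U+2400..U+241F mirror the C0 range
        elif cp == 0x7F:
            out.append("\u2421")           # SYMBOL FOR DELETE
        else:
            out.append(ch)
    return "".join(out)

def is_printable(text: str) -> bool:
    """Assumed definition: no code point of general category Cc, Cf, Cs, Co or Cn."""
    return all(not unicodedata.category(ch).startswith("C") for ch in text)
```

Note that this definition also rejects the directional formatting characters (category Cf), so whether it is the "right" one depends on what the property is meant to carry.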

P.S.: as far as i know most of the BIDI-vulnerable software stacks
(compilers, text editors, etc.) still do not comply with the very
complicated (last i looked) Unicode BIDI algorithm; they only
track the directional attribute of code points, and the general
directional marks, and count character cells.  (This, i would
think, is not possible with ISO C alone.)

 |Dale
 --End of <87ikz9e0jq.fsf@xxxxxxxxxxxxxxxxxxxxx>

Ciao from Germany,

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)
