[Last-Call] Re: [art] Re: Artart telechat review of draft-ietf-jmap-contacts-09

worley@xxxxxxxxxxx (Dale R. Worley) · Sun, 19 May 2024 21:26:17 -0400

Steffen Nurpmeso <steffen@xxxxxxxxxx> writes:

I don't know what the larger problems might be with
draft-ietf-jmap-contacts-09, but I think there is less trouble with this
particular point than first appears:

> I myself wonder whether that innocent RFC 9553 sentence
>
>   any valid sequence of Unicode characters encoded as a JSON string
>
> excludes surrogates?

It definitely does, because within the Unicode lexicon, a "surrogate" is
a code point, but not a code point that is assigned to a "character".
Thus surrogates are not "characters" and cannot be members of a "valid
sequence of Unicode characters".  I haven't found a single definitive
statement of this, but it is clear from both
https://en.wikipedia.org/wiki/Unicode#Architecture_and_terminology and
https://www.unicode.org/versions/Unicode15.1.0/ch02.pdf .
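
As a rough illustration of the surrogate point, here is a minimal Python
sketch.  It assumes that "valid sequence of Unicode characters" can be
approximated as "contains no code point of general category Cs"; the
helper names is_surrogate and is_valid_character_sequence are
hypothetical and come from neither RFC 9553 nor the draft.

```python
import unicodedata


def is_surrogate(ch: str) -> bool:
    """Return True if the code point has general category Cs (surrogate)."""
    return unicodedata.category(ch) == "Cs"


def is_valid_character_sequence(s: str) -> bool:
    """Hypothetical check: reject any string containing a surrogate code point."""
    return not any(is_surrogate(ch) for ch in s)


lone = "\ud800"  # a lone high surrogate: a code point, but not a character
print(is_surrogate(lone))
print(is_valid_character_sequence("a" + lone))

# Python also refuses to serialize a lone surrogate as UTF-8.
try:
    lone.encode("utf-8")
except UnicodeEncodeError as exc:
    print("cannot encode:", exc.reason)
```

The check deliberately looks only at surrogates; it says nothing about
unassigned code points or noncharacters, which are separate questions.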

Note that Unicode's "character" can be a bit messy.  E.g. "lower case a
with umlaut" can be either a single "character" U+00E4 or two
"characters", U+0061 followed by the combining dieresis U+0308.  Or for
a particularly hairy ligature in one of the Brahmic scripts, see figure
2-3 in the Unicode document I linked to above, which combines no less
than 6 "characters" into one rendered glyph.
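
The two spellings of "a with umlaut" can be compared directly.  The
sketch below uses Python's standard unicodedata module; the variable
names are only illustrative.

```python
import unicodedata

precomposed = "\u00e4"        # single code point: a with diaeresis
decomposed = "\u0061\u0308"   # "a" followed by combining diaeresis

# Different numbers of "characters" (code points) ...
print(len(precomposed), len(decomposed))   # 1 2

# ... and not equal as raw code point sequences ...
print(precomposed == decomposed)           # False

# ... but canonically equivalent once normalized.
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)   # True
```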

> It should, but it then actively changes the
> meaning of "JSON string" to be a dedicated "sub-profile" of what
> "JSON string" normally means, and then to me the sentence is not
> clear enough.

In principle, you don't need to *define* a profile (sub-specification) of
JSON to say e.g. "the thing must be a JSON string encoding of a sequence
of ASCII letters", though of course in that case the set of "things"
*will be* only a subset of JSON string encodings.

But in this case, looking at RFC 4627 sec. 2.5, "Strings", it's clear
(though not directly stated) that a JSON string representation will be a
sequence of ASCII characters that represent a sequence of Unicode
characters.  So the limitation in this draft to "Unicode characters"
matches what the definition of JSON allows, and as such there is no
subsetting.
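
To make the "ASCII characters that represent Unicode characters" reading
concrete, here is a small sketch using Python's standard json module.
It relies on json.dumps defaulting to ensure_ascii=True, which is a
property of that library and not something JSON itself requires; the
sample text is arbitrary.

```python
import json

# Two Unicode characters: U+00E4 and U+1F600 (outside the BMP).
text = "\u00e4\U0001F600"

# With the default ensure_ascii=True, the JSON text is pure ASCII.
encoded = json.dumps(text)
print(encoded)            # "\u00e4\ud83d\ude00"
print(encoded.isascii())  # True

# The \ud83d\ude00 escape pair is only an encoding device in the JSON
# text; decoding yields the original two characters, no surrogates.
decoded = json.loads(encoded)
print(decoded == text, len(decoded))  # True 2
```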

> This seems not to mean entire grapheme clusters.  And this seems
> to mean to me that the above RFC 9553 meaning is massively
> under-defined, because there are invisible/visible modifiers,
> combining characters and more, most or all of which will fail
> a simple "isprint" by themselves, so RFC 9553's
>
>   Implementations MUST NOT assume that text values of adjacent
>   properties are processed or displayed as a combined string; for
>   example, the values of a given name component and a surname
>   component may or may not be rendered together.
>
> combined with
>
>   any valid sequence of Unicode characters encoded as a JSON string
>
> does not make sense at all.

I think that's incorrect, because there's no requirement that a Unicode
character pass an "isprint" test.  The Unicode "general category"
attribute for characters/code points has values like "other, control"
and "other, format"; code points in those categories are specified as
"characters" even though they're not "printable" in the ordinary sense.
See https://en.wikipedia.org/wiki/Unicode#General_Category_property
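
A quick way to see this is to ask Python for the general category of a
few code points.  Note that str.isprintable() is Python's own notion of
printability and only stands in here for a C-style "isprint"; the
choice of sample code points is arbitrary.

```python
import unicodedata

samples = {
    "BEL": "\u0007",       # general category Cc ("other, control")
    "ZWJ": "\u200d",       # general category Cf ("other, format")
    "letter a": "a",       # general category Ll (lowercase letter)
}

for name, ch in samples.items():
    # Print the name, the two-letter general category, and printability.
    print(name, unicodedata.category(ch), ch.isprintable())
```

The first two are assigned characters that nonetheless report as
non-printable, which is the situation described above.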

Dale
