[Last-Call] Re: [art] Artart telechat review of draft-ietf-jmap-contacts-09

Steffen Nurpmeso <steffen@xxxxxxxxxx> · Sat, 18 May 2024 23:48:08 +0200

Rob Sayre wrote in
 <CAChr6SzbiF_1hnqYQ5YqWVkB-Yfc1zFnPt1YG=i8qkfvZsP-fw@xxxxxxxxxxxxxx>:
 |On Sat, May 18, 2024 at 12:47 PM Rob Sayre <sayrer@xxxxxxxxx> wrote:
 |> RFC 20, section 4.1 (or some update) would do. But, I wonder if that
 |> part is right, because of the way JSContact is defined.
 |>
 |> https://www.rfc-editor.org/rfc/rfc9553.html#name-free-form-text
 |>
 |> That can certainly contain CR, LF, TAB, in common cases. But if you get
 |> BEL or SUB, that may be a problem.
 |
 |Also, of course Tim knows this one, but I forget to mention it. Wouldn't it
 |be nice to refer to a document instead of repeating this stuff?
 |
 |https://www.ietf.org/archive/id/draft-bray-unichars-08.html#name-control\
 |-codes

Quite honestly i think you two create some kind of "Me too"
environment for anyone, myself included, regarding this terrible
draft, which had an interesting IETF session i watched in parts
via web video.  It is simple psycho terror to repeat this over and
over again even thereafter, in my humble opinion.

BEL is not a problem at all if you simply define, for example, that
"only printable characters should be displayed"; you can use some
kind of isprint() series to implement that.
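
A minimal sketch of such a display rule, in Python rather than C for
brevity.  The policy is an assumption made for illustration, not
anything RFC 9553 or the draft specifies: drop every code point of
general category Cc (control) except TAB, LF and CR.  The function name
strip_controls is hypothetical.

```python
import unicodedata

def strip_controls(text: str, keep: str = "\t\n\r") -> str:
    """Drop control characters (category Cc) except those listed in keep.

    Illustrative policy only; a real profile would have to say what
    happens to format characters, unassigned code points, and so on.
    """
    return "".join(
        ch for ch in text
        if ch in keep or unicodedata.category(ch) != "Cc"
    )

# BEL (0x07) and SUB (0x1A) are removed, the trailing LF survives.
sample = "Alice\x07 Smith\x1a\n"
cleaned = strip_controls(sample)
```

With a rule of this kind BEL and SUB simply never reach the display,
whichever way the data arrived.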

I myself wonder whether that innocent RFC 9553 sentence

  any valid sequence of Unicode characters encoded as a JSON string

excludes surrogates?  It should, but it then actively changes the
meaning of "JSON string" to be a dedicated "sub-profile" of what
"JSON string" normally means, and then to me the sentence is not
clear enough.
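
To illustrate the gap: the JSON grammar of RFC 8259 accepts a \uXXXX
escape that names an unpaired surrogate, and Python's json module
decodes it without complaint.  The sketch below only demonstrates that
behaviour; the helper names has_lone_surrogate and utf8_encodable are
hypothetical, and rejecting such strings is an assumed policy, not
something RFC 9553 spells out.

```python
import json

def has_lone_surrogate(text: str) -> bool:
    """True if text contains a code point in U+D800..U+DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)

def utf8_encodable(text: str) -> bool:
    """True if text can be serialized as UTF-8."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False

# Syntactically valid JSON string holding an unpaired high surrogate.
lone = json.loads('"\\ud800"')

# A proper surrogate pair is combined into one code point (U+1F600).
pair = json.loads('"\\ud83d\\ude00"')
```

Here has_lone_surrogate(lone) is true and utf8_encodable(lone) is
false, so the value is a "JSON string" in the grammar sense without
being a sequence of Unicode scalar values.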

Anyhow a Unicode character is, according to Unicode

  Character. (1) The smallest component of written language that
  has semantic value; refers to the abstract meaning and/or shape,
  rather than a specific shape (see also glyph), though in code
  tables some form of visual representation is essential for the
  reader’s understanding. (2) Synonym for abstract character. (3)
  The basic unit of encoding for the Unicode character
  encoding. (4) The English name for the ideographic written
  elements of Chinese origin. [See ideograph (2).]

This seems not to mean entire grapheme clusters.  And to me this
seems to mean that the above RFC 9553 wording is massively
under-defined, because there are invisible/visible modifiers,
combining characters and more, most or all of which will fail
a simple "isprint" by themselves, so RFC 9553's

  Implementations MUST NOT assume that text values of adjacent
  properties are processed or displayed as a combined string; for
  example, the values of a given name component and a surname
  component may or may not be rendered together.

combined with

  any valid sequence of Unicode characters encoded as a JSON string

does not make sense at all.
Which is where i see the real problem, not in a BEL or SUB.
And it reminds me of the "define a profile", "define a profile" that
was heard in that IETF session.
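
A small sketch of the mismatch between code points and what a user
sees, using only Python's standard library.  Which code points a
per-code-point test rejects depends on the implementation; the code
below shows what Python's str.isprintable() and unicodedata report,
which need not match a C isprint()/iswprint() in a given locale.  The
helper name describe is hypothetical.

```python
import unicodedata

# One visible "family" emoji: man, ZWJ, woman, ZWJ, girl.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

# "e" followed by COMBINING ACUTE ACCENT; renders as one letter.
decomposed = "e\u0301"

def describe(text: str):
    """List (code point, general category, isprintable) per code point."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), ch.isprintable())
        for ch in text
    ]

family_info = describe(family)
accent_info = describe(decomposed)
```

The ZERO WIDTH JOINER (category Cf) is reported as not printable here,
so a filter that drops "non-printable" code points one by one would
split the single emoji into three.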

Granted, the above is not implementable with normal ISO C or POSIX
tools anyway; at most one could use iconv(3) to convert the
string, and then "terminate" it via iconv, whatever that means
(placing a reset sequence, likely).
You surely have to go to ICU, and you possibly want to [..looking
things up via web..] read

  http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries

and then use

  https://unicode-org.github.io/icu/userguide/boundaryanalysis/

to split at character (grapheme!), word or line boundaries.
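
To give an idea of what such boundary analysis does, here is a
deliberately simplified sketch in Python.  It is not UAX #29 and not
ICU: it only keeps combining marks (general category M*) and ZERO
WIDTH JOINER sequences attached to the preceding code point, and it
ignores Hangul, regional indicators, CRLF and everything else the real
rules cover.  The function name rough_clusters is hypothetical; real
code would use ICU's character break iterator instead.

```python
import unicodedata

ZWJ = "\u200d"

def rough_clusters(text: str) -> list[str]:
    """Group code points into approximate user-perceived characters."""
    clusters: list[str] = []
    for ch in text:
        # Marks and ZWJ extend the cluster before them.
        is_extender = unicodedata.category(ch).startswith("M") or ch == ZWJ
        # Whatever follows a ZWJ stays in the same cluster as well.
        after_joiner = bool(clusters) and clusters[-1].endswith(ZWJ)
        if clusters and (is_extender or after_joiner):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

name_parts = rough_clusters("Zoe\u0308")
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
family_parts = rough_clusters(family)
```

Truncating or splitting a text value between two entries of such a
list is safe in this simplified model; splitting inside one is not.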

There is also

  https://unicode-org.github.io/icu/userguide/strings/properties.html

Btw

  https://unicode-org.github.io/icu/userguide/strings/utext.html

has a nice word break example.

In all this context, which leaves programmers standing completely
alone in the rain, my gut says that remarks on devilish BELs
compare like "the piper at the gates of dawn" to "the division
bell".

Sorry.

--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)
