Rob Sayre wrote in
 <CAChr6SzbiF_1hnqYQ5YqWVkB-Yfc1zFnPt1YG=i8qkfvZsP-fw@xxxxxxxxxxxxxx>:
|On Sat, May 18, 2024 at 12:47 PM Rob Sayre <sayrer@xxxxxxxxx> wrote:
|> RFC 20, section 4.1 (or some update) would do. But, I wonder if that
|> part is right, because of the way JSContact is defined.
|>
|> https://www.rfc-editor.org/rfc/rfc9553.html#name-free-form-text
|>
|> That can certainly contain CR, LF, TAB, in common cases. But if you get
|> BEL or SUB, that may be a problem.
|
|Also, of course Tim knows this one, but I forget to mention it. Wouldn't it
|be nice to refer to a document instead of repeating this stuff?
|
|https://www.ietf.org/archive/id/draft-bray-unichars-08.html#name-control-codes

Quite honestly, i think you two create some kind of "Me too"
environment for anyone, myself included, regarding this terrible
draft, which had an interesting IETF session i watched in parts via
web video. It is simple psycho terror to repeat this over and over
again even thereafter, in my humble opinion.

BEL is not a problem at all if you simply define, for example, that
"only printable characters should be displayed"; you can use some
kind of isprint() series to realize that.

I myself wonder whether that innocent RFC 9553 sentence

  any valid sequence of Unicode characters encoded as a JSON string

excludes surrogates? It should, but it then actively changes the
meaning of "JSON string" to be a dedicated "sub-profile" of what
"JSON string" normally means, and then the sentence is not clear
enough to me.

Anyhow, a Unicode character is, according to Unicode,

  Character. (1) The smallest component of written language that has
  semantic value; refers to the abstract meaning and/or shape, rather
  than a specific shape (see also glyph), though in code tables some
  form of visual representation is essential for the reader’s
  understanding.
  (2) Synonym for abstract character.
  (3) The basic unit of encoding for the Unicode character encoding.
  (4) The English name for the ideographic written elements of
  Chinese origin. [See ideograph (2).]

This seems not to mean entire grapheme clusters. And this seems to
me to mean that the above RFC 9553 meaning is massively
under-defined, because there are invisible/visible modifiers,
combining characters and more, most or all of which will fail
a simple "isprint" by themselves, so RFC 9553's

  Implementations MUST NOT assume that text values of adjacent
  properties are processed or displayed as a combined string; for
  example, the values of a given name component and a surname
  component may or may not be rendered together.

combined with

  any valid sequence of Unicode characters encoded as a JSON string

does not make sense at all. Which is where i see the real problem,
not in a BEL or SUB. And it reminds me of the "define a profile",
"define a profile" that was heard in that IETF session.

Granted, the above is not implementable with normal ISO C or POSIX
tools anyhow; at most one could use iconv(3) to convert the string,
and then "terminate" it via iconv, whatever that means (placing
a reset sequence, likely). You surely have to go to ICU, and you
possibly want to [..looking things up via web..] read

  http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries

and then use

  https://unicode-org.github.io/icu/userguide/boundaryanalysis/

to split at character (grapheme!), word or line boundaries.
There is also

  https://unicode-org.github.io/icu/userguide/strings/properties.html

Btw

  https://unicode-org.github.io/icu/userguide/strings/utext.html

has a nice word break example.

In all this context, which leaves programmers standing completely
alone in the rain, my gut says that remarks on devilish BELs compare
like "The Piper at the Gates of Dawn" to "The Division Bell". Sorry.
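
To make the points above a bit more concrete, here is a minimal
sketch in Python (chosen only for brevity, it is neither ISO C nor
ICU). It shows that a JSON string can decode to a lone surrogate,
that control codes like BEL and SUB are easy to filter, and that
a value may begin with a combining mark. All function names and the
ALLOWED_CONTROLS policy are assumptions made for illustration; none
of this is defined by RFC 9553, and it does not attempt grapheme
cluster segmentation per UAX #29.

```python
import json
import unicodedata

# Assumption: CR, LF and TAB are the only control codes worth keeping
# in free-form text; this policy is illustrative, not from RFC 9553.
ALLOWED_CONTROLS = {"\t", "\n", "\r"}


def has_lone_surrogate(text: str) -> bool:
    """Return True if text contains a code point in U+D800..U+DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


def strip_unwanted_controls(text: str) -> str:
    """Drop general category Cc characters (BEL, SUB, ...) except
    those listed in ALLOWED_CONTROLS."""
    return "".join(
        ch for ch in text
        if ch in ALLOWED_CONTROLS or unicodedata.category(ch) != "Cc"
    )


def is_combining_mark(ch: str) -> bool:
    """Return True for general categories Mn, Mc and Me."""
    return unicodedata.category(ch).startswith("M")


def starts_with_combining_mark(value: str) -> bool:
    """Return True if value begins with a mark, so that its rendering
    depends on whatever text happens to be displayed before it."""
    return bool(value) and is_combining_mark(value[0])


if __name__ == "__main__":
    # Python's json module accepts an unpaired \ud800 escape and hands
    # back a str that contains a lone surrogate.
    decoded = json.loads('"\\ud800"')
    print(has_lone_surrogate(decoded))
    print(repr(strip_unwanted_controls("ring\x07 sub\x1a ok\r\n")))
    print(starts_with_combining_mark("\u0301abc"))
```

The control code filter is the easy part. Deciding what to do with
a surname value that starts with U+0301 is not, and that decision
needs the boundary analysis referenced above.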
--steffen
|
|Der Kragenbaer,                The moon bear,
|der holt sich munter           he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)

-- 
last-call mailing list -- last-call@xxxxxxxx
To unsubscribe send an email to last-call-leave@xxxxxxxx