On May 19, 2024 at 6:26:17 PM, Dale R. Worley <worley@xxxxxxxxxxx> wrote:
> But in this case, looking at RFC 4627 sec. 2.5, "Strings", it's clear
> (though not directly stated) that a JSON string representation will be a
> sequence of ASCII characters that represent a sequence of Unicode
> characters. So the limitation in this draft to "Unicode characters"
> matches what the definition of JSON allows, and as such there is no
> subsetting.
RFC 4627 is obsolete; the currently operative specification of JSON is RFC 8259 (disclosure: editor), from which:
    char = unescaped /
        escape (
            %x22 /          ; "    quotation mark  U+0022
            %x5C /          ; \    reverse solidus U+005C
            %x2F /          ; /    solidus         U+002F
            %x62 /          ; b    backspace       U+0008
            %x66 /          ; f    form feed       U+000C
            %x6E /          ; n    line feed       U+000A
            %x72 /          ; r    carriage return U+000D
            %x74 /          ; t    tab             U+0009
            %x75 4HEXDIG )  ; uXXXX                U+XXXX

    escape = %x5C              ; \
    quotation-mark = %x22      ; "
    unescaped = %x20-21 / %x23-5B / %x5D-10FFFF
Note the values in “unescaped”: the range %x5D-10FFFF includes U+D800 through U+DFFF, so surrogates, including naked unpaired surrogates, are clearly allowed. Yes, that is damaging and dumb. It’s too late to change it, though, which is why I-JSON exists; see RFC 7493 (disclosure: editor), from which:
    Object member names, and string values in arrays and object members,
    MUST NOT include code points that identify Surrogates or Noncharacters
    as defined by [UNICODE]. This applies both to characters encoded
    directly in UTF-8 and to those which are escaped; thus, "\uDEAD" is
    invalid because it is an unpaired surrogate, while "\uD800\uDEAD"
    would be legal.
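
To make the difference concrete, here is a minimal sketch in Python. It assumes the standard-library json module as implemented in CPython, which accepts an escaped lone surrogate rather than rejecting it. The helper names (is_surrogate, is_noncharacter, ijson_string_ok) are hypothetical, invented for this illustration; they are not taken from either RFC or from any library, and the check covers a single decoded string rather than a whole document.

```python
import json


def is_surrogate(cp: int) -> bool:
    # Surrogate code points occupy U+D800 through U+DFFF.
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp: int) -> bool:
    # Unicode noncharacters: U+FDD0..U+FDEF, plus the last two code
    # points of every plane (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ...).
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def ijson_string_ok(s: str) -> bool:
    # Hypothetical helper: True if the decoded string satisfies the
    # RFC 7493 rule quoted above (no surrogates, no noncharacters).
    return not any(
        is_surrogate(ord(c)) or is_noncharacter(ord(c)) for c in s
    )


# Assumption: CPython's json.loads accepts an escaped unpaired surrogate.
lone = json.loads('"\\uDEAD"')

# A well-formed surrogate pair is decoded to one supplementary code point.
pair = json.loads('"\\uD800\\uDEAD"')

print(ijson_string_ok(lone))  # the unpaired surrogate fails the check
print(ijson_string_ok(pair))  # the paired form passes
```

A parser that only follows the RFC 8259 grammar has no reason to refuse the first string; a check along these lines is what an I-JSON producer or consumer would need to add on top.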
-- 
last-call mailing list -- last-call@xxxxxxxx
To unsubscribe send an email to last-call-leave@xxxxxxxx