[Last-Call] Re: Last Call: <draft-bray-unichars-10.txt> (Unicode Character Repertoire Subsets) to Proposed Standard

Peter Saint-Andre <stpeter@xxxxxxxxxx> · Sun, 16 Feb 2025 17:05:40 -0700

Hi John,

I think the distinctions and suggestions you offer here could lead us in 
a positive direction.

Revisiting draft-bray-unichars just now, I notice that Section 5 ("Using 
Subsets") provides a sort of applicability statement. In particular, it 
talks about "well-known data formats such as JSON, I-JSON, CBOR, YAML, 
and XML" and states:

   A protocol based on JSON can be made more robust and implementor-
   friendly by restricting the contents of object member names and
   string values to one of the subsets described in Section 4.
   Equivalent restrictions are possible for other packaging formats such
   as I-JSON, XML, YAML, and CBOR.

As I see it, a "packaging format" (seeimingly equivalent to "container 
format") provides a context for structured data, but that data (say, 
object member names, as mentioned above) can be quite different from the 
"addresses and identifiers" mentioned in the IDNA and PRECIS specs. 
Pursuing the packaging analogy, we could say that the IDNA/PRECIS specs 
try to provide rules for the addresses that go on the outside of the box 
so that the package gets delivered to the intended recipient, whereas 
specs for JSON and XML and other packaging formats try to provide rules 
for the stuff that goes inside the box.

If this line of thinking is valid, then it seems that the focus of this 
document is quite different from the focus of the IDNA/PRECIS specs. 
Your distinction between processing and transmission might help here 
(although one could argue that object member names and arbitrary string 
values get processed, too). The example of XMPP might be instructive, 
since it is a streaming XML protocol: it uses IDNA and PRECIS to 
properly construct and compare addresses and identifiers for purposes of 
application-layer routing (etc.), but also uses XML as a packaging 
format or container format for all sorts of structured data related to 
buddy list management, data syndication, session management, etc. In the 
somewhat vague language of RFC 8264, the addresses and identifiers go 
into protocol slots whereas other sorts of data go into the "contents" 
of the protocol interactions.

Perhaps we can find clearer, more definitive terminology for these two 
different sorts of things.

Peter

On 2/16/25 4:03 PM, John C Klensin wrote:
Tim,

Thanks for this note.  It gives me a better understanding of your
motivation, something that, IMO, should be more clear in the
document.  Let me make some observations that may clarify the
difference in perspective for others and that might suggest some
adjustments in the document.  You (and the community) may not go
where this note suggests, but it might be worth considering as a
perspective.   It might also help with details about the container /
protocol slot distinction Peter asked about.  Or not.

I assume you and Paul know everything in the next few paragraphs, but
for context and for others evaluating the document and the various
reviews...

I think it is clear, at least to anyone who has spent even a bit of
time studying the issues, that composing strings for others to use
(especially in comparison with things they already have or know)
presents huge opportunities for mishap, mischief, or worse. We
usually assume that identifiers are more susceptible to problems than
free-text strings but maybe that is just because perceiving and using
one identifier when another was intended is likely to have worse
consequences than "just" misreading a string.  Homographs are the
most publicized of the problems, but, depending on the character set
encoding and choices of display type families ("fonts"), anyone who
has confused a digit "1" character with a lower-case "L" should
understand they can occur even if  the repertoire is limited to
ASCII.  Whether homographs are more important than other potential
problems is debatable, but the outcome of that debate depends on
circumstances.

PRECIS, IDNA2008 (for the rather specific IDN case), the ICANN MSR
and LGR work (for even more specific cases), and parts of the W3C and
Unicode specs are all focused on minimizing the risks of those
problems.  Put more positively, those specs are about increasing the
odds that a string chosen by a naive person for use as an identifier
will perform well and that text strings transmitted by one human to
another will be seen, heard, and understood as intended.   As with
the trivial "1"/"l" problem (or opportunity from malicious behavior),
no collection of rules or list of code points is going to provide
complete protection.  A certain amount of educated human judgment is
going to be needed and, yes, that is going to require more effort and
understanding than a simple, minimal, list of code points to be
avoided... just as learning to competently use a different language
or writing system does.

In the terms of your note below, is "better than default" useful?
Perhaps, but we need to ask the question, as to whether, by
introducing even more options and, more important, giving people a
good excuse to avoid more focused specifications in cases where those
specifications are likely to be important, it wouldn't cause more
harm than good.

Is that going to be hard for some people?  Yes.  But routing is hard
and we don't tell people setting up routing procedures "oh, just let
it go, ignore those complex rules about packet formats and their
meaning, put the packets out there and hope for the best".    Nor do
we consider simple substitution cyphers appropriate where
confidentiality is needed even though they are lots easier for
non-experts to understand.  Interoperability is often hard.  The IETF
has usually decided it is worth it.

The above is the argument for the more complex procedures of those
other specs and even for easier measures like doing case folding and
normalization on string comparison when those are appropriate and
used with the understanding that there are edge cases and that they
are not complete solutions to all possible issues.   It that sounds
like I'm skeptical whether, in actual practice, there are such things
as "opaque record keys or URIs or enum values" and whether, if there
were, whether even the level of protection these named subsets would
offer would be worth the trouble, yes, that is probably the case.

While all of the above are arguments against publishing this document
in its present form, there is another way to look at the problem. It
is the way that, of necessity, programming languages and some data
representation languages look at things.  XML, for example, cannot
have a PRECIS-style set of restrictions on the characters that can be
represented.  If it did, we wouldn't be able to write an RFC that
described and illustrated some of the problem cases except, maybe, as
images.  Similarly, for the network, we could adopt something of a
"recipient (or string user, interpreter, etc.) beware" model in which
the only job of a network protocol or data structure is simply to
ensure that whatever goes in comes out again.  As the I-D more or
less points out, where Unicode is involved that requires more than
simply storing (or sending) the bits.  It involves avoiding a
collection of invalid code points and sequences at the coding level
because, if they are stored and then interpreted by something other
than the originating system(s), something odd might reasonably be
expected to happen. Viewed from that perspective, the I-D does a
fairly good job of identifying and explaining the most important
problem cases.

That distinction could be described as the difference between strings
that are appropriate for processing and strings that are only
appropriate for transmission.  In that context, I'm not sure what a
"container format" is and whether that help. My comment about too
many standards may be part of an argument against people inventing
their own rules on a per-protocol, or similar, basis even if we could
magically make that easy.

While I will never, personally, be a big fan of "recipient beware" (a
generalization of "buyer beware"), if the focus is on what can be
transmitted or stored safely with the assumption that the specific
language and script issues with which PRECIS and many W3C specs are
concerned are Someone Else's Problem, then a modified version of the
document probably has value.  What modifications would that require?

* Make it very clear that this spec is about transmission and storage
and that it does not address the topics and issues needed to _use_
the character strings (especially if they might be treated as
identifiers) or even what should be stored or transmitted, only what
is need to safely transmit and store them.

* Get rid of the PRECIS profiles.  PRECIS addresses a different
problem and the main arguments against these subsets as PRECIS
profiles are that they are not consistent with solutions to that
problem as well as violating the direction against addition of new
profiles without strong justification (and, I would add,
justification in the PRECIS context).  Saying something about the
difference between the PRECIS (e.g.) topic area and this spec would
help answer the question of why this spec is useful.  That does not
mean you should not mention PRECIS, nor that we should avoid
considering a reference to this spec for mention in a future PRECIS
revision or update.  Indeed, while it would strike me as a bit
strange, I would not have a strong objection to this document
updating PRECIS to specify the cross-reference.

* Put in an explicit warning that these subsets/ profiles may not be
sufficient if strings are going to be compared or matched to others
and perhaps not even if they are to be displayed if having the reader
see what the originator intended is important... unless information
is transmitted along with them that permits the recipient system or
reader to sort those things out.  The latter is important and may
deserve mention: I don't think it is where you want to go but, if the
document specified that any string transmitted using one of its
subsets/ profiles must be transmitted along with language, locale,
and directionality information, it would at least reduce the
perception of risk.

* Include some pointers, starting with one to PRECIS, about more
processing-oriented specs for those that need them.

* Rethink having three subsets rather than one and/or explain better
why one would want to choose any but the most restrictive.  The fact
that some programming, data structure, or specification language
allows things is not good enough: for Internet purposes, they might
just be wrong.

* For private use codepoints, if the conclusion is to retain them as
allowed, add some text strongly suggesting that any protocol (or
"container") that allows such codepoints make provision for
identifying the agreement that specifies their interpretation.  I
could even see an IANA registry of profiles for private-use code
point collections unless those, somehow, needed to be secret.

Finally and FWIW, I don't buy "they don't seem to be adopting PRECIS"
as a helpful argument in this case because an equally good
explanation for that behavior is that we have not done a good enough
job of explaining why PRECIS (or specs with even more specific
detail) are important.  Instead, we have said things that sound to
the naive or lazy reader like "it is ok to use whatever XML, JSON,
etc., allow" with the implied "why go to the extra work".  If the
IESG were to require --as Section 6 of RFC 2277 (BCP 18) specifies --
that every specification involving the use of non-ASCII characters
contain an "Internationalization Considerations" section and, if the
spec did not point to PRECIS, that it explained why not and,
presumably, what was being done instead, we would probably be seeing
a different adoption pattern or would understand why not.

--On Thursday, February 13, 2025 12:46 -0800 Tim Bray
<tbray@xxxxxxxxxxxxxx> wrote:

  Not going to go point-by-point through John's lengthy missive
and the following correspondence, but the following are worth
addressing:

    1. Why Unichars exists

People make new container formats. When they do that, they have to
decide what their character repertoire is.  For a general-purpose
low-level container format, PRECIS is a hard sell; it's
complicated, 43 pages, many references, full of discussions of
history, and contains a bunch of policy choices that people are
reluctant to adopt without understanding. [Example: Unassigned code
points are disallowed.]  I'm not saying it's bad, just that
empirically, people devising new container formats tend not to
adopt it.

Other people build new protocols around JSON or CBOR or whatever
and, unless they think about it, end up willing to accept those
format's default character subsets, which is bad.  However, just
like people devising formats, empirically they don't seem to be
adopting PRECIS.

So, Unichars tries to do two things: First, Offer a succinct
self-contained discussion of the issues with Unicode code points
and why some are "problematic". Second, provide
better-than-default candidate subsets which are easy to understand,
specify, and support, and avoid problematic code points.

As John points out, there are many other discussions of these
issues from various places. But I don't think there's anything
else around that offers these two benefits, and the process of
developing Unichars has not weakened that opinion.

    2. Unichars PRECIS profiles

I actually do see values in the PRECIS profiles.  Specifically, two
things.

First, the 3 Unichars subsets aim to cover the basic things that
people who are designing low-level container formats and protocols
should think about. In that context there are going to be cases
where they quite reasonably won't want to impose PRECIS's two
existing protocols on their producers or consumers, and would be
glad to use Unichars Assignables or some such. But, in those cases,
I think Unichars ought to at least mention the existence of PRECIS
- if they decide that the content of some particular field has a
high chance of being displayed for consumption by humans, they
should really at least look at PRECIS to inform their decisions
about what to do.

The second reason is the mirror image. Someone who hears that PRECIS
constitutes the IETF rules for Unicode, and realize that some
fields are just opaque record keys or URIs or enum values, and
question why they need the cost and complexity of enforcing PRECIS,
it'd be nice if there were a pointer to Unichars in the registry
so they know there's a simpler option.

So on balance I'd keep the registrations. But if I'm in the
minority on this one I will cheerfully yield without much further
argument.

    3. Private Use areas

I searched the correspondence and couldn't find the discussion
behind this one. My recollection is that someone argued strongly
that Unicode says these code points *are considered to be assigned*
(I checked, and indeed it does say that) and that there might well
be scenarios where they are used as intended as part of the
protocol definition, so they shouldn't be seen as problematic.

Once again, this is not a hill I would die on; if consensus is that
PUA code points should be classified as "problematic", OK.

    - T

--
last-call mailing list -- last-call@xxxxxxxx
To unsubscribe send an email to last-call-leave@xxxxxxxx