[Last-Call] Re: Last Call: <draft-bray-unichars-10.txt> (Unicode Character Repertoire Subsets) to Proposed Standard

Tim,

Thanks for this note.  It gives me a better understanding of your
motivation, something that, IMO, should be clearer in the
document.  Let me make some observations that may clarify the
difference in perspective for others and that might suggest some
adjustments to the document.  You (and the community) may not go
where this note suggests, but it might be worth considering as a
perspective.  It might also help with details about the container /
protocol slot distinction Peter asked about.  Or not.

I assume you and Paul know everything in the next few paragraphs, but
for context and for others evaluating the document and the various
reviews...

I think it is clear, at least to anyone who has spent even a bit of
time studying the issues, that composing strings for others to use
(especially in comparison with things they already have or know)
presents huge opportunities for mishap, mischief, or worse.  We
usually assume that identifiers are more susceptible to problems than
free-text strings, but maybe that is just because perceiving and using
one identifier when another was intended is likely to have worse
consequences than "just" misreading a string.  Homographs are the
most publicized of the problems but, depending on the character set
encoding and choices of display type families ("fonts"), anyone who
has confused the digit "1" with a lower-case "L" should understand
that such confusions can occur even if the repertoire is limited to
ASCII.  Whether homographs are more important than other potential
problems is debatable, but the outcome of that debate depends on
circumstances.
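
To make the homograph point concrete, a few lines of Python (my
illustration, not from the document): two strings that render
identically in many fonts are nonetheless different code point
sequences, so naive comparison, matching, and lookup all treat them
as unrelated.

    # "paypal" spelled entirely with ASCII, versus the same string with
    # CYRILLIC SMALL LETTER A (U+0430) substituted for the first "a".
    latin = "paypal"
    mixed = "p\u0430ypal"

    print(latin == mixed)                    # False, despite identical rendering
    print([hex(ord(c)) for c in latin])      # ['0x70', '0x61', '0x79', ...]
    print([hex(ord(c)) for c in mixed])      # ['0x70', '0x430', '0x79', ...]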

PRECIS, IDNA2008 (for the rather specific IDN case), the ICANN MSR
and LGR work (for even more specific cases), and parts of the W3C and
Unicode specs are all focused on minimizing the risks of those
problems.  Put more positively, those specs are about increasing the
odds that a string chosen by a naive person for use as an identifier
will perform well and that text strings transmitted by one human to
another will be seen, heard, and understood as intended.  As with
the trivial "1"/"l" problem (or opportunity for malicious behavior),
no collection of rules or list of code points is going to provide
complete protection.  A certain amount of educated human judgment is
going to be needed and, yes, that is going to require more effort and
understanding than a simple, minimal list of code points to be
avoided... just as learning to competently use a different language
or writing system does.

In the terms of your note below, is "better than default" useful?
Perhaps, but we need to ask whether, by introducing even more options
and, more important, by giving people a good excuse to avoid more
focused specifications in cases where those specifications are likely
to be important, it would cause more harm than good.

Is that going to be hard for some people?  Yes.  But routing is hard
and we don't tell people setting up routing procedures "oh, just let
it go, ignore those complex rules about packet formats and their
meaning, put the packets out there and hope for the best".  Nor do
we consider simple substitution ciphers appropriate where
confidentiality is needed, even though they are much easier for
non-experts to understand.  Interoperability is often hard.  The IETF
has usually decided it is worth it.

The above is the argument for the more complex procedures of those
other specs and even for easier measures like doing case folding and
normalization on string comparison, when those are appropriate and
used with the understanding that there are edge cases and that they
are not complete solutions to all possible issues.  If that sounds
like I'm skeptical about whether, in actual practice, there are such
things as "opaque record keys or URIs or enum values" and, if there
were, whether even the level of protection these named subsets would
offer would be worth the trouble, yes, that is probably the case.
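
For the "easier measures" mentioned above, a minimal Python sketch
(again mine, not anyone's specification) of comparison after NFC
normalization and case folding.  As the caveats above say, it handles
some edge cases and is not a complete solution.

    import unicodedata

    def loose_equal(a: str, b: str) -> bool:
        """Compare after NFC normalization and case folding.  Catches
        precomposed-vs-combining and case variants; it is not a
        complete solution to all possible issues."""
        fold = lambda s: unicodedata.normalize("NFC", s).casefold()
        return fold(a) == fold(b)

    # U+00E9 (precomposed e-acute) vs. "e" + U+0301 (combining acute):
    # different code point sequences, same abstract text.
    print("caf\u00e9" == "cafe\u0301")             # False
    print(loose_equal("caf\u00e9", "cafe\u0301"))  # True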

While all of the above are arguments against publishing this document
in its present form, there is another way to look at the problem. It
is the way that, of necessity, programming languages and some data
representation languages look at things.  XML, for example, cannot
have a PRECIS-style set of restrictions on the characters that can be
represented.  If it did, we wouldn't be able to write an RFC that
described and illustrated some of the problem cases except, maybe, as
images.  Similarly, for the network, we could adopt something of a
"recipient (or string user, interpreter, etc.) beware" model in which
the only job of a network protocol or data structure is simply to
ensure that whatever goes in comes out again.  As the I-D more or
less points out, where Unicode is involved that requires more than
simply storing (or sending) the bits.  It involves avoiding a
collection of invalid code points and sequences at the coding level
because, if they are stored and then interpreted by something other
than the originating system(s), something odd might reasonably be
expected to happen. Viewed from that perspective, the I-D does a
fairly good job of identifying and explaining the most important
problem cases.  
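
For readers who want that concrete, here is a rough Python sketch of
the kind of coding-level filter being described.  The ranges below
are my paraphrase of the I-D's most restrictive subset and should be
checked against its actual tables, not treated as normative.

    def transmission_safe(s: str) -> bool:
        """Reject code points flagged as problematic at the coding level."""
        for ch in s:
            cp = ord(ch)
            if 0xD800 <= cp <= 0xDFFF:            # surrogates
                return False
            if 0xFDD0 <= cp <= 0xFDEF:            # noncharacters, first block
                return False
            if (cp & 0xFFFE) == 0xFFFE:           # U+xxFFFE / U+xxFFFF noncharacters
                return False
            if cp < 0x20 and ch not in "\t\n\r":  # C0 controls other than tab/LF/CR
                return False
            if 0x7F <= cp <= 0x9F:                # DEL and the C1 controls
                return False
        return True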

That distinction could be described as the difference between strings
that are appropriate for processing and strings that are only
appropriate for transmission.  In that context, I'm not sure what a
"container format" is or whether that distinction helps.  My comment
about too many standards may be part of an argument against people
inventing their own rules on a per-protocol, or similar, basis even
if we could magically make that easy.

While I will never, personally, be a big fan of "recipient beware" (a
generalization of "buyer beware"), if the focus is on what can be
transmitted or stored safely with the assumption that the specific
language and script issues with which PRECIS and many W3C specs are
concerned are Someone Else's Problem, then a modified version of the
document probably has value.  What modifications would that require?

* Make it very clear that this spec is about transmission and storage
and that it does not address the topics and issues needed to _use_
the character strings (especially if they might be treated as
identifiers) or even what should be stored or transmitted, only what
is needed to safely transmit and store them.

* Get rid of the PRECIS profiles.  PRECIS addresses a different
problem, and the main arguments against these subsets as PRECIS
profiles are that they are not consistent with solutions to that
problem and that they violate the direction against adding new
profiles without strong justification (and, I would add,
justification in the PRECIS context).  Saying something about the
difference between the PRECIS (e.g.) topic area and this spec would
help answer the question of why this spec is useful.  That does not
mean you should not mention PRECIS, nor that we should avoid
considering a reference to this spec for mention in a future PRECIS
revision or update.  Indeed, while it would strike me as a bit
strange, I would not have a strong objection to this document
updating PRECIS to specify the cross-reference.

* Put in an explicit warning that these subsets/profiles may not be
sufficient if strings are going to be compared or matched to others,
and perhaps not even if they are to be displayed, if having the
reader see what the originator intended is important... unless
information is transmitted along with them that permits the recipient
system or reader to sort those things out.  The latter is important
and may deserve mention: I don't think it is where you want to go
but, if the document specified that any string transmitted using one
of its subsets/profiles must be accompanied by language, locale, and
directionality information, it would at least reduce the perception
of risk (see the sketch after this list).

* Include some pointers, starting with one to PRECIS, to more
processing-oriented specs for those who need them.

* Rethink having three subsets rather than one and/or explain better
why one would want to choose any but the most restrictive.  The fact
that some programming, data structure, or specification language
allows things is not good enough: for Internet purposes, they might
just be wrong.

* For private-use code points, if the conclusion is to retain them as
allowed, add some text strongly suggesting that any protocol (or
"container") that allows such code points make provision for
identifying the agreement that specifies their interpretation (again,
see the sketch after this list).  I could even see an IANA registry
of profiles for private-use code point collections unless those,
somehow, needed to be secret.
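
As a sketch of the last two bullets combined, in Python and purely
illustrative: the field names, including "pua-profile", are
hypothetical, not from the I-D or any registry.

    # Private-use ranges: BMP PUA plus planes 15 and 16.
    PUA_RANGES = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]

    def uses_private_use(s: str) -> bool:
        return any(lo <= ord(ch) <= hi
                   for ch in s for (lo, hi) in PUA_RANGES)

    def envelope(text: str, lang: str, direction: str,
                 pua_profile: str | None = None) -> dict:
        """Wrap a string with language/direction metadata and, where
        private-use code points appear, an identifier for the private
        agreement that governs their interpretation."""
        if uses_private_use(text) and pua_profile is None:
            raise ValueError("PUA code points present but no agreement identified")
        msg = {"text": text, "lang": lang, "dir": direction}
        if pua_profile is not None:
            msg["pua-profile"] = pua_profile
        return msg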

Finally and FWIW, I don't buy "they don't seem to be adopting PRECIS"
as a helpful argument in this case, because an equally good
explanation for that behavior is that we have not done a good enough
job of explaining why PRECIS (or specs with even more specific
detail) is important.  Instead, we have said things that sound to
the naive or lazy reader like "it is ok to use whatever XML, JSON,
etc., allow" with the implied "why go to the extra work".  If the
IESG were to require -- as Section 6 of RFC 2277 (BCP 18) specifies --
that every specification involving the use of non-ASCII characters
contain an "Internationalization Considerations" section and, if the
spec did not point to PRECIS, that it explain why not and,
presumably, what was being done instead, we would probably be seeing
a different adoption pattern or would understand why not.



--On Thursday, February 13, 2025 12:46 -0800 Tim Bray
<tbray@xxxxxxxxxxxxxx> wrote:

>  Not going to go point-by-point through John's lengthy missive
> and the following correspondence, but the following are worth
> addressing:
> 
> 
>    1. Why Unichars exists
> 
> 
> People make new container formats. When they do that, they have to
> decide what their character repertoire is.  For a general-purpose
> low-level container format, PRECIS is a hard sell; it's
> complicated, 43 pages, many references, full of discussions of
> history, and contains a bunch of policy choices that people are
> reluctant to adopt without understanding. [Example: Unassigned code
> points are disallowed.]  I'm not saying it's bad, just that
> empirically, people devising new container formats tend not to
> adopt it.
> 
> Other people build new protocols around JSON or CBOR or whatever
> and, unless they think about it, end up willing to accept those
> formats' default character subsets, which is bad.  However, just
> like people devising formats, empirically they don't seem to be
> adopting PRECIS.
> 
> So, Unichars tries to do two things: First, offer a succinct
> self-contained discussion of the issues with Unicode code points
> and why some are "problematic".  Second, provide
> better-than-default candidate subsets which are easy to understand,
> specify, and support, and avoid problematic code points.
> 
> As John points out, there are many other discussions of these
> issues from various places. But I don't think there's anything
> else around that offers these two benefits, and the process of
> developing Unichars has not weakened that opinion.
> 
> 
>    2. Unichars PRECIS profiles
> 
> 
> I actually do see value in the PRECIS profiles.  Specifically, two
> things.
> 
> First, the 3 Unichars subsets aim to cover the basic things that
> people who are designing low-level container formats and protocols
> should think about. In that context there are going to be cases
> where they quite reasonably won't want to impose PRECIS's two
> existing protocols on their producers or consumers, and would be
> glad to use Unichars Assignables or some such. But, in those cases,
> I think Unichars ought to at least mention the existence of PRECIS
> - if they decide that the content of some particular field has a
> high chance of being displayed for consumption by humans, they
> should really at least look at PRECIS to inform their decisions
> about what to do.
> 
> The second reason is the mirror image.  For someone who hears that
> PRECIS constitutes the IETF rules for Unicode, realizes that some
> fields are just opaque record keys or URIs or enum values, and
> questions why they need the cost and complexity of enforcing PRECIS,
> it'd be nice if there were a pointer to Unichars in the registry
> so they know there's a simpler option.
> 
> So on balance I'd keep the registrations. But if I'm in the
> minority on this one I will cheerfully yield without much further
> argument.
> 
> 
>    3. Private Use areas
> 
> 
> I searched the correspondence and couldn't find the discussion
> behind this one. My recollection is that someone argued strongly
> that Unicode says these code points *are considered to be assigned*
> (I checked, and indeed it does say that) and that there might well
> be scenarios where they are used as intended as part of the
> protocol definition, so they shouldn't be seen as problematic.
> 
> Once again, this is not a hill I would die on; if consensus is that
> PUA code points should be classified as "problematic", OK.
> 
> 
>    - T

