[Last-Call] Re: Last Call: <draft-bray-unichars-10.txt> (Unicode Character Repertoire Subsets) to Proposed Standard

John C Klensin <john-ietf@xxxxxxx> · Sun, 09 Feb 2025 13:19:38 -0500

(Sorry -- prior copy sent from wrong address and apparently not
posted to the Last Call list)

Hi.

Summary (Since this Last Call response is intended to be detailed and
hence may be a bit long):

This document is not suitable for approval and publication in its
current form.  It overlaps with a large number of documents from the
IETF and other bodies without adding anything of significance other
than new terminology that is at least as likely to add to confusion
as to help.  It appears to be inconsistent with general IETF
principles about interoperability and is inconsistent with at least
one fundamental principle of PRECIS for new profiles (to which it
adds three). It does draw together and explain a good deal of
material and terminology that, with some editing, could be very
useful to the community but more likely as an Informational /
tutorial document than as a Standards Track document.

   =====================

I have reviewed the current version of this document.  It contains
some material that could be very useful, but other material makes it
inappropriate for approval and publication at this point.  It also
raises a general question/issue about whether the IETF wants to go
in this direction, a question whose answers could inform a revision or
replacement or possibly make this draft a dead end.  One way to look
at that general issue is to ask whether the specification, or some
derivative of it, would be a positive contribution to the IETF and
the Internet, rather than whether it contains details that would
require further revision (topics that are addressed later in this
review).   In
particular, it occurs to me that the main concern of the authors
might be to clearly identify and name the permitted Unicode character
collections ("subsets") allowed by particular programming and data
description languages.  That would probably be worthwhile as long as
it steers clear of text that appears to recommend "appropriate
choices" or guidance for selecting one of them for use in particular
protocols or data formats.

The details below contain three parts: A discussion of the General
Issue mentioned above, Specific comments about the content of this
document, and some Quibbles and nits.   I am prepared to make much
more specific comments about major revisions or possible alternative
documents that utilize significant parts of the material in this I-D
if the IESG and authors would welcome those comments. 

_The General Issue_

There are already many specifications about how to subset or profile
Unicode for various purposes.   Examples (by no means a comprehensive
list) include the following.  The Unicode Consortium has a document
that is now called "Unicode Identifiers and Syntax" [UAX31], the
current version of which describes eleven different sets of rules
(and hence different subsets) for such identifiers (see Section 1.4
of that document).  It is supplemented by a "Unicode Security
Mechanisms" document [UTS39], which is a different security
specification than UTR36, cited in the I-D.  ICANN has a "Maximal
Starting Repertoire" [MSR] and language- and/or script-specific
subsets [LGR]
for non-ASCII domain names in the root and (at least) the second
level of the DNS.  For simple text rather than identifiers
specifically, W3C has developed (and continues to develop)
language-specific recommendations for various languages, standards
for language (or locale) negotiation, and so on
[W3CSpecdev][W3CLanguage].  And, of course, the IETF has general
guidelines such as RFC 2277, the IDNA2008 collection for domain names
(RFC5890ff), and a few more profiles/ subsets and ways to generate
them specified by PRECIS in RFCs 8264-8266 and three(!) associated
IANA registries.

The collection of data format specifications called out in Sections 5
and 6 of the I-D more or less amounts to additional standards from
which to choose and, even if it did not raise other issues (see
below), appears to violate the principle discussed in Section 5.1 of
RFC 8264 (the PRECIS base spec).

There are other issues, some of which are not called out by PRECIS
and that should perhaps be addressed in an update to it at some
stage.  An I-D like this one should not make things worse or more
complicated.  For example, for complex scripts and constructions,
simple lists of code points (or "repertoire subsets") aren't good
enough -- one needs rules about the relationships among characters
and character sequences.  Without them, there are considerable
opportunities for getting into trouble with strings that appear to
match (or are perceived as the same) not matching, and vice versa.
Some of the ICANN work addresses those issues, as does some of the W3C
work.  IDNA2008 attempted to address some of those that went beyond
simple lists of acceptable code points, but the mechanisms chosen
(and carried into PRECIS as "Contextual Rule Required") turned out to
be not good enough and too difficult to maintain as Unicode expanded.
On another dimension, I've been told that Unicode is now making
recommendations based on the "Plane" in which particular code points
fall but, while the document mentions Planes in passing in Sections
2.2.3 and 4.3, it fails to identify that work or the reasons for it.
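
To make the matching problem concrete, here is a minimal Python sketch.
It is illustrative only: the in_repertoire() check is a hypothetical
stand-in for a "repertoire subset" test and is not taken from the I-D,
PRECIS, or any of the other specs mentioned.  Both strings pass the
code point check and look the same when displayed, yet they compare as
unequal unless they are normalized first.

```python
import unicodedata

def in_repertoire(s: str) -> bool:
    # Hypothetical subset test: reject every code point whose general
    # category is in the "Other" group (Cc, Cf, Cs, Co, Cn).
    return all(unicodedata.category(ch)[0] != "C" for ch in s)

composed = "caf\u00e9"      # "e with acute" as one code point (U+00E9)
decomposed = "cafe\u0301"   # "e" followed by combining acute (U+0301)

both_in_repertoire = in_repertoire(composed) and in_repertoire(decomposed)
equal_as_code_points = composed == decomposed
equal_after_nfc = (unicodedata.normalize("NFC", composed)
                   == unicodedata.normalize("NFC", decomposed))

print(both_in_repertoire, equal_as_code_points, equal_after_nfc)
```

Normalization covers only this one case; it does nothing for the
script-specific sequence rules that the ICANN and W3C work deals with.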

As I trust the IESG knows, there are some old (and rather bad) jokes
about the advantage of standards being that there are so many of them
from which to choose.  To the extent to which we still believe in
interoperability as a fundamental principle of the IETF, many choices
--at least without very specific guidance about how to match the
needs of a protocol or group of protocols to a particular choice--
are an invitation to interoperability problems and, in many cases, to
security ones.

Most of the people who are familiar with the above collection of
specs -- or even those who have studied the collection of one of the
organizations listed -- and who are trying to get good criteria for
strings (understandable and not a security threat) will be able to
make decisions about what to do even if they involve slight
variations from those specs.  Most of those with malicious intent
will too even if they choose to focus on what the documents tell them
not to do.  But, for those purposes, the specs should be based on or
be making normative recommendations, not just describing subsets of
Unicode.  All of the above do that; it is not clear that this I-D
does (more on that below).  Even if it did, the question the
community (and especially the IESG) should be asking is what this
document, and the three new PRECIS profiles, add that outweighs the
costs of additional specifications and too many choices.  I'd have
fewer concerns about this document if it provided a clear answer to
that question and, ideally, a comparison to at least the IETF
collection of Repertoire/ Subset/ Rule documents that would give the
less experienced reader useful guidance of where to turn.  AFAICT, it
does not.

==============
Specific comments about the content of this document:

(1) Most of Sections 1, 2, and at least parts of 3, which together
describe the problem, are quite good although there are some
descriptions that one could quibble about and that probably deserve
further consideration and revision.  Those sections might make a good
standalone document or an addition or appendix to the PRECIS work.
On the other hand, they are probably inappropriate as they stand.  As
a trivial example, private use code points are used, as the document
(and the Unicode specs) indicate, by private agreements among
cooperating parties.  But different cooperating parties may have very
different uses for them and use them differently, making them a
threat to general interoperability.  Therefore saying that they are
not problematic and are reasonable for use in general-purpose Unicode
subsets is, well, problematic.
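
A small Python sketch of that interoperability concern follows.  The
two "private agreements" are invented for illustration; nothing here
comes from the I-D or the Unicode specs beyond the fact that U+E000 is
a private use code point.

```python
import unicodedata

PUA = "\ue000"  # first code point of the BMP Private Use Area

# Hypothetical private agreements made by two unrelated groups.
party_a = {PUA: "company logo"}
party_b = {PUA: "in-house currency sign"}

is_private_use = unicodedata.category(PUA) == "Co"
meanings_conflict = party_a[PUA] != party_b[PUA]

print(is_private_use, meanings_conflict)
```

A receiver outside either agreement cannot tell which meaning, if any,
was intended.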

(2) The three subsets of Section 4, especially if considered as
PRECIS Profiles (see Section 6), probably violate the
basic assumption/ goal of PRECIS, which is to identify sets of
characters/ code points (again "repertoires" or "Unicode Subsets" if
one prefers) that are actually recommended under various, specified,
circumstances.  The first sentence of Section 4 describes these as
"specifying acceptable content", but, as several of the other
documents mentioned in this review explain, simply excluding a few
"problematic" code points is insufficient for a reasonable
recommendation -- better than "set of all Unicode code points" or
"just send valid UTF-8" as mentioned in this draft and elsewhere--
but, in practice, not very much better.  As mentioned above, they
also violate an explicit requirement of PRECIS (Section 5.1 of RFC
8264) that directs against adding additional profiles without clear
justification.  This document does not appear to even attempt to
provide that justification; indeed, the last sentence of the
introductory paragraph to its Section 6 very nearly says that these
are inadequate.
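
As a rough illustration of why excluding a few code points is not much
of a recommendation, consider the Python sketch below.  The filter is
an assumed, simplified approximation of an "exclude the problematic
code points" rule, not the definition of any subset in the I-D.  A
mixed-script lookalike passes it just as easily as the plain Latin
string does.

```python
import unicodedata

def passes_exclusion_filter(s: str) -> bool:
    # Assumed, simplified filter: reject surrogates, and reject control
    # characters other than tab, LF, and CR.  Everything else passes.
    for ch in s:
        cat = unicodedata.category(ch)
        if cat == "Cs":
            return False
        if cat == "Cc" and ch not in "\t\n\r":
            return False
    return True

latin = "paypal"
mixed = "p\u0430yp\u0430l"  # U+0430 CYRILLIC SMALL LETTER A replaces "a"

both_pass = passes_exclusion_filter(latin) and passes_exclusion_filter(mixed)
same_string = latin == mixed

print(both_pass, same_string, unicodedata.name("\u0430"))
```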

(3) The last paragraph of Section 7, Security Considerations,
indicates that the use of these subsets will make strings that follow
them "less and less susceptible to vulnerabilities".  While that is
true, many of the other documents mentioned above, including PRECIS
itself, strongly suggest that it is misleading in the sense that the
three subsets listed will provide only minimal protection against
accidental security-threatening problems and even less against
attacks.

(4) Another possible distinction is between text strings that are
actually to be used (including processing in any way other than,
e.g., copying) on the Internet and those that are somehow transmitted
without being touched.   Even copying can be fragile for some
strings.  That distinction is important to the proposed PRECIS
profiles because PRECIS is definitely about use and string comparison
in particular.  As soon as one is going to compare strings or
otherwise do i18n processing, multiple issues arise including the
possibility of false positives (or negatives) on comparison, subtle
problems with bidirectional strings, and so on.  The draft appears to
ignore all of those issues.  That not only makes the proposed PRECIS
profiles inappropriate as PRECIS additions but raises security issues
that should, at least, be documented more exactly than by handwaving
references to other documents that do address those issues.

=============
Quibbles and nits:
(i) Section 1.1 (notation): Since the \uNNNN form is used later in
the document (albeit in a JSON-related example), it is probably worth
mentioning the Unicode definition of that form along with the U+NNNN
one, ideally including an explanation of when one or the other is
preferred (or pointing to the relevant section of the Unicode spec
about that).
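
For readers less familiar with the two notations, a short Python
sketch of how they can differ for the same character.  The choice of
U+1F600 is arbitrary, and the escape shown is what Python's json
module emits by default, not a form taken from the I-D's example.

```python
import json

ch = "\U0001F600"  # a code point outside the Basic Multilingual Plane

# U+NNNN names the code point itself.
u_plus = f"U+{ord(ch):04X}"

# JSON's \uNNNN escapes name UTF-16 code units, so a non-BMP character
# is written as a surrogate pair.
json_escaped = json.dumps(ch)

print(u_plus, json_escaped)
```

For code points in the BMP the two forms carry the same four hex
digits, which is probably why the distinction is easy to overlook.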

(ii) Section 2.2.2.2: Given the use by several programming languages
and associated data representations of hex zero as a string
terminator, it is likely to occur in data without programming errors
being involved.  The section should probably call that case out
rather than just saying "including zero".
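
A minimal sketch of the terminator convention, using Python's ctypes
module to stand in for a C-style string buffer (the sample bytes are
made up):

```python
import ctypes

data = b"abc\x00def"                     # NUL inside otherwise ordinary data
buf = ctypes.create_string_buffer(data)  # C-style buffer; a trailing NUL is appended

as_c_string = buf.value   # read as a NUL-terminated string: stops at the first NUL
as_raw_bytes = buf.raw    # read the whole buffer, including both NULs

print(as_c_string, as_raw_bytes)
```

Code that follows the C convention sees only the first three bytes,
with no programming error involved on either side.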

(iii) While I recommend against registration of these subsets as
PRECIS profiles above, I believe that I was the first to recommend
that be done during review of a fairly early draft (IIRC, maybe even
an AD-requested one, perhaps around -03 or -04) when I thought the
spec was going to evolve in a different direction.  That omission
from the acknowledgments calls the care with which that section (and
maybe other parts of the document) was constructed into doubt.

Thanks,
   john

--On Monday, February 3, 2025 16:46 -0800 The IESG
<iesg-secretary@xxxxxxxx> wrote:

> The IESG has received a request from an individual submitter to
> consider the following document: - 'Unicode Character Repertoire
> Subsets'   <draft-bray-unichars-10.txt> as Proposed Standard
> 
> The IESG plans to make a decision in the next few weeks, and
> solicits final comments on this action. Please send substantive
> comments to the last-call@xxxxxxxx mailing lists by 2025-03-03.
> Exceptionally, comments may be sent to iesg@xxxxxxxx instead. In
> either case, please retain the beginning of the Subject line to
> allow automated sorting.

[UAX31] https://unicode.org/reports/tr31/
[UTS39] https://www.unicode.org/reports/tr39/
[MSR] See, e.g.,
https://www.icann.org/en/system/files/files/msr-5-overview-24jun21-en.pdf
[W3CLanguage]
https://www.w3.org/International/typography/gap-analysis/language-matrix.html
[W3CSpecdev] https://www.w3.org/International/i18n-drafts/nav/specdev
