Not going to go point-by-point through John’s lengthy missive and the subsequent correspondence, but a few things are worth addressing:
- Why Unichars exists
People make new container formats. When they do that, they have to decide what their character repertoire is. For a general-purpose low-level container format, PRECIS is a hard sell; it’s complicated: 43 pages, many references, lots of historical discussion, and a bunch of policy choices that people are reluctant to adopt without understanding them. [Example: Unassigned code points are disallowed.] I’m not saying it’s bad, just that empirically, people devising new container formats tend not to adopt it.
Other people build new protocols around JSON or CBOR or whatever and, unless they think about it, end up accepting those formats’ default character subsets, which is bad. However, just like the people devising formats, empirically they don’t seem to be adopting PRECIS.
So, Unichars tries to do two things. First, offer a succinct, self-contained discussion of the issues with Unicode code points and why some are “problematic”. Second, provide better-than-default candidate subsets which are easy to understand, specify, and support, and which avoid problematic code points.
As John points out, there are many other discussions of these issues from various places. But I don’t think there’s anything else around that offers these two benefits, and the process of developing Unichars has not weakened that opinion.
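To make “problematic” concrete, here is a rough sketch of the kind of check a format designer might end up writing. The specific ranges below (surrogates, legacy controls, noncharacters) are my illustration of the general idea, not a restatement of the exact Unichars subset definitions:

    def is_problematic(cp: int) -> bool:
        # Surrogates: not Unicode scalar values, can't appear in well-formed UTF-8
        if 0xD800 <= cp <= 0xDFFF:
            return True
        # Legacy controls: C0 except tab/LF/CR, plus DEL and the C1 range
        if (cp < 0x20 and cp not in (0x09, 0x0A, 0x0D)) or 0x7F <= cp <= 0x9F:
            return True
        # Noncharacters: U+FDD0..U+FDEF and the last two code points of each plane
        if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
            return True
        return False

    def acceptable(s: str) -> bool:
        return not any(is_problematic(ord(ch)) for ch in s)

The point being that a check along these lines is a few lines of code a designer can read and adopt in one sitting, whereas specifying and enforcing PRECIS is a much bigger ask.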
- Unichars PRECIS profiles
I actually do see value in the PRECIS profiles. Specifically, two things.
First, the three Unichars subsets aim to cover the basic things that people designing low-level container formats and protocols should think about. In that context there are going to be cases where they quite reasonably won’t want to impose PRECIS’s two existing protocols on their producers or consumers, and would be glad to use Unichars Assignables or some such. But in those cases, I think Unichars ought to at least mention the existence of PRECIS: if they decide that the content of some particular field has a high chance of being displayed for consumption by humans, they should really at least look at PRECIS to inform their decisions about what to do.
The second reason is the mirror image. Someone might hear that PRECIS constitutes the IETF rules for Unicode, realize that some of their fields are just opaque record keys or URIs or enum values, and question why they need the cost and complexity of enforcing PRECIS; it’d be nice if there were a pointer to Unichars in the registry so they know there’s a simpler option.
So on balance I’d keep the registrations. But if I’m in the minority on this one I will cheerfully yield without much further argument.
- Private Use areas
I searched the correspondence and couldn’t find the discussion behind this one. My recollection is that someone argued strongly that Unicode says these code points are considered to be assigned (I checked, and indeed it does say that), and that there might well be scenarios where they are used, as intended, as part of a protocol definition, so they shouldn’t be seen as problematic.
Once again, this is not a hill I would die on; if consensus is that PUA code points should be classified as “problematic”, OK.
- T