Tim,

Thanks for this note. It gives me a better understanding of your motivation, something that, IMO, should be clearer in the document. Let me make some observations that may clarify the difference in perspective for others and that might suggest some adjustments to the document. You (and the community) may not go where this note suggests, but it might be worth considering as a perspective. It might also help with details about the container / protocol slot distinction Peter asked about. Or not.

I assume you and Paul know everything in the next few paragraphs, but for context and for others evaluating the document and the various reviews...

I think it is clear, at least to anyone who has spent even a bit of time studying the issues, that composing strings for others to use (especially in comparison with things they already have or know) presents huge opportunities for mishap, mischief, or worse. We usually assume that identifiers are more susceptible to problems than free-text strings, but maybe that is just because perceiving and using one identifier when another was intended is likely to have worse consequences than "just" misreading a string. Homographs are the most publicized of the problems but, depending on the character set encoding and the choice of display type families ("fonts"), anyone who has confused a digit "1" with a lower-case "L" should understand that they can occur even if the repertoire is limited to ASCII. Whether homographs are more important than other potential problems is debatable, but the outcome of that debate depends on circumstances.

PRECIS, IDNA2008 (for the rather specific IDN case), the ICANN MSR and LGR work (for even more specific cases), and parts of the W3C and Unicode specs are all focused on minimizing the risks of those problems. Put more positively, those specs are about increasing the odds that a string chosen by a naive person for use as an identifier will perform well and that text strings transmitted by one human to another will be seen, heard, and understood as intended. As with the trivial "1"/"l" problem (or the opportunity it offers for malicious behavior), no collection of rules or list of code points is going to provide complete protection. A certain amount of educated human judgment is going to be needed and, yes, that is going to require more effort and understanding than a simple, minimal list of code points to be avoided... just as learning to competently use a different language or writing system does.

In the terms of your note below, is "better than default" useful? Perhaps, but we need to ask whether, by introducing even more options and, more important, giving people a good excuse to avoid more focused specifications in cases where those specifications are likely to be important, it wouldn't cause more harm than good. Is that going to be hard for some people? Yes. But routing is hard and we don't tell people setting up routing procedures "oh, just let it go, ignore those complex rules about packet formats and their meaning, put the packets out there and hope for the best". Nor do we consider simple substitution ciphers appropriate where confidentiality is needed, even though they are lots easier for non-experts to understand. Interoperability is often hard. The IETF has usually decided it is worth it.
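(To make the "1"/"l" and homograph point concrete, here is a short Python sketch; it is mine, purely illustrative, and not taken from the document or from PRECIS. It shows two strings that render identically in most fonts yet compare unequal, and that even NFC normalization plus case folding, useful as they are, will not reconcile them.)

    import unicodedata

    latin = "paypal"                  # all Latin letters
    mixed = "p\u0430yp\u0430l"        # U+0430 CYRILLIC SMALL LETTER A

    print(latin == mixed)             # False: same rendering, different code points

    def same_key(a: str, b: str) -> bool:
        # NFC normalization plus case folding handles case and
        # combining-mark variants, but not cross-script confusables,
        # so this comparison is still False.
        return (unicodedata.normalize("NFC", a).casefold()
                == unicodedata.normalize("NFC", b).casefold())

    print(same_key(latin, mixed))     # still False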
The above is the argument for the more complex procedures of those other specs and even for easier measures like doing case folding and normalization on string comparison, when those are appropriate and are used with the understanding that there are edge cases and that they are not complete solutions to all possible issues. If that sounds like I'm skeptical about whether, in actual practice, there are such things as "opaque record keys or URIs or enum values" and, if there were, whether even the level of protection these named subsets would offer would be worth the trouble: yes, that is probably the case.

While all of the above are arguments against publishing this document in its present form, there is another way to look at the problem. It is the way that, of necessity, programming languages and some data representation languages look at things. XML, for example, cannot have a PRECIS-style set of restrictions on the characters that can be represented. If it did, we wouldn't be able to write an RFC that described and illustrated some of the problem cases except, maybe, as images. Similarly, for the network, we could adopt something of a "recipient (or string user, interpreter, etc.) beware" model in which the only job of a network protocol or data structure is simply to ensure that whatever goes in comes out again. As the I-D more or less points out, where Unicode is involved that requires more than simply storing (or sending) the bits. It involves avoiding a collection of invalid code points and sequences at the coding level because, if they are stored and then interpreted by something other than the originating system(s), something odd might reasonably be expected to happen. Viewed from that perspective, the I-D does a fairly good job of identifying and explaining the most important problem cases. That distinction could be described as the difference between strings that are appropriate for processing and strings that are only appropriate for transmission. In that context, I'm not sure what a "container format" is or whether that distinction helps.

My comment about too many standards may be part of an argument against people inventing their own rules on a per-protocol, or similar, basis even if we could magically make that easy. While I will never, personally, be a big fan of "recipient beware" (a generalization of "buyer beware"), if the focus is on what can be transmitted or stored safely, with the assumption that the specific language and script issues with which PRECIS and many W3C specs are concerned are Someone Else's Problem, then a modified version of the document probably has value.
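(As a purely illustrative sketch of what "avoiding a collection of invalid code points and sequences at the coding level" might amount to, consider the following Python check. The function name and the exact rule set, surrogates plus noncharacters, are my assumptions for illustration; the I-D's actual subset definitions differ in detail.)

    def transmission_safe(s: str) -> bool:
        # Reject code points that are invalid or reserved at the coding
        # level, without any judgment about what the string "means".
        for ch in s:
            cp = ord(ch)
            # Surrogates (U+D800..U+DFFF): UTF-16 encoding artifacts,
            # never valid as scalar values in interchanged text.
            if 0xD800 <= cp <= 0xDFFF:
                return False
            # Noncharacters: U+FDD0..U+FDEF plus the last two code
            # points of every plane (U+xxFFFE and U+xxFFFF).
            if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
                return False
        return True

A recipient-beware model would guarantee only that strings passing a check of this general kind round-trip intact; everything PRECIS worries about would remain, as above, Someone Else's Problem.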
What modifications would that require?

* Make it very clear that this spec is about transmission and storage and that it does not address the topics and issues needed to _use_ the character strings (especially if they might be treated as identifiers), or even what should be stored or transmitted, only what is needed to transmit and store them safely.

* Get rid of the PRECIS profiles. PRECIS addresses a different problem, and the main arguments against registering these subsets as PRECIS profiles are that they are not consistent with solutions to that problem and that doing so violates the direction against adding new profiles without strong justification (and, I would add, justification in the PRECIS context). Saying something about the difference between the PRECIS (e.g.) topic area and this spec would help answer the question of why this spec is useful. That does not mean you should not mention PRECIS, nor that we should avoid considering a reference to this spec for mention in a future PRECIS revision or update. Indeed, while it would strike me as a bit strange, I would not have a strong objection to this document updating PRECIS to specify the cross-reference.

* Put in an explicit warning that these subsets/profiles may not be sufficient if strings are going to be compared or matched against others, and perhaps not even if they are to be displayed when it matters that the reader see what the originator intended... unless information is transmitted along with them that permits the recipient system or reader to sort those things out. The latter is important and may deserve mention: I don't think it is where you want to go but, if the document specified that any string transmitted using one of its subsets/profiles must be accompanied by language, locale, and directionality information, it would at least reduce the perception of risk.

* Include some pointers, starting with one to PRECIS, to more processing-oriented specs for those that need them.

* Rethink having three subsets rather than one, and/or explain better why one would want to choose any but the most restrictive. The fact that some programming, data structure, or specification language allows things is not good enough: for Internet purposes, they might just be wrong.

* For private-use code points, if the conclusion is to retain them as allowed, add some text strongly suggesting that any protocol (or "container") that allows such code points make provision for identifying the agreement that specifies their interpretation. I could even see an IANA registry of profiles for private-use code point collections unless those, somehow, needed to be secret.

Finally, and FWIW, I don't buy "they don't seem to be adopting PRECIS" as a helpful argument in this case, because an equally good explanation for that behavior is that we have not done a good enough job of explaining why PRECIS (or specs with even more specific detail) is important. Instead, we have said things that sound to the naive or lazy reader like "it is ok to use whatever XML, JSON, etc., allow", with the implied "why go to the extra work". If the IESG were to require -- as Section 6 of RFC 2277 (BCP 18) specifies -- that every specification involving the use of non-ASCII characters contain an "Internationalization Considerations" section and that, if the spec did not point to PRECIS, it explain why not and, presumably, what was being done instead, we would probably be seeing a different adoption pattern or would understand why not.

--On Thursday, February 13, 2025 12:46 -0800 Tim Bray <tbray@xxxxxxxxxxxxxx> wrote:

> Not going to go point-by-point through John's lengthy missive
> and the following correspondence, but the following are worth
> addressing:
>
> 1. Why Unichars exists
>
> People make new container formats. When they do that, they have to
> decide what their character repertoire is. For a general-purpose
> low-level container format, PRECIS is a hard sell; it's
> complicated, 43 pages, many references, full of discussions of
> history, and contains a bunch of policy choices that people are
> reluctant to adopt without understanding. [Example: Unassigned code
> points are disallowed.] I'm not saying it's bad, just that
> empirically, people devising new container formats tend not to
> adopt it.
>
> Other people build new protocols around JSON or CBOR or whatever
> and, unless they think about it, end up willing to accept those
> formats' default character subsets, which is bad. However, just
> like people devising formats, empirically they don't seem to be
> adopting PRECIS.
>
> So, Unichars tries to do two things: First, offer a succinct,
> self-contained discussion of the issues with Unicode code points
> and why some are "problematic". Second, provide
> better-than-default candidate subsets which are easy to understand,
> specify, and support, and which avoid problematic code points.
>
> As John points out, there are many other discussions of these
> issues from various places. But I don't think there's anything
> else around that offers these two benefits, and the process of
> developing Unichars has not weakened that opinion.
>
> 2. Unichars PRECIS profiles
>
> I actually do see value in the PRECIS profiles. Specifically, two
> things.
>
> First, the 3 Unichars subsets aim to cover the basic things that
> people who are designing low-level container formats and protocols
> should think about. In that context there are going to be cases
> where they quite reasonably won't want to impose PRECIS's two
> existing protocols on their producers or consumers, and would be
> glad to use Unichars Assignables or some such. But, in those cases,
> I think Unichars ought to at least mention the existence of PRECIS
> - if they decide that the content of some particular field has a
> high chance of being displayed for consumption by humans, they
> should really at least look at PRECIS to inform their decisions
> about what to do.
>
> The second reason is the mirror image. For someone who hears that
> PRECIS constitutes the IETF rules for Unicode, realizes that some
> fields are just opaque record keys or URIs or enum values, and
> questions why they need the cost and complexity of enforcing
> PRECIS, it'd be nice if there were a pointer to Unichars in the
> registry so they know there's a simpler option.
>
> So on balance I'd keep the registrations. But if I'm in the
> minority on this one I will cheerfully yield without much further
> argument.
>
> 3. Private Use areas
>
> I searched the correspondence and couldn't find the discussion
> behind this one. My recollection is that someone argued strongly
> that Unicode says these code points *are considered to be assigned*
> (I checked, and indeed it does say that) and that there might well
> be scenarios where they are used as intended as part of the
> protocol definition, so they shouldn't be seen as problematic.
>
> Once again, this is not a hill I would die on; if consensus is that
> PUA code points should be classified as "problematic", OK.
>
> - T