(Sorry -- prior copy sent from wrong address and apparently not posted to the Last Call list)

Hi.

Summary (since this Last Call response is intended to be detailed and hence may be a bit long):

This document is not suitable for approval and publication in its current form. It overlaps with a large number of documents from the IETF and other bodies without adding anything of significance other than new terminology that is at least as likely to add to confusion as to help. It appears to be inconsistent with general IETF principles about interoperability and is inconsistent with at least one fundamental principle of PRECIS for new profiles (to which it adds three). It does draw together and explain a good deal of material and terminology that, with some editing, could be very useful to the community but more likely as an Informational / tutorial document than as a Standards Track document.

=====================

I have reviewed the current version of this document. It contains some material that could be very useful, but other material makes it inappropriate for approval and publication at this point. It also raises a general question/issue about whether the IETF wants to go in this direction, a question whose answers could inform a revision or replacement or possibly make this draft a dead end.

One way to look at that general issue suggests trying to address the question of whether or not the specification or some derivative of it will be a positive contribution to the IETF and the Internet rather than whether or not it contains details that would require further revision (topics that are addressed later in this review). In particular, it occurs to me that the main concern of the authors might be to clearly identify and name the permitted Unicode character collections ("subsets") allowed by particular programming and data description languages.
That would probably be worthwhile as long as it steers clear of text that appears to recommend "appropriate choices" or guidance for selecting one of them for use in particular protocols or data formats.

The details below contain three parts: a discussion of the General Issue mentioned above, specific comments about the content of this document, and some quibbles and nits. I am prepared to make much more specific comments about major revisions or possible alternative documents that utilize significant parts of the material in this I-D if the IESG and authors would welcome those comments.

_The General Issue_

There are already many specifications about how to subset or profile Unicode for various purposes. Examples (by no means a comprehensive list) include the following. The Unicode Consortium has a document that is now called "Unicode Identifiers and Syntax" [UAX31], the current version of which describes eleven different sets of rules (and hence different subsets) for such identifiers (see Section 1.4 of that document). It is supplemented by a "Unicode Security Mechanisms" document [UTS39], which is a different security specification than UTR36, cited in the I-D. ICANN has a "maximum repertory" [MSR] and language and/or script-specific subsets [LGR] for non-ASCII domain names in the root and (at least) the second level of the DNS. For simple text rather than identifiers specifically, W3C has developed (and continues to develop) language-specific recommendations for various languages, standards for language (or locale) negotiation, and so on [W3CSpecdev][W3CLanguage]. And, of course, the IETF has general guidelines such as RFC 2277, the IDNA2008 collection for domain names (RFC 5890ff), and a few more profiles/subsets and ways to generate them specified by PRECIS in RFCs 8264-8266 and three(!) associated IANA registries.
The collection of data format specifications called out in Sections 5 and 6 of the I-D more or less amounts to additional standards from which to choose and, even if it did not raise other issues (see below), appears to violate the principle discussed in Section 5.1 of RFC 8264 (the PRECIS base spec).

There are other issues, some of which are not called out by PRECIS and that should perhaps be addressed in an update to it at some stage. An I-D like this one should not make things worse or more complicated. For example, for complex scripts and constructions, simple lists of code points (or "repertoire subsets") aren't good enough -- one needs rules about the relationships among characters and character sequences. Without them, there are considerable opportunities for getting into trouble with strings that appear to match (or are perceived as the same) not matching and vice versa. Some of the ICANN work addresses those issues, as does some of the W3C work. IDNA2008 attempted to address some of those that went beyond simple lists of acceptable code points, but the mechanisms chosen (and carried into PRECIS as "Contextual Rule Required") turned out to not be good enough and too difficult to maintain as Unicode expanded. On another dimension, I've been told that Unicode is now making recommendations based on the "Plane" in which particular code points fall but, while the document mentions Planes in passing in Sections 2.2.3 and 4.3, it fails to identify that work or the reasons for it.

As I trust the IESG knows, there are some old (and rather bad) jokes about the advantage of standards being that there are so many of them from which to choose. To the extent to which we still believe in interoperability as a fundamental principle of the IETF, many choices -- at least without very specific guidance about how to match the needs of a protocol or group of protocols to a particular choice -- are an invitation to interoperability problems and, in many cases, to security ones.
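
The point above about strings that appear to match but do not (and that a simple list of permitted code points does not help with) can be illustrated with a minimal sketch. The function name `passes_simple_filter` and the categories it rejects are assumptions made for illustration only; they are not any of the subsets defined in the I-D, nor a PRECIS profile.

```python
import unicodedata

def passes_simple_filter(s):
    """Hypothetical repertoire check: reject surrogates (Cs), control
    characters (Cc), and unassigned code points/noncharacters (Cn)."""
    return all(unicodedata.category(ch) not in ("Cs", "Cc", "Cn") for ch in s)

# Two encodings of what a reader perceives as the same letter.
precomposed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

# Both pass the repertoire check, yet they do not compare equal as-is.
both_pass = passes_simple_filter(precomposed) and passes_simple_filter(decomposed)
raw_equal = precomposed == decomposed

# Normalization (NFC here) makes this particular pair compare equal.
nfc_equal = (unicodedata.normalize("NFC", precomposed)
             == unicodedata.normalize("NFC", decomposed))

# Normalization does not help with look-alikes from different scripts.
latin_a = "\u0061"       # U+0061 LATIN SMALL LETTER A
cyrillic_a = "\u0430"    # U+0430 CYRILLIC SMALL LETTER A
lookalikes_equal = (unicodedata.normalize("NFKC", latin_a)
                    == unicodedata.normalize("NFKC", cyrillic_a))

print(both_pass, raw_equal, nfc_equal, lookalikes_equal)
```

In this sketch the repertoire check accepts every string involved, so any protection against mismatches or look-alikes has to come from rules about sequences and comparison rather than from the list of code points.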
Most of the people who are familiar with the above collection of specs -- or even those who have studied the collection of one of the organizations listed -- and who are trying to get good criteria for strings (understandable and not a security threat) will be able to make decisions about what to do even if they involve slight variations from those specs. Most of those with malicious intent will too even if they choose to focus on what the documents tell them not to do. But, for those purposes, the specs should be based on or be making normative recommendations, not just describing subsets of Unicode. All of the above do that; it is not clear that this I-D does (more on that below). Even if it did, the question the community (and especially the IESG) should be asking is what this document, and the three new PRECIS profiles, add that outweighs the costs of additional specifications and too many choices. I'd have fewer concerns about this document if it provided a clear answer to that question and, ideally, a comparison to at least the IETF collection of Repertoire/Subset/Rule documents that would give the less experienced reader useful guidance of where to turn. AFAICT, it does not.

==============

Specific comments about the content of this document:

(1) Most of Sections 1, 2, and at least parts of 3, which together describe the problem, are quite good although there are some descriptions that one could quibble about and that probably deserve further consideration and revision. Those sections might make a good standalone document or an addition or appendix to the PRECIS work. On the other hand, they are probably inappropriate as they stand. As a trivial example, private use code points are used, as the document (and the Unicode specs) indicate, by private agreements among cooperating parties. But different cooperating parties may have very different uses for them and use them differently, making them a threat to general interoperability.
Therefore saying that they are not problematic and are reasonable for use in general-purpose Unicode subsets is, well, problematic.

(2) The three subsets of Section 4, especially if considered as PRECIS Profiles (see Section 6), probably violate the basic assumption/goal of PRECIS, which is to identify sets of characters/code points (again "repertoires" or "Unicode Subsets" if one prefers) that are actually recommended under various, specified, circumstances. The first sentence of Section 4 describes these as "specifying acceptable content", but, as several of the other documents mentioned in this review explain, simply excluding a few "problematic" code points is insufficient for a reasonable recommendation -- better than "set of all Unicode code points" or "just send valid UTF-8" as mentioned in this draft and elsewhere -- but, in practice, not very much better. As mentioned above, they also violate an explicit requirement of PRECIS (Section 5.1 of RFC 8264) that directs against adding additional profiles without clear justification. This document does not appear to even attempt to provide that justification; indeed, the last sentence of the introductory paragraph to its Section 6 very nearly says that these are inadequate.

(3) The last paragraph of Section 7, Security Considerations, indicates that the use of these subsets will make strings that follow them "less and less susceptible to vulnerabilities". While that is true, many of the other documents mentioned above, including PRECIS itself, strongly suggest that it is misleading in the sense that the three subsets listed will provide only minimal protection against accidental security-threatening problems and even less against attacks.

(4) Another possible distinction is between text strings that are actually to be used (including processing in any way other than, e.g., copying) on the Internet and those that are somehow transmitted without being touched. Even copying can be fragile for some strings.
That distinction is important to the proposed PRECIS profiles because PRECIS is definitely about use and string comparison in particular. As soon as one is going to compare strings or otherwise do i18n processing, multiple issues arise including the possibility of false positives (or negatives) on comparison, subtle problems with bidirectional strings, and so on. The draft appears to ignore all of those issues. That not only makes the proposed PRECIS profiles inappropriate as PRECIS additions but raises security issues that should, at least, be documented more exactly than handwaving references to other documents that do address those issues.

=============

Quibbles and nits:

(i) Section 1.1 (notation): Since the \uNNNN form is used later in the document (albeit in a JSON-related example), it is probably worth mentioning the Unicode definition of that form along with the U+NNNN one, ideally including an explanation of when one or the other is preferred (or pointing to the relevant section of the Unicode spec about that).

(ii) Section 2.2.2.2: Given the use by several programming languages and associated data representations of hex zero as a string terminator, it is likely to occur in data without programming errors being involved. The section should probably call that case out rather than just saying "including zero".

(iii) While I recommend against registration of these subsets as PRECIS profiles above, I believe that I was the first to recommend that be done during review of a fairly early draft (IIR, maybe even an AD-requested one perhaps around -03 or -04) when I thought the spec was going to evolve in a different direction. That omission from the acknowledgments calls the care with which that section (and maybe other parts of the document) were constructed into doubt.
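
For nits (i) and (ii), a minimal sketch of how the two notations relate and of how a zero code point can appear in ordinary data. The helper name `u_plus` is hypothetical; the JSON behavior shown is that of Python's standard `json` module with its default settings, not anything specified by the I-D.

```python
import json

def u_plus(ch):
    """Format a single code point in U+NNNN notation (at least four hex digits)."""
    return f"U+{ord(ch):04X}"

# Inside the Basic Multilingual Plane the two notations carry the same digits.
e_acute = "\u00e9"
bmp_name = u_plus(e_acute)            # U+NNNN form
bmp_json = json.dumps(e_acute)        # JSON \uNNNN escape

# Outside the BMP, JSON's \uNNNN form needs a surrogate pair, so the
# escaped digits no longer match the U+NNNN value of the code point.
face = "\U0001F600"
astral_name = u_plus(face)
astral_json = json.dumps(face)

# U+0000 can be carried in JSON text and survives a round trip ...
with_nul = json.loads('"a\\u0000b"')

# ... but a consumer that treats zero as a string terminator (C-style)
# would see only the bytes before it.
seen_by_nul_terminated_reader = with_nul.encode("utf-8").split(b"\x00")[0]

print(bmp_name, bmp_json, astral_name, astral_json)
print(len(with_nul), seen_by_nul_terminated_reader)
```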
Thanks,
   john

--On Monday, February 3, 2025 16:46 -0800 The IESG <iesg-secretary@xxxxxxxx> wrote:

> The IESG has received a request from an individual submitter to
> consider the following document: - 'Unicode Character Repertoire
> Subsets' <draft-bray-unichars-10.txt> as Proposed Standard
>
> The IESG plans to make a decision in the next few weeks, and
> solicits final comments on this action. Please send substantive
> comments to the last-call@xxxxxxxx mailing lists by 2025-03-03.
> Exceptionally, comments may be sent to iesg@xxxxxxxx instead. In
> either case, please retain the beginning of the Subject line to
> allow automated sorting.

[UAX31] https://unicode.org/reports/tr31/
[UTS39] https://www.unicode.org/reports/tr39/
[MSR] See, e.g., https://www.icann.org/en/system/files/files/msr-5-overview-24jun21-en.pdf
[W3CLanguage] https://www.w3.org/International/typography/gap-analysis/language-matrix.html
[W3CSpecdev] https://www.w3.org/International/i18n-drafts/nav/specdev