#1980: Quibbles about characters and code points
https://datatracker.ietf.org/doc/draft-bray-unichars/
There are 1,114,112 code points; as of Unicode 15.1 (2023), fewer
than 150,000 have been assigned to characters. It is difficult to
specify that unassigned code points should be avoided because they
regularly become assigned when new characters are added to Unicode.
Section 2 of the I-D provides a description of characters and code points for local use in the document. The paragraph quoted above might be improved as follows:
- Mention the extent of the code point space in hex (the highest code point is 0x10FFFF, or 0x10FFFD if you prefer, so the space holds 0x110000 code points) next to or instead of the weird decimal number.
- The phrase "It is difficult to specify that unassigned code points should be avoided" understates the problem. We explicitly do not want to forbid unassigned code points, because some of them will later become assigned.
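To make both points concrete, here is a minimal Python sketch. The names MAX_CODE_POINT, CODE_POINT_COUNT and is_unassigned are mine, not the draft's, and the "unassigned" test relies on whatever Unicode version the local Python build ships. That is the problem in miniature: the answer changes as Unicode grows.

```python
import sys
import unicodedata

# Code points run from U+0000 through U+10FFFF inclusive.
MAX_CODE_POINT = 0x10FFFF
CODE_POINT_COUNT = MAX_CODE_POINT + 1  # 0x110000, the "weird decimal number"


def is_unassigned(cp):
    """True if the bundled Unicode database puts cp in category Cn.

    Cn covers reserved (not yet assigned) code points and noncharacters.
    The result depends on unicodedata.unidata_version, so a code point
    that is Cn today may be an assigned character in a later version.
    """
    return unicodedata.category(chr(cp)) == "Cn"


print(hex(CODE_POINT_COUNT), CODE_POINT_COUNT)  # 0x110000 1114112
print(sys.maxunicode == MAX_CODE_POINT)         # True
print("Unicode data version:", unicodedata.unidata_version)
```

A filter built on is_unassigned would reject different input depending on the Unicode version it runs against, which is why forbidding unassigned code points is undesirable rather than merely difficult.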
#1981: "Transformation Formats" might be clearer as "character encoding"?
Unicode describes a variety of "transformation formats", ways to
marshal code points into byte sequences.
Section 2.1 is labelled "Transformation Formats" and uses that term instead of the more familiar "character encoding" or "character encoding form". It is true that "UTF" stands for "Unicode Transformation Format" and is part of the names of Unicode's character encodings, but that seems better suited to a footnote than to general use.
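For readers less familiar with the terminology, the following Python sketch shows what the quoted sentence describes: one code point marshalled into three different byte sequences. The helper name encodings is mine, and the codec names are the ones Python uses, not terms from the draft.

```python
def encodings(ch):
    """Return the hex bytes of one character under three Unicode encodings."""
    names = ("utf-8", "utf-16-be", "utf-32-be")
    return {name: ch.encode(name).hex(" ") for name in names}


# U+1F600 lies outside the Basic Multilingual Plane, so UTF-16 needs a
# surrogate pair and UTF-8 needs four bytes.
for name, hex_bytes in encodings("\U0001F600").items():
    print(f"{name:10} {hex_bytes}")
```

Python calls these "encodings" and the standard library exposes them through str.encode, which is some evidence that "character encoding" is the term programmers meet first.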
#1982: C1 controls, Unicode line endings
Section 2.2.2 introduces control codes and talks specifically about the C0 controls. The C1 controls are mentioned en passant in Section 3:
Also, the poorly supported line endings U+2028 (LINE SEPARATOR) and U+2029 (PARAGRAPH SEPARATOR) aren't mentioned.
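As a small illustration of why these deserve a mention, the Python sketch below shows the C0 and C1 ranges and how unevenly the Unicode line endings are treated even within one runtime. The helper names is_c0 and is_c1 are mine, and the behaviour shown is Python's, which I would not assume of other platforms.

```python
import unicodedata


def is_c0(cp):
    """C0 controls: U+0000 through U+001F."""
    return cp <= 0x1F


def is_c1(cp):
    """C1 controls: U+0080 through U+009F."""
    return 0x80 <= cp <= 0x9F


# U+0085 (NEL) is a C1 control; U+2028 and U+2029 are separators, not controls.
text = "one\u2028two\u2029three\u0085four"

print(text.splitlines())      # str.splitlines treats all three as line breaks
print(len(text.split("\n")))  # splitting on "\n" alone finds no breaks
for ch in "\u0085\u2028\u2029":
    print(f"U+{ord(ch):04X}", unicodedata.category(ch))
```

Two common ways of splitting lines disagree about the same string, which is the sort of inconsistency "poorly supported" refers to.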
replacing problematic code points, ideally with "�" (U+FFFD,
REPLACEMENT CHARACTER), although some popular software platforms,
notably Java, use "?".
This is probably incorrect.
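For comparison only, here is how the two replacement conventions look in Python. This says nothing about what Java actually does. It only shows that a single platform can use U+FFFD in one direction and "?" in the other, so any claim about a platform's replacement character needs to say which operation is meant.

```python
# Decoding invalid bytes: the undecodable byte becomes U+FFFD.
bad_bytes = b"caf\xff"
decoded = bad_bytes.decode("utf-8", errors="replace")
print(repr(decoded))

# Encoding to a charset that lacks the character: it becomes "?".
encoded = "caf\u00e9".encode("ascii", errors="replace")
print(encoded)
```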
#1984: Security consideration statement perhaps too bold?
Note that the Unicode-character subsets specified in this document
include a successively-decreasing number of problematic code points,
and thus should be less and less susceptible to vulnerabilities. The
Section 4.3 subset, "Unicode Assignables", excludes all of them.
Saying that the Section 4.3 subset excludes "all of them" suggests that no exploits remain. The preceding paragraph mentions that RFC 8264's security considerations apply here as well, and that document is fairly thorough. Since homographs cannot be eliminated, maybe this should say something slightly different? Perhaps:
Note that the Unicode-character subsets specified in this document
successively exclude an increasing number of problematic code points,
and thus should be less and less susceptible to many of these exploits.
The Section 4.3 subset, "Unicode Assignables", excludes all of the
functionally problematic code points.
I should add, however, that UTS #55 probably should be cited, or at least considered. "Trojan Source" attacks using bidi formatting characters can affect protocol text and document formats, and this is probably a gap that needs to be addressed. Homographs and confusables are probably worth a couple of words as well.
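To show why a subset that excludes only controls, surrogates and noncharacters does not cover this case, here is a rough Python sketch that flags the bidi embedding, override and isolate characters used in "Trojan Source" attacks. The function name find_bidi_controls and the choice of bidi classes are my own assumptions, not something taken from the draft or from UTS #55.

```python
import unicodedata

# Bidi classes of the explicit formatting characters:
# U+202A..U+202E (LRE, RLE, PDF, LRO, RLO) and U+2066..U+2069 (LRI, RLI, FSI, PDI).
BIDI_FORMATTING_CLASSES = {
    "LRE", "RLE", "PDF", "LRO", "RLO", "LRI", "RLI", "FSI", "PDI",
}


def find_bidi_controls(text):
    """Return (index, code point, bidi class) for each bidi formatting character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.bidirectional(ch))
        for i, ch in enumerate(text)
        if unicodedata.bidirectional(ch) in BIDI_FORMATTING_CLASSES
    ]


sample = "access = 'user\u202e // admin only\u2066'"
print(find_bidi_controls(sample))

# These are ordinary assigned format characters (category Cf).
print(unicodedata.category("\u202e"))
```

Because they are assigned format characters, a check based on general category alone would let them through, which is the gap described above.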
- T