Thanks Tim.
In this case, I was also the reviewer. However, all comments from the WG do go through the working group and they made suggestions.
Thanks for that, Addison, and to the others whose input you are forwarding, assuming that’s not all you. A couple of the points have branched off into separate threads but I’ll consolidate other reactions here.
Section 2 of the I-D provides a description of characters and code points for local use in the document. The above quoted paragraph might be improved by:
- mention the hex size of the code point space (0x10FFFF or 0x10FFFD if you prefer) next to or instead of the weird decimal number.
Or perhaps 17⨉2¹⁶.
Maybe I'm too close to the Unicode stuff, but the magic number for me is usually 0x10FFFF. Most technical folks understand the hex and appreciate its size. But I think this is an editorial suggestion at best.
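For readers following the arithmetic, here is a minimal sketch relating the hex and "17 planes" forms of the code space size. The variable names are illustrative only, and the draft's own decimal figure is not quoted in this thread, so no claim is made about which of these values it uses.

```python
# Illustrative names; none of these come from the I-D itself.
PLANES = 17                 # Unicode planes 0 through 16
PLANE_SIZE = 2 ** 16        # code points per plane

code_space_size = PLANES * PLANE_SIZE      # number of code points
max_code_point = code_space_size - 1       # highest code point

# Surrogates (U+D800..U+DFFF) are code points but not scalar values.
SURROGATE_COUNT = 0xDFFF - 0xD800 + 1
scalar_value_count = code_space_size - SURROGATE_COUNT

print(code_space_size)        # 1114112
print(hex(max_code_point))    # 0x10ffff
print(scalar_value_count)     # 1112064
```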
- the phrase "It is difficult to specify that unassigned code points should be avoided" understates the problem. We explicitly do not want to forbid unassigned code points that later do become assigned.
Perhaps “Since unassigned code points regularly become assigned when new characters are added to Unicode, it is usually not a good practice to specify that unassigned code points should be avoided”?
[Generally the opinions I’m expressing here are weakly helpful, feel free to disagree/improve or suggest doing nothing.]
I think the stronger message is important. Tying oneself to a specific Unicode version is bad. Forbidding unassigned code points has no upside and, in recent years, Unicode has more-or-less stabilized the problem-causing character classes. So I'd make your sentence stronger. Perhaps:
> "Since unassigned code points regularly become assigned when new characters are added to Unicode, you SHOULD avoid specifying that all (or specific sets of) unassigned code points be avoided."
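To illustrate why an "is it assigned?" check is tied to a Unicode version, here is a rough sketch using Python's standard `unicodedata` module. The helper names are hypothetical, and the general category `Cn` covers noncharacters as well as reserved (unassigned) code points, so this only approximates "unassigned". The answer depends on the Unicode version bundled with the running interpreter, which is the version-coupling problem described above.

```python
import unicodedata

def is_unassigned(cp: int) -> bool:
    """True if the code point has general category Cn in this
    interpreter's Unicode data (reserved code points and noncharacters)."""
    return unicodedata.category(chr(cp)) == "Cn"

def count_unassigned() -> int:
    """Count Cn code points across the whole code space."""
    return sum(1 for cp in range(0x110000) if is_unassigned(cp))

if __name__ == "__main__":
    # Both values change as the bundled Unicode version changes.
    print("Unicode version:", unicodedata.unidata_version)
    print("Cn code points:", count_unassigned())
```

A validator built on a check like this would reject text containing characters added in a later Unicode version until its data tables were updated.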
#1981: "Transformation Formats" might be clearer as "character encoding"?
> Unicode describes a variety of "transformation formats", ways to
> marshal code points into byte sequences.
Section 2.1 is labelled "Transformation Formats" and uses that term instead of the more familiar "character encoding" or "character encoding form". It is the case that "UTF" stands for "Unicode Transformation Format" and is part of the name of Unicode's character encodings, but that seems like a good footnote rather than something to be used in general.
I have the impression that “transformation formats” is the standard idiomatic Unicode terminology, would prefer to stay with that unless someone else wants to jump in here.
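Whichever term is used, the underlying idea is marshalling one code point into different byte sequences. A small sketch using Python's built-in codecs follows; the function name is made up for illustration, and big-endian variants are chosen only so that no byte order mark appears in the output.

```python
def encode_forms(ch: str) -> dict:
    """Return the byte sequence for one character in three UTFs, as hex."""
    return {
        name: ch.encode(name).hex(" ")
        for name in ("utf-8", "utf-16-be", "utf-32-be")
    }

# U+1F600 lies outside the BMP, so UTF-16 needs a surrogate pair.
for name, hex_bytes in encode_forms("\U0001F600").items():
    print(f"{name:10} {hex_bytes}")
# utf-8      f0 9f 98 80
# utf-16-be  d8 3d de 00
# utf-32-be  00 01 f6 00
```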
It _is_ a standard term (you can find it in the Unicode
Glossary), but, excepting talking quite specifically about UTF-8
or UTF-16 (when explaining what "UTF" stands for), I've never
heard it used by my peers to refer to encodings in general or even
to the Unicode-specific encodings. Your text is quite accurate to
say "Unicode describes a variety of 'transformation formats'..."
the way that is quoted above, since this refers explicitly to the
UTFs. But "character encoding form" (or just "character encoding")
is so common that I think you'd want to use that as the jargon and
the heading. Note that there exist specs like
encoding.spec.whatwg.org or the discussion in the ICU user guide
(https://unicode-org.github.io/icu/userguide/conversion/) where
"transformation format" is never mentioned. YMMV.
Agreed. These pretty much should be avoided. The smell is... odoriferous. But I called them out to avoid someone citing them as an oversight later ;-)
Also, the poorly supported U+2028/2029 line endings aren't mentioned.
I think that’d be a distraction in the context of this document. There’s lots of smelly stuff in there.
I guess it depends how much additional information you want to put here. In the thread with Rob, I think it was, I suggested some alternatives if you wanted to mention alternate substitution regimes. All legacy encodings use another character (since they don't have U+FFFD). But your solution works.
> replacing problematic code points, ideally with "�" (U+FFFD,
> REPLACEMENT CHARACTER), although some popular software platforms,
> notably Java, use "?".
This is probably incorrect.
Discussion so far makes me want to simply end this sentence at the comma before “although”.
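As a concrete illustration of the two substitution regimes discussed in this thread, here is a small sketch of how Python's built-in codecs behave. It says nothing about Java or any other platform. Decoding into Unicode can substitute U+FFFD, while encoding into a legacy charset that has no U+FFFD falls back to "?".

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER

# Decoding: the invalid byte 0xFF becomes U+FFFD in the resulting string.
decoded = b"caf\xff".decode("utf-8", errors="replace")

# Encoding: ASCII cannot represent U+00E9 (or U+FFFD), so "?" is used.
encoded = "caf\u00e9".encode("ascii", errors="replace")

print(decoded == "caf" + REPLACEMENT)  # True
print(encoded)                         # b'caf?'
```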
Sounds good.
#1984: Security consideration statement perhaps too bold?
> Note that the Unicode-character subsets specified in this document
> include a successively-decreasing number of problematic code points,
> and thus should be less and less susceptible to vulnerabilities. The
> Section 4.3 subset, "Unicode Assignables", excludes all of them.
Saying that the Section 4.3 subset excludes "all of them" suggests that no exploits remain. The preceding paragraph mentions that RFC8264's security considerations apply here also, and that document is somewhat thorough. Since homographs cannot be eliminated, maybe this should say something slightly different? Perhaps:
> Note that the Unicode-character subsets specified in this document
> successively exclude an increasing number of problematic code points,
> and thus should be less and less susceptible to many of these exploits.
> The Section 4.3 subset, "Unicode Assignables", excludes all of the
> functionally problematic code points.
OK, but I’d probably say “these” instead of “the functionally”.
I should mention, however, that UTS#55 probably should be mentioned/considered. "Trojan Source" attacks using bidi formatting characters can affect protocol text and document formats. This is probably a gap that needs mentioning. Mentioning homographs and confusables is probably worth a couple of words?
Agreed on mentioning UTS#55. But I really don’t want to start going down the slippery slope of all the deceive-the-eye attack flavors in this doc. The referenced docs (including #55) say the right things at the appropriate length.
I think that's right.
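For context on the bidi formatting characters mentioned above, here is a rough detection sketch. It is not the procedure from UTS#55. It only flags the explicit embedding, override, and isolate controls (U+202A..U+202E and U+2066..U+2069), and the function and constant names are hypothetical.

```python
import unicodedata

# Explicit bidi formatting controls: LRE, RLE, PDF, LRO, RLO,
# then LRI, RLI, FSI, PDI.
BIDI_CONTROLS = frozenset(
    list(range(0x202A, 0x202F)) + list(range(0x2066, 0x206A))
)

def find_bidi_controls(text: str) -> list:
    """Return (index, character name) for each bidi control in text."""
    return [
        (i, unicodedata.name(ch))
        for i, ch in enumerate(text)
        if ord(ch) in BIDI_CONTROLS
    ]

print(find_bidi_controls("a\u202eb"))  # [(1, 'RIGHT-TO-LEFT OVERRIDE')]
print(find_bidi_controls("abc"))       # []
```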
Best regards,
Addison
-- Addison Phillips Chair (W3C Internationalization WG) Internationalization is not a feature. It is an architecture.