[Last-Call] Re: Last Call: <draft-bray-unichars-10.txt> (Unicode Character Repertoire Subsets): W3C I18N Review

Addison Phillips <addisoni18n@xxxxxxxxx> · Fri, 14 Feb 2025 16:40:49 -0800

    Thanks Tim.

       Thanks for that, Addison, and to the others whose
        input you are forwarding, assuming that’s not all you.  A couple
        of the points have branched off into separate threads but I’ll
        consolidate other reactions here.

    In this case, I was also the reviewer. However, all comments from
    the WG go through the working group, and its members made suggestions.

              Section 2 of the I-D provides a description
                of characters and code points for local use in the
                document. The above quoted paragraph might be improved
                by:

                mention the hex size of the code point space
                  (0x10FFFF or 0x10FFFD if you prefer) next to or
                  instead of the weird decimal number.

        Or perhaps 17⨉2¹⁶.

    Maybe I'm too close to the Unicode stuff, but the magic number
      for me is usually 0x10FFFF. Most technical folks understand the
      hex and appreciate its size. But I think this is an editorial
      suggestion at best.
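
For reference, the three ways of writing the size of the code point space discussed above agree with each other. A minimal sketch in Python (the variable names are illustrative only, not taken from the I-D):

```python
import sys

# The Unicode code point space is 17 planes of 2**16 code points each.
PLANES = 17
CODE_POINTS_PER_PLANE = 2 ** 16

total = PLANES * CODE_POINTS_PER_PLANE   # the "weird decimal number"
max_code_point = total - 1               # the familiar hex "magic number"

print(total)                 # 1114112
print(hex(total))            # 0x110000
print(hex(max_code_point))   # 0x10ffff
print(max_code_point == sys.maxunicode)  # True
```

Note that 0x10FFFF is the highest code point, so the count of code points is one more than that (0x110000).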

                the phrase "It is difficult to specify that
                  unassigned code points should be avoided" understates
                  the problem. We explicitly do not want to forbid
                  unassigned code points that later do become assigned.

        Perhaps “Since unassigned
          code points regularly become assigned when new
          characters are added to Unicode, it is usually not a good
          practice to specify that unassigned code points should be
          avoided”?

        [Generally the opinions I’m
          expressing here are weakly helpful, feel free to
          disagree/improve or suggest doing nothing.]

    I think the stronger message is important. Tying oneself to a
      specific Unicode version is bad. Forbidding unassigned code points
      has no upside and, in recent years, Unicode has more-or-less
      stabilized the problem-causing character classes. So I'd make your
      sentence stronger. Perhaps:
    > "Since unassigned code points regularly become
      assigned when new characters are added to Unicode, you SHOULD
      avoid specifying that all (or specific sets of) unassigned code
      points be avoided."
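
To illustrate why an "unassigned" check ties an implementation to a Unicode version, here is a rough sketch using Python's standard `unicodedata` module. The helper names `is_noncharacter` and `is_unassigned` are hypothetical and not from the I-D; the answer for any given reserved code point depends on the Unicode data bundled with the running interpreter.

```python
import unicodedata

def is_noncharacter(cp: int) -> bool:
    # The 66 noncharacters: U+FDD0..U+FDEF plus the last two
    # code points of each of the 17 planes (U+xxFFFE, U+xxFFFF).
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def is_unassigned(cp: int) -> bool:
    # General category "Cn" covers noncharacters and reserved code points.
    # Only the reserved ones can become assigned in a later Unicode version,
    # so this result is only valid for the Unicode version reported below.
    return unicodedata.category(chr(cp)) == "Cn" and not is_noncharacter(cp)

# The Unicode version this interpreter's answers are tied to.
print(unicodedata.unidata_version)
print(is_unassigned(0x41))    # False: LATIN CAPITAL LETTER A is assigned
print(is_unassigned(0xFFFE))  # False: a noncharacter, never assignable
```

A filter built on `is_unassigned` would start rejecting valid text as soon as a peer upgrades to a newer Unicode version, which is the problem described above.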

          #1981: "Transformation Formats" might be clearer as
            "character encoding"?

            Unicode describes a variety of "transformation
              formats", ways to

              marshal code points into byte sequences.

              Section 2.1 is labelled "Transformation
                Formats" and uses that term instead of the more familiar
                "character encoding" or "character encoding form". It is
                the case that "UTF" stands for "Unicode Transformation
                Format" and is part of the name of Unicode's character
                encodings, but that seems like a good footnote rather
                than something to be used in general.

          I have the impression that
            “transformation formats” is the standard idiomatic Unicode
            terminology, would prefer to stay with that unless someone
            else wants to jump in here.

    It _is_ a standard term (you can find it in the Unicode
      Glossary), but, excepting talking quite specifically about UTF-8
      or UTF-16 (when explaining what "UTF" stands for), I've never
      heard it used by my peers to refer to encodings in general or even
      to the Unicode-specific encodings. Your text is quite accurate to
      say "Unicode describes a variety of 'transformation formats'..."
      the way that is quoted above, since this refers explicitly to the
      UTFs. But "character encoding form" (or just "character encoding")
      is so common that I think you'd want to use that as the jargon and
      the heading. Note that there exist specs like
      encoding.spec.whatwg.org or the discussion in the ICU user guide
      (https://unicode-org.github.io/icu/userguide/conversion/) where
      "transformation format" is never mentioned. YMMV.
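
Whichever term is used, the underlying operation is the same: one code point is marshalled into different byte sequences. A small Python sketch (the choice of U+1F600 is just an example of a code point outside the Basic Multilingual Plane):

```python
ch = "\U0001F600"  # U+1F600, a single code point

# Encode the same code point with three Unicode encoding forms.
# The "-be" variants are used so that no byte order mark is emitted.
encoded = {name: ch.encode(name) for name in ("utf-8", "utf-16-be", "utf-32-be")}

for name, data in encoded.items():
    print(f"{name:10} {data.hex(' ')}")
# utf-8      f0 9f 98 80
# utf-16-be  d8 3d de 00
# utf-32-be  00 01 f6 00
```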

              Also, the poorly supported U+2028/2029 line
                endings aren't mentioned.

        I think that’d be a
          distraction in the context of this document. There’s lots of
          smelly stuff in there.

    Agreed. These pretty much should be avoided. The smell is...
    odoriferous. But I called them out to avoid someone citing them as
    an oversight later ;-)
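
As one illustration of the uneven support, assuming Python 3: `str.splitlines()` treats U+2028 and U+2029 as line boundaries, while splitting on "\n" does not. Other platforms make different choices.

```python
text = "a\u2028b\u2029c"  # LINE SEPARATOR and PARAGRAPH SEPARATOR, no "\n"

by_splitlines = text.splitlines()  # recognizes U+2028/U+2029 as line breaks
by_newline = text.split("\n")      # sees a single line

print(by_splitlines)    # ['a', 'b', 'c']
print(len(by_newline))  # 1
```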

                replacing problematic code points, ideally
                  with "�" (U+FFFD,

                  REPLACEMENT CHARACTER), although some popular software
                  platforms,

                  notably Java, use "?".

              This is probably incorrect.

          Discussion so far makes me
            want to simply end this sentence at the comma before
            “although”. 

    I guess it depends how much additional information you want to put
    here. In the thread with Rob, I think it was, I suggested some
    alternatives if you wanted to mention alternate substitution
    regimes. All legacy encodings use another character (since they
    don't have U+FFFD). But your solution works.
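
A short Python sketch of the two substitution regimes mentioned above. It shows Python's built-in "replace" error handler only and makes no claim about what Java or any other platform does:

```python
# Decoding malformed UTF-8: the bad byte is replaced with U+FFFD.
decoded = b"caf\xff".decode("utf-8", errors="replace")

# Encoding to a legacy encoding that has no U+FFFD: "?" is substituted.
encoded = "caf\u00e9".encode("ascii", errors="replace")

print(decoded == "caf\ufffd")  # True
print(encoded)                 # b'caf?'
```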

              #1984: Security consideration statement
                perhaps too bold?

                Note that the Unicode-character subsets
                  specified in this document

                  include a successively-decreasing number of
                  problematic code points,

                  and thus should be less and less susceptible to
                  vulnerabilities. The

                  Section 4.3 subset, "Unicode Assignables", excludes
                  all of them.

              Saying that the Section 4.3 subset excludes
                "all of them" suggests that no exploits remain. The
                preceding paragraph mentions that RFC8264's security
                considerations apply here also, and that document is
                somewhat thorough. Since homographs
                cannot be eliminated, maybe this should say something
                slightly different? Perhaps:

                Note that the Unicode-character subsets
                  specified in this document

                  successively exclude an increasing number of
                  problematic code points,

                  and thus should be less and less susceptible to many
                  of these exploits.

                  The Section 4.3 subset, "Unicode Assignables",
                  excludes all of the

                  functionally problematic code points.

        OK, but I’d probably say
          “these” instead of “the functionally”.

    Sounds good.

              I should mention, however, that UTS#55
                probably should be mentioned/considered. "Trojan Source"
                attacks using bidi formatting characters can affect
                protocol text and document formats. This is probably a
                gap that needs mentioning. Mentioning homographs and
                confusables is probably worth a couple of words?

        Agreed on mentioning UTS#55.
          But I really don’t want to start going down the slippery slope
          of all the deceive-the-eye attack flavors in this doc. The
          referenced docs (including #55) say the right things at the
          appropriate length.

    I think that's right.
    Best regards,
    Addison

    -- 
Addison Phillips
Chair (W3C Internationalization WG)

Internationalization is not a feature.
It is an architecture.
