[Last-Call] Last Call: <draft-bray-unichars-10.txt> (Unicode Character Repertoire Subsets): W3C I18N Review

Addison Phillips <addisoni18n@xxxxxxxxx> · Mon, 10 Feb 2025 13:08:49 -0800

    All,
    The W3C Internationalization Working Group (of which I am chair)
      was requested to review several IETF documents nearing or in IETF
      Last Call.
    This email represents the issues our working group noticed in our
      review of:
    https://datatracker.ietf.org/doc/draft-bray-unichars/
    I have some concerns about the purpose of this I-D. There are a
      lot of documents in various standards bodies trying to address
      similar issues. I think harmonization of these types of documents
      is strongly desirable.
    ---

    The specific issues our group identified are tracked in github
      here:
    https://github.com/w3c/i18n-activity/issues?q=is%3Aissue%20state%3Aopen%20label%3As%3Aunichars
    ---

    Here are the comments:
    #1980: Quibbles about characters and code points
    https://datatracker.ietf.org/doc/draft-bray-unichars/

      There are 1,114,112 code points; as of Unicode 15.1 (2023), fewer
        than 150,000 have been assigned to characters. It is difficult to
        specify that unassigned code points should be avoided because they
        regularly become assigned when new characters are added to Unicode.

    Section 2 of the I-D provides a description of
      characters and code points for local use in the document. The
      above quoted paragraph might be improved by:

      Mention the hex upper bound of the code point space (0x10FFFF, or
        0x10FFFD if you prefer) next to or instead of the weird decimal
        number.
      The phrase "It is difficult to specify that unassigned code
        points should be avoided" understates the problem. We explicitly
        do not want to forbid unassigned code points that later
        become assigned.
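
    For illustration only, here is a minimal Python sketch of both
      points. It is not taken from the I-D or from the tracked issue,
      and the names CODE_SPACE_SIZE and is_unassigned are hypothetical.
      It assumes the Unicode database bundled with the Python runtime,
      so the answer for a given code point can change between runtime
      versions, which is exactly why forbidding unassigned code points
      is fragile.

```python
import unicodedata

# The code space runs from U+0000 through U+10FFFF inclusive,
# so its size is one more than the largest code point.
CODE_SPACE_SIZE = 0x10FFFF + 1


def is_unassigned(cp: int) -> bool:
    """Return True if this runtime's Unicode database has no assignment for cp.

    General category 'Cn' covers both unassigned code points and
    noncharacters; the result depends on unicodedata.unidata_version.
    """
    return unicodedata.category(chr(cp)) == "Cn"


print(hex(CODE_SPACE_SIZE), CODE_SPACE_SIZE)
print("Unicode data version:", unicodedata.unidata_version)
print("U+0041 unassigned?", is_unassigned(0x0041))
```

    A code point that reports as unassigned under one Unicode version
      may report as assigned under a later one, so a check like this
      is better suited to diagnostics than to rejecting input.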

    ---
    #1981: "Transformation Formats" might be clearer as "character
      encoding"?

      Unicode describes a variety of "transformation formats", ways to
        marshal code points into byte sequences. A survey of transformation
        formats is beyond the scope of this document. However, it is useful
        to note that the "UTF-16" format represents each code point with one
        or two 16-bit chunks, and the "UTF-8" format uses variable-length
        byte sequences.

    Section 2.1 is labelled "Transformation Formats" and
      uses that term instead of the more familiar "character encoding"
      or "character encoding form". It is the case that "UTF" stands for
      "Unicode Transformation Format" and is part of the name of
      Unicode's character encodings, but that seems like a good footnote
      rather than something to be used in general.
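
    As a small illustration of the quoted paragraph (a sketch, not
      text from the I-D; the helper name code_units is hypothetical),
      the following Python counts how many UTF-8 bytes and how many
      UTF-16 16-bit units a single code point needs:

```python
def code_units(ch: str) -> tuple:
    """Return (UTF-8 byte count, UTF-16 16-bit unit count) for one code point."""
    utf8_bytes = len(ch.encode("utf-8"))
    # "utf-16-be" avoids the byte order mark that plain "utf-16" prepends.
    utf16_units = len(ch.encode("utf-16-be")) // 2
    return utf8_bytes, utf16_units


for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    print(f"U+{ord(ch):04X}", code_units(ch))
```

    The first three sample code points fit in one 16-bit unit; the
      last, U+1F600, lies outside the Basic Multilingual Plane and
      takes two units in UTF-16 and four bytes in UTF-8.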
    ---
    #1982: C1 controls, Unicode line endings
    Section 2.2.2 introduces control codes and talks
      specifically about the C0 controls. The C1 controls are mentioned
      en passant in section 3:

      The value of the "example" field contains the C0 control NUL, the C1
        control "CHARACTER TABULATION WITH JUSTIFICATION", an unpaired...

    ... but not elsewhere. Possibly the C1 controls should
      be dealt with in 2.2.2?
    Also, the poorly supported U+2028/2029 line endings
      aren't mentioned.
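
    To make the concern concrete, here is a hedged Python sketch
      (my own illustration, not from the I-D; is_c1_control is a
      hypothetical helper). It shows that U+0089 falls in the C1 range
      and that U+2028/U+2029 are treated as line boundaries by one
      common API (str.splitlines) but not by a plain split on "\n":

```python
import unicodedata

C1_TAB_JUSTIFY = "\u0089"  # the C1 control cited in section 3 of the I-D
LINE_SEP = "\u2028"        # LINE SEPARATOR
PARA_SEP = "\u2029"        # PARAGRAPH SEPARATOR


def is_c1_control(ch: str) -> bool:
    """Return True if ch is in the C1 control range U+0080..U+009F."""
    return 0x80 <= ord(ch) <= 0x9F


text = "one" + LINE_SEP + "two" + PARA_SEP + "three"

print(is_c1_control(C1_TAB_JUSTIFY), unicodedata.category(C1_TAB_JUSTIFY))
print(unicodedata.category(LINE_SEP), unicodedata.category(PARA_SEP))
print(text.splitlines())       # splits on U+2028 and U+2029
print(len(text.split("\n")))   # a split on "\n" alone sees one line
```

    Two APIs in the same language disagreeing about where lines end
      is the kind of inconsistent support I have in mind.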
    ---
    #1983: Replacement character examples

      replacing problematic code points, ideally with "�" (U+FFFD,
        REPLACEMENT CHARACTER), although some popular software platforms,
        notably Java, use "?".

    This is probably incorrect. Java replaces with U+FFFD
      in most Unicode processing (including decoding from legacy
      encodings). (Encoding to legacy encodings in Java uses
      "?".) There are other places, such as certain browsers, where "?"
      is used in a Unicode context.
    Note that there exist common coders that use the
      control character U+001A (SUB) as a replacement character for some
      legacy encodings.
    ---
    #1984: Security consideration statement perhaps too
      bold?

      Note that the Unicode-character subsets specified in this document
        include a successively-decreasing number of problematic code points,
        and thus should be less and less susceptible to vulnerabilities. The
        Section 4.3 subset, "Unicode Assignables", excludes all of them.

    Saying that the Section 4.3 subset excludes "all of
      them" suggests that no exploits remain. The preceding paragraph
      mentions that RFC 8264's security considerations apply here also,
      and that document is somewhat thorough. Since homographs cannot be
      eliminated, maybe this should say something slightly different?
      Perhaps:

      Note that the Unicode-character subsets specified in this document
        successively exclude an increasing number of problematic code points,
        and thus should be less and less susceptible to many of these
        exploits.

        The Section 4.3 subset, "Unicode Assignables", excludes all of the
        functionally problematic code points.

    I should add, however, that UTS #55 probably should
      be mentioned or considered. "Trojan Source" attacks using bidi
      formatting characters can affect protocol text and document
      formats. This is probably a gap that needs mentioning. Mentioning
      homographs and confusables is probably worth a couple of words?
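
    To show why these remain possible even within a restricted
      subset, here is a rough Python sketch. It is my own illustration
      and not an implementation of UTS #55; the names BIDI_FORMATTING
      and find_bidi_controls are hypothetical. It flags the bidi
      embedding, override, and isolate controls by their Bidi_Class,
      and shows a Latin/Cyrillic homograph pair that normalization
      does not unify:

```python
import unicodedata

# Bidi_Class values of the explicit embedding, override, and isolate
# formatting characters (U+202A..U+202E and U+2066..U+2069).
BIDI_FORMATTING = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}


def find_bidi_controls(text: str) -> list:
    """Return (index, 'U+XXXX') pairs for bidi formatting characters in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.bidirectional(ch) in BIDI_FORMATTING
    ]


sample = "role = 'user\u202e \u2066// admin\u2069'"
print(find_bidi_controls(sample))

# Homograph: Latin "a" and Cyrillic "а" look alike, but they are distinct
# assigned characters, and NFKC normalization leaves both unchanged.
latin, cyrillic = "a", "\u0430"
print(latin == cyrillic, unicodedata.normalize("NFKC", cyrillic) == latin)
```

    All of the characters involved are ordinary assigned code points,
      so a subset defined by excluding problematic code points would
      not, on its own, remove either class of issue.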
    === end of comments ===
    I'll note that this is the first W3C I18N review of an
      IETF last call (in recent memory). We track our issues in github,
      which is heavily tailored to the W3C Process and tools. I am aware
      that this is incompatible with IETF's processes and tooling. I
      apologize in advance for any inconvenience that my providing
      comments might cause and invite feedback on how we can do better.
    Also, for visibility, I have blindcopied (to avoid
      cross-posting issues) this message to our public list
      (https://lists.w3.org/Archives/Public/public-i18n-core/)

    Regards (for W3C I18N),
    Addison

    -- 
Addison Phillips
Chair (W3C Internationalization WG)

Internationalization is not a feature.
It is an architecture.
