[Last-Call] Re: Last Call: <draft-bray-unichars-10.txt> (Unicode Character Repertoire Subsets): W3C I18N Review

Addison Phillips <addisoni18n@xxxxxxxxx> · Tue, 11 Feb 2025 16:24:48 -0800

    Hi Tim,
    Always good to catch up... it's been a minute.

      > I do note that none of these documents offer the succinct
        discussion of why some code points are considered “problematic”
        and what they are that Unichars does, and I think that none
        would work as well as citation targets in the IETF context as
        Unichars.

    Most of the documents I'm referring to in this space deal with
      parochial concerns, usually namespacing and identifier handling
      in a specific format or protocol, in which the restrictions on
      the repertoire are focused on local problems (although usually
      also concerned with common Unicode quirks), or they (like W3C's
      string-meta, charmod and specdev docs) are focused on helping
      people address those parochial concerns.
    Unichars has a broader ambit, and thus could be a very useful
      addition. I think what you're working on would actually make a
      good Unicode Technical Report, although I recognize that IETF
      specs have specific needs and also that integrating with PRECIS is
      desirable. 

    I still think that harmonization is a good idea. Please note that
      I am not saying "you're wrong and should conform to XXX". I am
      saying that (among other things) other I18N-interested groups
      should ensure that we're all saying the same things. W3C-I18N
      should review our own guidelines to ensure consistency. I suspect
      WHATWG Infra and various UTRs could also absorb some lessons here.
      This way we might avoid variations of Not Invented Here, in which
      specification authors (specification in IETF would mean
      "Internet-Draft") cite the standard they like most as a reason not
      to pay attention to valid technical arguments raised by others or
      in which there are subtle tripping hazards between (say) what some
      W3C format says is valid and what some IETF protocol does. 

    An example of this recently was W3C TAG's "design principles",
      which recommended that, when in doubt, use DOMString (UTF-16 code
      unit strings), while W3C I18N recommended that, when in doubt, use
      Unicode code point strings. (In fact, when one read the technical
      details, both were making identical recommendations... but this
      was not obvious to readers.) Both groups are working to fix this
      (apparent) disagreement.
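
The gap between the two string models can be shown in a few lines. This is a minimal Python sketch, not taken from either group's text; the helper name `utf16_code_units` is hypothetical. A Python `str` is a sequence of code points, while a DOMString-style string is a sequence of UTF-16 code units, so the two disagree on length for supplementary characters and on whether a lone surrogate is representable.

```python
def utf16_code_units(s: str) -> int:
    """Count the UTF-16 code units needed for s (hypothetical helper)."""
    # "surrogatepass" lets lone surrogates through so they can be counted too.
    return len(s.encode("utf-16-le", errors="surrogatepass")) // 2


text = "a\U0001F600"           # 'a' followed by U+1F600, a supplementary character
print(len(text))               # number of code points
print(utf16_code_units(text))  # number of UTF-16 code units (U+1F600 needs a pair)

lone = "\ud83d"                # a lone high surrogate: a valid code unit, not a scalar value
try:
    lone.encode("utf-8")
except UnicodeEncodeError:
    print("lone surrogate cannot be encoded as UTF-8")
```

A format defined over code units can carry the lone surrogate; one defined over Unicode scalar values cannot, which is where the tripping hazards between specifications tend to appear.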
    Finally I'll add: I wasn't sure if this I-D was reacting to a
      perceived difficulty with existing standards, such as UAX31,
      UTS39, or UTS55, which, if they have gaps or problems, should be
      rectified there (regardless of the advancement of Unichars).
    Best regards,

    Addison

    On 2/11/2025 2:01 PM, Tim Bray wrote:

      On Feb 10, 2025 at 1:08:49 PM, Addison Phillips
          <addisoni18n@xxxxxxxxx> wrote:

              All,
              The W3C Internationalization Working Group (of which I
                am chair) was requested to review several IETF documents
                nearing or in IETF Last Call.
              I have some concerns about the purpose of this I-D.
                There are a lot of documents in various standards bodies
                trying to address similar issues. I think harmonization
                of these types of documents is strongly desirable.

        I consulted with Addison and
          he pointed me to a couple of those documents, which
          transitively turned up more.  Details below, but all of these
          are generally consistent with the Unichars approach, with
          broad agreement on what should be excluded.  There are
          examples of excluding \n, \r, \t, which Unichars doesn’t, but
          those recommendations are specific to use in Identifiers.

        I don’t really feel any need
          for harmonization, but others may disagree upon looking at the
          source data.  I do note that none of these documents offer the
          succinct discussion of why some code points are considered
          “problematic” and what they are that Unichars does, and I
          think that none would work as well as citation targets in the
          IETF context as Unichars.

        Details below.

        The 2005 W3C Charmod https://www.w3.org/TR/charmod/ says
        ==============

          C070 [S]  Specifications
            should not arbitrarily exclude code points from the full
            range of Unicode code points from U+0000 to U+10FFFF
            inclusive.

          C077 [S]  Specifications
            must not allow code points above U+10FFFF.

          Unicode contains some code
            points for internal use (such as noncharacters) or special
            functions (such as surrogate code points).

          C079 [S] Specifications
            should not allow the use of codepoints reserved by Unicode
            for internal use.

          C078 [S]  Specifications
            must not allow the use of surrogate code points.
          ===============

          The 2021 W3C Character
            Model for the World Wide Web: String Matching https://www.w3.org/TR/charmod-norm/ says
          ===============

            Specifications SHOULD NOT
              allow surrogate code points (U+D800 to U+DFFF) or
              non-character code points in identifiers.

            Specifications SHOULD NOT
              allow the C0 (U+0000 to U+001F) and C1 (U+0080 to U+009F)
              control characters in identifiers.
            ===============
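
The classes named in the quoted recommendations (surrogates, noncharacters, C0 and C1 controls) are all simple code point ranges. The following is a minimal Python sketch of such a check, using the ranges as quoted above; the function names `problematic_kind` and `find_problematic` are hypothetical and do not come from any of the cited documents.

```python
from typing import List, Optional, Tuple


def problematic_kind(cp: int) -> Optional[str]:
    """Return a label if cp falls in one of the quoted classes, else None."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"
    # Noncharacters: U+FDD0..U+FDEF plus the last two code points of each plane.
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return "noncharacter"
    if cp <= 0x1F:
        return "C0 control"
    if 0x80 <= cp <= 0x9F:
        return "C1 control"
    return None


def find_problematic(s: str) -> List[Tuple[int, str]]:
    """List (index, label) for every flagged code point in s."""
    hits = []
    for i, ch in enumerate(s):
        kind = problematic_kind(ord(ch))
        if kind is not None:
            hits.append((i, kind))
    return hits


print(find_problematic("ok\n\ufffe"))
```

Note that this flags \n, \r and \t as C0 controls, which matches the identifier-specific recommendation quoted above rather than the Unichars subsets.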

            In Unicode
              Consortium UNICODE IDENTIFIER AND PATTERN SYNTAX https://www.unicode.org/reports/tr31/tr31-33.html

            Section 3, Immutable
              Identifiers, https://www.unicode.org/reports/tr31/tr31-33.html#Immutable_Identifier_Syntax discusses
              this in some depth, offering the subset that Unichars
              calls “XML Characters” as a reasonable example of
              subsetting.  I reproduce some of the text:

            ===============

              UAX31-R2. Immutable
                Identifiers: To meet this requirement, an implementation
                shall define identifiers to be any non-empty string of
                characters that contains no character having any of the
                following property values:

              Pattern_White_Space=True
              Pattern_Syntax=True
              General_Category=Private_Use,
                Surrogate, or Control
              Noncharacter_Code_Point=True
              Alternatively, it shall
                declare that it uses a profile and define that profile
                with a precise specification of the characters that are
                added to or removed from the sets of code points defined
                by these properties.

              In its profile, a
                specification can define identifiers to be more in
                accordance with the Unicode identifier definitions at
                the time the profile is adopted, while still allowing
                for strict immutability. 
              ================
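
Part of the UAX31-R2 test quoted above can be approximated with Python's standard library. This is a sketch under stated assumptions, not a conforming implementation: the `unicodedata` module exposes General_Category but not Pattern_Syntax or Pattern_White_Space, so the Pattern_White_Space set is hard-coded here and the Pattern_Syntax condition is not checked at all. The name `passes_partial_r2` is hypothetical.

```python
import unicodedata

# Pattern_White_Space, hard-coded because unicodedata does not expose it.
PATTERN_WHITE_SPACE = frozenset(
    [*range(0x09, 0x0E), 0x20, 0x85, 0x200E, 0x200F, 0x2028, 0x2029]
)

# General_Category values for Private_Use, Surrogate and Control.
EXCLUDED_CATEGORIES = frozenset(["Co", "Cs", "Cc"])


def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def passes_partial_r2(s: str) -> bool:
    """Partial UAX31-R2 check. ASSUMPTION: Pattern_Syntax=True is NOT tested."""
    if not s:
        return False  # identifiers must be non-empty
    for ch in s:
        cp = ord(ch)
        if cp in PATTERN_WHITE_SPACE:
            return False
        if unicodedata.category(ch) in EXCLUDED_CATEGORIES:
            return False
        if is_noncharacter(cp):
            return False
    return True


print(passes_partial_r2("naïve_id"))
print(passes_partial_r2("two words"))
```

Because Pattern_Syntax is omitted, a string such as "a+b" passes this sketch even though it would fail the full requirement; a complete check would need the property data from the Unicode Character Database.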

              The October 2024
                W3C Internationalization Best Practices for Spec
                Developers https://www.w3.org/TR/international-specs/ says
              ================
              Specifications SHOULD
                NOT arbitrarily exclude code points from the full range
                of Unicode code points from U+0000 to U+10FFFF
                inclusive.

                Specifications MUST
                  NOT allow code points above U+10FFFF.

                Specifications SHOULD
                  NOT allow the use of codepoints reserved by Unicode
                  for internal use.

                Specifications MUST
                  NOT allow the use of unpaired surrogate code points.

                Specifications SHOULD
                  exclude compatibility characters in the syntactic
                  elements (markup, delimiters, identifiers) of the
                  formats they define.

                Specifications SHOULD
                  allow the full range of Unicode for user-defined
                  values.

                =================

    -- 
Addison Phillips
Chair (W3C Internationalization WG)

Internationalization is not a feature.
It is an architecture.
