[Last-Call] Re: Last Call: <draft-bray-unichars-10.txt> (Unicode Character Repertoire Subsets): W3C I18N Review

Tim Bray <tbray@xxxxxxxxxxxxxx> · Tue, 11 Feb 2025 14:01:12 -0800

On Feb 10, 2025 at 1:08:49 PM, Addison Phillips <addisoni18n@xxxxxxxxx> wrote:

    All,
    The W3C Internationalization Working Group (of which I am chair)
      was requested to review several IETF documents nearing or in IETF
      Last Call.
    I have some concerns about the purpose of this I-D. There are a
      lot of documents in various standards bodies trying to address
      similar issues. I think harmonization of these types of documents
      is strongly desirable.
I consulted with Addison and he pointed me to a couple of those documents, which transitively turned up more.  All of these are generally consistent with the Unichars approach, with broad agreement on what should be excluded.  There are examples of excluding \n, \r, \t, which Unichars doesn’t, but those recommendations are specific to use in Identifiers.

I don’t really feel any need for harmonization, but others may disagree upon looking at the source data.  I do note that none of these documents offers what Unichars does: a succinct discussion of which code points are considered “problematic” and why.  I also think that none would work as well as Unichars as a citation target in the IETF context.

Details below.

The 2005 W3C Charmod https://www.w3.org/TR/charmod/ says
==============
C070 [S]  Specifications should not arbitrarily exclude code points from the full range of Unicode code points from U+0000 to U+10FFFF inclusive.

C077 [S]  Specifications must not allow code points above U+10FFFF.

Unicode contains some code points for internal use (such as noncharacters) or special functions (such as surrogate code points).

C079 [S] Specifications should not allow the use of codepoints reserved by Unicode for internal use.

C078 [S]  Specifications must not allow the use of surrogate code points.
===============

The 2021 W3C Character Model for the World Wide Web: String Matching https://www.w3.org/TR/charmod-norm/ says
===============
Specifications SHOULD NOT allow surrogate code points (U+D800 to U+DFFF) or non-character code points in identifiers.

Specifications SHOULD NOT allow the C0 (U+0000 to U+001F) and C1 (U+0080 to U+009F) control characters in identifiers.
===============
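
For anyone who wants to see what those two identifier rules amount to in practice, here is a minimal Python sketch. The function names are hypothetical and only illustrate the ranges quoted above plus the standard Unicode noncharacter set; this is not text from any of the cited documents. Note that it flags \t, \n and \r, since they are C0 controls, which is the identifier-specific difference from Unichars mentioned earlier.

```python
def is_surrogate(cp: int) -> bool:
    # Surrogate code points, U+D800 to U+DFFF.
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp: int) -> bool:
    # U+FDD0..U+FDEF, plus the last two code points of every plane
    # (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, ... U+10FFFE, U+10FFFF).
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_c0_or_c1_control(cp: int) -> bool:
    # C0 is U+0000..U+001F and C1 is U+0080..U+009F, as quoted above.
    return cp <= 0x1F or 0x80 <= cp <= 0x9F


def identifier_problems(identifier: str) -> list[tuple[int, str]]:
    """Return (index, reason) for each code point the quoted rules advise against."""
    problems = []
    for index, ch in enumerate(identifier):
        cp = ord(ch)
        if is_surrogate(cp):
            problems.append((index, "surrogate"))
        elif is_noncharacter(cp):
            problems.append((index, "noncharacter"))
        elif is_c0_or_c1_control(cp):
            problems.append((index, "control"))
    return problems
```

An empty result means the string contains none of the code points those two recommendations discourage; it says nothing about whether the string is otherwise a sensible identifier.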

In the Unicode Consortium’s UNICODE IDENTIFIER AND PATTERN SYNTAX (UAX #31) https://www.unicode.org/reports/tr31/tr31-33.html

Section 3, Immutable Identifiers, https://www.unicode.org/reports/tr31/tr31-33.html#Immutable_Identifier_Syntax discusses this in some depth, offering the subset that Unichars calls “XML Characters” as a reasonable example of subsetting.  I reproduce some of the text:
===============
UAX31-R2. Immutable Identifiers: To meet this requirement, an implementation shall define identifiers to be any non-empty string of characters that contains no character having any of the following property values:

Pattern_White_Space=True
Pattern_Syntax=True
General_Category=Private_Use, Surrogate, or Control
Noncharacter_Code_Point=True
Alternatively, it shall declare that it uses a profile and define that profile with a precise specification of the characters that are added to or removed from the sets of code points defined by these properties.

In its profile, a specification can define identifiers to be more in accordance with the Unicode identifier definitions at the time the profile is adopted, while still allowing for strict immutability. 
================
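
As a rough illustration of the property-based exclusion in UAX31-R2, here is a partial Python sketch. It is deliberately incomplete: Python's standard `unicodedata` module does not expose Pattern_Syntax, so that property is NOT checked here, and the Pattern_White_Space set is hard-coded on the assumption that it is the usual eleven code points. The function name is hypothetical, and `unicodedata` reflects whatever Unicode version the Python build ships with.

```python
import unicodedata

# Assumption: Pattern_White_Space is this fixed set of eleven code points.
PATTERN_WHITE_SPACE = frozenset(
    [*range(0x0009, 0x000E), 0x0020, 0x0085, 0x200E, 0x200F, 0x2028, 0x2029]
)

# General_Category values Private_Use, Surrogate, and Control.
EXCLUDED_CATEGORIES = frozenset({"Co", "Cs", "Cc"})


def is_noncharacter(cp: int) -> bool:
    # Noncharacter_Code_Point=True: U+FDD0..U+FDEF and U+nFFFE/U+nFFFF.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def passes_partial_r2_check(candidate: str) -> bool:
    """Partial UAX31-R2 check; Pattern_Syntax is intentionally not covered."""
    if not candidate:
        return False  # R2 requires a non-empty string
    for ch in candidate:
        cp = ord(ch)
        if cp in PATTERN_WHITE_SPACE:
            return False
        if unicodedata.category(ch) in EXCLUDED_CATEGORIES:
            return False
        if is_noncharacter(cp):
            return False
    return True
```

Because Pattern_Syntax is missing, a string such as "a+b" slips through this sketch even though R2 itself would reject it; a real implementation would need the property data from the Unicode Character Database.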

The October 2024 W3C Internationalization Best Practices for Spec Developers https://www.w3.org/TR/international-specs/ says
================
Specifications SHOULD NOT arbitrarily exclude code points from the full range of Unicode code points from U+0000 to U+10FFFF inclusive.

Specifications MUST NOT allow code points above U+10FFFF.

Specifications SHOULD NOT allow the use of codepoints reserved by Unicode for internal use.

Specifications MUST NOT allow the use of unpaired surrogate code points.

Specifications SHOULD exclude compatibility characters in the syntactic elements (markup, delimiters, identifiers) of the formats they define.

Specifications SHOULD allow the full range of Unicode for user-defined values.
=================
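
The “unpaired surrogate” wording in this last document is the one place where the check operates on UTF-16 code units rather than on individual code points, so a small sketch may help. It assumes the input is a sequence of 16-bit code unit values; the function name is hypothetical.

```python
from typing import Sequence


def has_unpaired_surrogate(code_units: Sequence[int]) -> bool:
    """Return True if the UTF-16 code unit sequence contains a lone surrogate."""
    i = 0
    n = len(code_units)
    while i < n:
        unit = code_units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed immediately by a low surrogate.
            if i + 1 < n and 0xDC00 <= code_units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return True
        if 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate that was not consumed as part of a pair.
            return True
        i += 1
    return False
```

A well-formed pair such as [0xD83D, 0xDE00] is accepted, while the same two units in reverse order, or either one alone, is reported as unpaired.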
