[Last-Call] Re: Last Call: <draft-bray-unichars-10.txt> (Unicode Character Repertoire Subsets): W3C I18N Review

Hi Tim,

Always good to catch up... it's been a minute.

> I do note that none of these documents offer the succinct discussion of why some code points are considered “problematic” and what they are that Unichars does, and I think that none would work as well as citation targets in the IETF context as Unichars.

Most of the documents I'm referring to in this space address parochial concerns, usually namespacing and identifier handling in a specific format or protocol, where the restrictions on the repertoire are focused on local problems (although usually also on common Unicode quirks). Others (like W3C's string-meta, charmod and specdev docs) are focused on helping people address those parochial concerns.

Unichars has a broader ambit, and thus could be a very useful addition. I think what you're working on would actually make a good Unicode Technical Report, although I recognize that IETF specs have specific needs and also that integrating with PRECIS is desirable.

I still think that harmonization is a good idea. Please note that I am not saying "you're wrong and should conform to XXX". I am saying that (among other things) other I18N-interested groups should ensure that we're all saying the same things. W3C I18N should review our own guidelines to ensure consistency. I suspect WHATWG Infra and various UTRs could also absorb some lessons here. This way we might avoid variations of Not Invented Here, in which specification authors (in the IETF, "specification" would mean "Internet-Draft") cite the standard they like most as a reason not to pay attention to valid technical arguments raised by others. We might also avoid subtle tripping hazards between (say) what some W3C format says is valid and what some IETF protocol does.

A recent example of this was W3C TAG's "design principles", which recommended using DOMString (UTF-16 code unit strings) when in doubt, while W3C I18N recommended using Unicode code point strings when in doubt. (In fact, when one read the technical details, both were making identical recommendations... but this was not obvious to readers.) Both groups are working to fix this (apparent) disagreement.
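For anyone who wants to see the two string models side by side, here is a minimal Python sketch. It only illustrates the general difference between counting code points and counting UTF-16 code units; it does not reproduce either group's recommendation, and the helper names are made up for this example.

```python
def code_points(s: str) -> int:
    """Number of Unicode code points in s (Python strings are code point sequences)."""
    return len(s)


def utf16_code_units(s: str) -> int:
    """Number of 16-bit code units needed to encode s as UTF-16."""
    # Each UTF-16 code unit is two bytes; "utf-16-le" avoids emitting a BOM.
    return len(s.encode("utf-16-le")) // 2


# 'a' followed by U+1F600, which lies outside the Basic Multilingual Plane
# and therefore needs a surrogate pair in UTF-16.
text = "a\U0001F600"
print(code_points(text))       # 2
print(utf16_code_units(text))  # 3
```

The same text has different lengths and indices in the two models, which is one reason the choice of string type matters to spec authors.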

Finally I'll add: I wasn't sure if this I-D was reacting to a perceived difficulty with existing standards, such as UAX31, UTS39, or UTS55, which, if they have gaps or problems, should be rectified there (regardless of the advancement of Unichars).

Best regards,

Addison

On 2/11/2025 2:01 PM, Tim Bray wrote:
On Feb 10, 2025 at 1:08:49 PM, Addison Phillips <addisoni18n@xxxxxxxxx> wrote:

All,

The W3C Internationalization Working Group (of which I am chair) was requested to review several IETF documents nearing or in IETF Last Call.

I have some concerns about the purpose of this I-D. There are a lot of documents in various standards bodies trying to address similar issues. I think harmonization of these types of documents is strongly desirable.

I consulted with Addison and he pointed me to a couple of those documents, which transitively turned up more.  Details below, but all of these are generally consistent with the Unichars approach, with broad agreement on what should be excluded.  There are examples of excluding \n, \r, \t, which Unichars doesn’t, but those recommendations are specific to use in Identifiers.

I don’t really feel any need for harmonization, but others may disagree upon looking at the source data.  I do note that none of these documents offer the succinct discussion of why some code points are considered “problematic” and what they are that Unichars does, and I think that none would work as well as citation targets in the IETF context as Unichars.

Details below.

The 2005 W3C Charmod https://www.w3.org/TR/charmod/ says
==============
C070 [S]  Specifications should not arbitrarily exclude code points from the full range of Unicode code points from U+0000 to U+10FFFF inclusive.

C077 [S]  Specifications must not allow code points above U+10FFFF.

Unicode contains some code points for internal use (such as noncharacters) or special functions (such as surrogate code points).

C079 [S] Specifications should not allow the use of codepoints reserved by Unicode for internal use.

C078 [S]  Specifications must not allow the use of surrogate code points.
===============

The 2021 W3C Character Model for the World Wide Web: String Matching https://www.w3.org/TR/charmod-norm/ says
===============
Specifications SHOULD NOT allow surrogate code points (U+D800 to U+DFFF) or non-character code points in identifiers.

Specifications SHOULD NOT allow the C0 (U+0000 to U+001F) and C1 (U+0080 to U+009F) control characters in identifiers.
===============
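As a rough illustration of the two identifier rules quoted above, the following Python sketch flags surrogates, noncharacters, and C0/C1 controls using the ranges given in the quote. The function names are hypothetical, and the noncharacter test assumes the usual Unicode definition (U+FDD0 to U+FDEF, plus the last two code points of every plane).

```python
from typing import Optional


def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and for any code point ending in FFFE or FFFF."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def disallowed_in_identifier(cp: int) -> bool:
    """Apply the quoted SHOULD NOT rules to a single code point."""
    return (
        0xD800 <= cp <= 0xDFFF      # surrogate code points
        or is_noncharacter(cp)      # noncharacter code points
        or cp <= 0x001F             # C0 controls, U+0000 to U+001F
        or 0x0080 <= cp <= 0x009F   # C1 controls, U+0080 to U+009F
    )


def first_disallowed(identifier: str) -> Optional[int]:
    """Index of the first offending code point, or None if there is none."""
    for i, ch in enumerate(identifier):
        if disallowed_in_identifier(ord(ch)):
            return i
    return None


print(first_disallowed("name"))    # None
print(first_disallowed("na\tme"))  # 2
```

Note that this check follows the quoted ranges literally, so U+007F (DELETE) is not flagged.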

In Unicode Consortium UNICODE IDENTIFIER AND PATTERN SYNTAX https://www.unicode.org/reports/tr31/tr31-33.html

Section 3, Immutable Identifiers, https://www.unicode.org/reports/tr31/tr31-33.html#Immutable_Identifier_Syntax discusses this in some depth, offering the subset that Unichars calls “XML Characters” as a reasonable example of subsetting.  I reproduce some of the text:
===============
UAX31-R2. Immutable Identifiers: To meet this requirement, an implementation shall define identifiers to be any non-empty string of characters that contains no character having any of the following property values:

Pattern_White_Space=True
Pattern_Syntax=True
General_Category=Private_Use, Surrogate, or Control
Noncharacter_Code_Point=True

Alternatively, it shall declare that it uses a profile and define that profile with a precise specification of the characters that are added to or removed from the sets of code points defined by these properties.

In its profile, a specification can define identifiers to be more in accordance with the Unicode identifier definitions at the time the profile is adopted, while still allowing for strict immutability. 
================
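To make the property-based definition above more concrete, here is a partial Python sketch. It is an approximation under stated assumptions: Python's unicodedata module does not expose Pattern_White_Space or Pattern_Syntax, so the sketch hard-codes the small Pattern_White_Space set and omits the Pattern_Syntax test entirely. The function names are hypothetical.

```python
import unicodedata

# Assumption: the Pattern_White_Space set, hard-coded because the
# standard library does not expose this property.
PATTERN_WHITE_SPACE = frozenset(
    [0x0009, 0x000A, 0x000B, 0x000C, 0x000D, 0x0020,
     0x0085, 0x200E, 0x200F, 0x2028, 0x2029]
)


def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and for any code point ending in FFFE or FFFF."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def excluded_partial(ch: str) -> bool:
    """Partial UAX31-R2 exclusion test; Pattern_Syntax is NOT checked here."""
    cp = ord(ch)
    return (
        cp in PATTERN_WHITE_SPACE
        # General_Category = Private_Use (Co), Surrogate (Cs), or Control (Cc)
        or unicodedata.category(ch) in {"Co", "Cs", "Cc"}
        or is_noncharacter(cp)
    )


def is_immutable_identifier_partial(s: str) -> bool:
    """Non-empty string containing no character excluded by the partial test."""
    return len(s) > 0 and not any(excluded_partial(ch) for ch in s)


print(is_immutable_identifier_partial("résumé"))  # True
print(is_immutable_identifier_partial("a b"))     # False
```

Because Pattern_Syntax is left out, a string such as "a+b" passes this sketch even though a complete implementation of the requirement would reject it.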

The October 2024 W3C Internationalization Best Practices for Spec Developers https://www.w3.org/TR/international-specs/ says
================
Specifications SHOULD NOT arbitrarily exclude code points from the full range of Unicode code points from U+0000 to U+10FFFF inclusive.

Specifications MUST NOT allow code points above U+10FFFF.

Specifications SHOULD NOT allow the use of codepoints reserved by Unicode for internal use.

Specifications MUST NOT allow the use of unpaired surrogate code points.

Specifications SHOULD exclude compatibility characters in the syntactic elements (markup, delimiters, identifiers) of the formats they define.

Specifications SHOULD allow the full range of Unicode for user-defined values.
=================
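The "unpaired surrogate" rule in the list above is easiest to see at the UTF-16 code unit level. The following Python sketch, with a hypothetical function name, scans a sequence of 16-bit code units and reports whether any surrogate lacks its partner. It is an illustration of the idea, not text from the cited document.

```python
from typing import Sequence


def has_unpaired_surrogate(units: Sequence[int]) -> bool:
    """Return True if the UTF-16 code unit sequence contains a lone surrogate."""
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be immediately followed by a low surrogate.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return True
        if 0xDC00 <= u <= 0xDFFF:
            # A low surrogate that was not consumed as part of a pair.
            return True
        i += 1
    return False


# U+1F600 encodes as the surrogate pair D83D DE00.
print(has_unpaired_surrogate([0x0061, 0xD83D, 0xDE00]))  # False
print(has_unpaired_surrogate([0xD83D, 0x0061]))          # True
```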

-- 
Addison Phillips
Chair (W3C Internationalization WG)

Internationalization is not a feature.
It is an architecture.