On Fri, Apr 7, 2023 at 12:00 PM Paul Eggert <eggert@xxxxxxxxxxx> wrote:
>
> On 2023-04-06 06:39, demerphq wrote:
>
> > Unicode specifies that \d match any digit
> > in any script that it supports.
>
> "Specifies" is too strong. The Unicode Regular Expressions technical
> standard (UTS#18) mentions \d only in Annex C[1], next to the word
> "digit" in a column labeled "Property" (even though \d is really syntax
> not a property). This is at best an informal recommendation, not a
> requirement, as UTS#18 0.2[2] says that UTS#18's syntax is only for
> illustration and that although it's similar to Perl's, the two syntax
> forms may not be exactly the same. So we can't look to UTS#18 for a
> definitive way out of the \d mess, as the Unicode folks specifically
> delegated matters to us.
>
> Even ignoring the \d issue the digit situation is messy. UTS#18 Annex C
> says "\p{gc=Decimal_Number}" is the standard recommended syntax
> assignment for digits. However, PCRE2 does not support this syntax; it
> supports another variant \p{Nd} that UTS#18 also recommends. So it
> appears that PCRE2 already does not implement every recommended aspect
> of UTS#18 syntax. PCRE2 also doesn't match Perl, which does support
> "\p{gc=Decimal_Number}".

Not sure I follow the whole logic here, but PCRE2[3] (search for
"general category" which is what the "gc" above stands for) only
supports the abbreviated form of the unicode classes and `Nd` is
indeed the one that corresponds to `Decimal_Number`.

Carlo

[1]: https://unicode.org/reports/tr18/#Compatibility_Properties
[2]: https://unicode.org/reports/tr18/#Conformance
[3]: https://pcre2project.github.io/pcre2/doc/html/pcre2pattern.html
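
To make the `Nd` / `Decimal_Number` correspondence concrete, here is a rough sketch in Python (not PCRE2 or Perl, so it says nothing about how either engine behaves). Python's `re` module has no `\p{...}` syntax, so the sketch checks the general category through `unicodedata.category()`, which reports the abbreviated name `Nd`. The helper `is_decimal_number` and the `samples` list are made up for illustration.

```python
import re
import unicodedata


def is_decimal_number(ch: str) -> bool:
    """Hypothetical helper: True if ch has general category Nd (Decimal_Number)."""
    return unicodedata.category(ch) == "Nd"


# Illustrative sample characters (an assumption, not taken from the thread):
# ASCII 7, ARABIC-INDIC DIGIT THREE, DEVANAGARI DIGIT THREE,
# SUPERSCRIPT TWO (category No, not Nd), and a plain letter.
samples = ["7", "\u0663", "\u0969", "\u00b2", "x"]

# In Python, \d on str patterns follows the Unicode decimal digits by default;
# re.ASCII narrows it to [0-9].
unicode_digit = re.compile(r"\d")
ascii_digit = re.compile(r"\d", re.ASCII)

for ch in samples:
    print(
        f"U+{ord(ch):04X}",
        f"gc={unicodedata.category(ch)}",
        f"Nd={is_decimal_number(ch)}",
        f"\\d={bool(unicode_digit.fullmatch(ch))}",
        f"\\d(ASCII)={bool(ascii_digit.fullmatch(ch))}",
    )
```

The gap between the two `\d` columns for the non-ASCII digits is the same ambiguity the thread is about: whether `\d` should mean `[0-9]` or every character in `Nd`.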