Re: bug#60690: -P '\d' in GNU and git grep

Paul Eggert <eggert@xxxxxxxxxxx> · Fri, 7 Apr 2023 12:00:16 -0700

On 2023-04-06 06:39, demerphq wrote:

> Unicode specifies that \d match any digit
> in any script that it supports.

"Specifies" is too strong. The Unicode Regular Expressions technical 
standard (UTS#18) mentions \d only in Annex C[1], next to the word 
"digit" in a column labeled "Property" (even though \d is really syntax, 
not a property). This is at best an informal recommendation, not a 
requirement, as UTS#18 0.2[2] says that UTS#18's syntax is only for 
illustration and that although it's similar to Perl's, the two syntax 
forms may not be exactly the same. So we can't look to UTS#18 for a 
definitive way out of the \d mess, as the Unicode folks specifically 
delegated matters to us.

Even ignoring the \d issue, the digit situation is messy. UTS#18 Annex C 
says "\p{gc=Decimal_Number}" is the standard recommended syntax 
assignment for digits. However, PCRE2 does not support this syntax; it 
supports another variant \p{Nd} that UTS#18 also recommends. So it 
appears that PCRE2 already does not implement every recommended aspect 
of UTS#18 syntax. PCRE2 also doesn't match Perl, which does support 
"\p{gc=Decimal_Number}".

Anyway, since grep -P '\p{Nd}' implements Unicode's decimal digit class, 
that's clearly enough for grep -P to conform to UTS#18 with respect to 
digits.
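
To make the distinction concrete, here is a small Python sketch. It is only an analogy: Python's "re" module is not PCRE2 or grep, and the helper name classify is made up for this illustration. It compares an ASCII-only \d, Python's default \d for str patterns (documented as matching category Nd), and a direct check of the Unicode general category.

```python
import re
import unicodedata

# ASCII-only \d: matches just 0-9.
ASCII_DIGIT = re.compile(r"\d", re.ASCII)
# Default \d for str patterns in Python: any Unicode decimal digit (category Nd).
UNICODE_DIGIT = re.compile(r"\d")


def classify(ch):
    """Return (ascii \\d matches, unicode \\d matches, category is Nd) for one character.

    Hypothetical helper for illustration only.
    """
    return (
        ASCII_DIGIT.fullmatch(ch) is not None,
        UNICODE_DIGIT.fullmatch(ch) is not None,
        unicodedata.category(ch) == "Nd",
    )


if __name__ == "__main__":
    # ASCII 7, ARABIC-INDIC DIGIT SEVEN, DEVANAGARI DIGIT THREE,
    # SUPERSCRIPT TWO (category No, not Nd), and a plain letter.
    for ch in ["7", "\u0667", "\u0969", "\u00b2", "x"]:
        print(f"U+{ord(ch):04X}", classify(ch))
```

The point of the sketch is that "digit" has at least two defensible meanings, and which one \d gets is a choice made by the regex implementation, not something UTS#18 settles.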

> A) how do you tell the regular expression
> engine what semantics you want and B) how does the regular expression
> library identify the encoding in the file, and how does it handle
> malformed content in that file.

Here's how GNU grep does it:

* RE semantics are specified via command-line options like -P.

* Text encoding is specified by locale, e.g., LC_ALL='en_US.utf8'.

* REs do not match encoding errors.
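
The last point can be sketched in Python. This is not how GNU grep is implemented; it is only an illustration of the idea, it assumes UTF-8 rather than consulting the locale, and the function names valid_utf8_runs and search_skipping_errors are invented here. Undecodable bytes are treated as hard boundaries, so no part of a pattern (not even ".") can match them.

```python
import re

# With errors="surrogateescape", each undecodable byte 0x80-0xFF becomes a
# lone surrogate in the range U+DC80..U+DCFF, which marks it unambiguously.
_ENCODING_ERROR = re.compile(r"[\udc80-\udcff]+")


def valid_utf8_runs(data):
    """Yield the maximal runs of `data` that decode cleanly as UTF-8."""
    text = data.decode("utf-8", errors="surrogateescape")
    for run in _ENCODING_ERROR.split(text):
        if run:
            yield run


def search_skipping_errors(pattern, data):
    """Return all matches of `pattern`, never matching across or on an encoding error."""
    rx = re.compile(pattern)
    return [m.group() for run in valid_utf8_runs(data) for m in rx.finditer(run)]


if __name__ == "__main__":
    # b"\xff" is never valid in UTF-8; b"\xd9\xa7" is ARABIC-INDIC DIGIT SEVEN.
    line = b"ab\xff12\xc3\xa9 \xd9\xa7"
    print(search_skipping_errors(r"\d+", line))
    # "." cannot stand in for the invalid byte, so this finds nothing.
    print(search_skipping_errors(r"a.c", b"a\xffc"))
```

Splitting on the errors is the simplest way to model the rule; a real matcher would more likely handle invalid bytes inside the matching engine itself.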

> on *nix there is no tradition of using BOM's to
> distinguish the 6 different possible encodings of Unicode (UTF-8,
> UTF-EBCDIC, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE)

Yes, GNU/Linux never really experienced the joys of UTF-EBCDIC, Oracle 
UTFE, UTF-16LE vs UTF-16BE etc. If you're running legacy IBM mainframe 
or MS-Windows code, these encodings are obviously a big deal. 
However, there seems little reason to force their nontrivial hassles 
onto every GNU/Linux program that processes text. A few specialized apps 
like 'iconv' deal with offbeat encodings, and that is probably a better 
approach all around.

> there seems
> to be some level of desire of matching with unicode semantics against
> files that are not uniformly encoded in one of these formats.

That is a use case, yes. It's what 'strings' and 'grep' do.

[1]: https://unicode.org/reports/tr18/#Compatibility_Properties
[2]: https://unicode.org/reports/tr18/#Conformance