Re: bug#60690: -P '\d' in GNU and git grep

demerphq <demerphq@xxxxxxxxx> · Thu, 6 Apr 2023 15:39:31 +0200

On Tue, 4 Apr 2023 at 21:31, Junio C Hamano <gitster@xxxxxxxxx> wrote:
>
> Paul Eggert <eggert@xxxxxxxxxxx> writes:
>
> > This is an evolving area. Git master is fiddling with flags and
> > options, and so is GNU grep master, and so is PCRE2, and there are
> > bugs. If you're running bleeding-edge versions of this code you'll get
> > different behavior than if you're running grep 3.8, pcregrep 8.45,
> > Perl 5.36, and git 2.39.2 (which is what Fedora 37 has).
> >
> > What I'm fearing is that we may evolve into mutually incompatible
> > interpretations of how Perl regular expressions deal with UTF-8
> > text. That'd be a recipe for confusion down the road.
>
> Nicely said.  My personal inclination is to let Perl folks decide
> and follow them (even though I am skeptical about the wisdom of
> letting '\d' match anything other than [0-9]), but even in Git
> circle there would be different opinions, so I am glad that the
> discussion is visible on the list to those who are intrested.

Perl matches Unicode text according to the rules specified by the
Unicode consortium. It is the reference implementation for Unicode
regular expression matching. Unicode specifies that \d match any digit
in any script that it supports. Thus \d matches far more codepoints
than \p{PosixDigit} or [0-9] would.  Be aware that Unicode
distinguishes numbers from digits, e.g., \x{1EC9E} represents a Lakh,
which is used in many Indian languages for 100,000, but which is not
considered a *digit* for obvious reasons.
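
FWIW, the digit/number split is easy to see outside of Perl too. The
following is only a sketch in Python, not Perl: it assumes Python's
documented behaviour that \d on str patterns matches characters in
Unicode category Nd, and that re.ASCII restricts it to [0-9]. The
sample characters and the helper name classify() are my own choices
for illustration.

```python
import re
import unicodedata

# Sample characters chosen for illustration only.
samples = {
    "ASCII five": "5",
    "Arabic-Indic three": "\u0663",
    "Devanagari three": "\u0969",
    "vulgar fraction one half": "\u00bd",  # a number, but not a digit
}

def classify(ch):
    """Report how one character is treated by each flavour of 'digit'."""
    return {
        "category": unicodedata.category(ch),
        "unicode_d": re.fullmatch(r"\d", ch) is not None,
        "ascii_d": re.fullmatch(r"\d", ch, re.ASCII) is not None,
        "bracket": re.fullmatch(r"[0-9]", ch) is not None,
    }

for name, ch in samples.items():
    print(name, classify(ch))
```

The fraction has a numeric value but sits in category No rather than
Nd, so no flavour of \d matches it, while the Arabic-Indic and
Devanagari digits match only the Unicode-aware \d.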

FWIW, someone mentioned [[:digit:]], which matches the same as \d does
on Unicode strings and under the /u matching flag for regexes in Perl.
Arguably this was a mistake: [[:digit:]] is a POSIX character class,
and POSIX doesn't support Unicode, so it should have matched [0-9] or
\p{PosixDigit}. But historically \d and [[:digit:]] in Perl were the
same, and when \d was extended to meet the Unicode specification
[[:digit:]] came along for the ride, likely inadvertently. Thus
\p{PosixDigit} is equivalent to [0-9], but \p{XPosixDigit} is
equivalent to \d and [[:digit:]].

I notice that other posts in this thread have moved the conversation
on, and covered most of the points I wanted to make here. However I
wanted to say that there seem to be two different issues here. The
first is "what semantics do I expect from my regular expressions":
Unicode or legacy ASCII. Mostly this relates to case-insensitive
matching, but things like \d also surface discrepancies. The second is
"what encodings does the regular expression engine understand".
Unfortunately on *nix there is no tradition of using BOMs to
distinguish the 6 different possible encodings of Unicode (UTF-8,
UTF-EBCDIC, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE), and there seems
to be some desire to match with Unicode semantics against files that
are not uniformly encoded in one of these formats.
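
For what a BOM check would look like where BOMs are present, here is
a minimal sketch in Python using the BOM constants from the standard
codecs module. The function name sniff_bom() is hypothetical, and
UTF-EBCDIC is left out because Python ships no codec for it. This is
not what grep, git or PCRE2 do; it only illustrates the idea.

```python
import codecs

# Check UTF-32 before UTF-16: the UTF-32LE BOM (FF FE 00 00) begins
# with the UTF-16LE BOM (FF FE), so the order of this list matters.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

print(sniff_bom(codecs.BOM_UTF8 + b"abc"))
print(sniff_bom(b"no bom here"))
```

A file without a BOM simply yields (None, 0), which is the common
case on *nix, and the caller is back to guessing or to trusting the
locale.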

So the question comes up: A) how do you tell the regular expression
engine what semantics you want, and B) how does the regular expression
library identify the encoding of the file, and how does it handle
malformed content in that file? For instance, if I have a file which
contains snippets of UTF-8 encoded data *and* snippets of data that is
illegal in UTF-8, what should the regular expression engine do if it is
asked to do a case-insensitive match against that file?
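
To make the mixed-content case concrete, here is a sketch in Python
(again, not Perl or PCRE2) of two possible policies. The sample bytes
and the helper name strict_decode_fails() are made up for
illustration. One policy matches on raw bytes, where Python only
folds case for ASCII letters. The other decodes with the
"surrogateescape" error handler, which smuggles invalid bytes through
as lone surrogates so the valid parts can be matched with Unicode
semantics.

```python
import re

# Valid UTF-8 "café", two bytes that are illegal in UTF-8, then "CAFÉ".
data = b"caf\xc3\xa9 \xff\xfe CAF\xc3\x89"

def strict_decode_fails(raw: bytes) -> bool:
    """True if the bytes are not well-formed UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False

# Policy 1: match on bytes. Case folding applies to ASCII only, so
# the upper-case E-acute (C3 89) does not match the lower-case one.
bytes_hits = re.findall("caf\u00e9".encode("utf-8"), data, re.IGNORECASE)

# Policy 2: decode leniently, then match with Unicode semantics.
text = data.decode("utf-8", errors="surrogateescape")
text_hits = re.findall("caf\u00e9", text, re.IGNORECASE)

# The lenient decode is lossless: the original bytes can be recovered.
round_trip = text.encode("utf-8", errors="surrogateescape")

print(strict_decode_fails(data), bytes_hits, text_hits, round_trip == data)
```

Neither policy is obviously right; the point is only that the engine
has to pick one, and the two give different answers on the same file.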

cheers,
yves