Re: bug#60690: -P '\d' in GNU and git grep

demerphq <demerphq@xxxxxxxxx> · Thu, 6 Apr 2023 17:45:09 +0200

On Wed, 5 Apr 2023 at 20:32, Paul Eggert <eggert@xxxxxxxxxxx> wrote:
>
> On 2023-04-04 12:31, Junio C Hamano wrote:
>
> > My personal inclination is to let Perl folks decide
> > and follow them (even though I am skeptical about the wisdom of
> > letting '\d' match anything other than [0-9])
>
> I looked into what pcre2grep does. It has always done only 8-bit
> processing unless you use the -u or --utf option, so plain "pcre2grep
> '\d'" matches only ASCII digits.
>
> Although this causes pcre2grep to mishandle Unicode characters:
>
>    $ echo 'Ævar' | pcre2grep '[Ssß]'
>    Ævar
>
> it mimics Perl 5.36:
>
>    $ echo 'Ævar' | perl -ne 'print $_ if /[Ssß]/'
>    Ævar
>
> so this seems to be what Perl users expect, despite its infelicities.

Actually no, I think you have misunderstood what is happening at the
different layers involved here.

Your terminal is rendering ß as a glyph. But it is almost certainly
actually the octets C3 9F (which is the UTF8 canonical representation
of the codepoint U+DF). So the code you provided to perl is close to
the equivalent of

echo 'Ævar' | perl -ne 'print $_ if /[Ss\x{C3}\x{9F}]/'

And if you check, you will see that U+C6 "Æ" in utf8 is represented as
the octets C3 86.

So what you have done is the equivalent of:

perl -le'print "\x{C3}\x{86}"' | perl -ne'print $_ if /[Ss\x{C3}\x{9F}]/'

which of course matches. \x{C3} matches \x{C3} always and everywhere.

What you should have done is something like this:

$ echo 'Ævar' | perl -ne 'utf8::decode($_); print $_ if /[Ss\x{DF}]/u'
$ echo 'baß' | perl -MEncode -ne 'utf8::decode($_); print
encode_utf8($_) if /[Ss\x{DF}]/u'
baß
$ echo 'Ævar' | perl -MEncode -ne 'utf8::decode($_); print
encode_utf8($_) if /[Ss\x{C6}]/u'
Ævar
$ echo 'Ævar' | perl -MEncode -ne 'utf8::decode($_); print
encode_utf8($_) if /[Ss\x{e6}]/ui'
Ævar

The "utf8::decode($_)" tells perl to decode the input string as though
it contained utf8 (which in this case it does). THe /u suffix tells
the regex engine that you want Unicode semantics.

I believe that the same thing is true of your pcre2grep example. You
simply aren't checking what you think you are checking. You terminal
renders UTF8 as glyphs, but the programs you are feeding those glyphs
to aren't seeing glyphs, they are seeing UTF8 sequences as distinct
octets, and are not decoding their input back as codepoints.

You could have checked your assumptions by using the -Mre=debug option to perl:

$ echo 'Ævar' | perl -Mre=debug -ne 'print $_ if /[Ssß]/'
Compiling REx "[Ss%x{c3}%x{9f}]"
Final program:
   1: ANYOF[Ss\x9F\xC3] (11)
  11: END (0)
stclass ANYOF[Ss\x9F\xC3] minlen 1
Matching REx "[Ss%x{c3}%x{9f}]" against "%x{c3}%x{86}var%n"
Matching stclass ANYOF[Ss\x9F\xC3] against "%x{c3}%x{86}var%n" (6 bytes)
   0 <> <%x{c3}>             |   0| 1:ANYOF[Ss\x9F\xC3](11)
   1 <%x{c3}> <%x{86}var>    |   0| 11:END(0)
Match successful!
Ævar
Freeing REx: "[Ss%x{c3}%x{9f}]"

The line: Matching REx "[Ss%x{c3}%x{9f}]" against "%x{c3}%x{86}var%n"

basically says it all. Perl has not decoded the UTF8 into U+C6, and it
has not decoded the UTF8 for U+DF either. Instead you have asked it if
the UTF8 sequence that represents U+C6 contains any of the same octets
as the UTF8 representation of U+53, U+73 and U+DF would. Which gives
the common octet of \x{c3}.

> For better Unicode handling one can use pcre2grep's -u or --utf option,
> which causes pcre2grep to behave more like GNU grep -P and git grep -P:
> "echo 'Ævar' | pcre2grep -u '[Ssß]'" outputs nothing, which I think is
> what most people would expect (unless they're Perl users :-).

It is what Perl users would expect also, assuming you actually wrote
the character class [Ss\x{DF}] and asked for unicode semantics. \x{DF}
is the Latin1 codepoint range, so perl will assume that you meant
ASCII semantics unless you tell it otherwise.

Basically these tests you have quoted here are just examples of
garbage in garbage out.

Perl has been working together with the Unicode consortium for over 20
years. Afaik we were and are the reference implementation for the spec
on regular expression matching in Unicode and we have a long history
of working together with the Unicode consortium to refine and
implement the spec. You should assume that if Perl seems to have made
a gross error in how it does Unicode matching that you are simply
using it wrong, we take a great deal of pride in having the best
Unicode support there is.

https://unicode.org/reports/tr18/

FWIW, i think this email nicely illustrates the issues with git and
regular expressions. To do regular expressions properly you need to
know a) what semantics do you expect, b) how to decode the text you
are matching against. If you want unicode semantics you need to have a
way to ask for it. If you want to match against Unicode data then you
need a way to determine which of the 6 possible encodings[1] of
Unicode data you are using. If you get either wrong you will not get
the results you expect.  You may even want to deal with cases where
you want Unicode semantics, but to match against non-unicode data. For
instance Latin-1. In Latin-1 the codepoint U+DF is the *octet* 0xDF.
Maybe you want that octet to match "ss" case-insensitively, as a
German speaker would expect and as Unicode specifies is correct.  Or
vice versa, maybe you are like some of the posters to this thread who
seem to expect that \d should not match U+16B51 (as a Hmong speaker
might expect). Perl resolves these problems at the pattern level by
supporting the suffixes /a and /u (for ascii and unicode), and at the
string level it supports two type of string, unicode strings, and
binary/ASCII strings. By default input is the latter but there are a
variety of ways of saying that a file handle should decode to Unicode
instead.

cheers,
Yves
[1] UTF-EBCDIC, UTF-8, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE.

--
perl -Mre=debug -e "/just|another|perl|hacker/"