Re: [RFC PATCH] grep: default to posix digits with -P

Carlo Arenas <carenas@xxxxxxxxx> · Tue, 2 Jan 2024 11:02:10 -0800

On Mon, Jan 1, 2024 at 9:18 AM René Scharfe <l.s.r@xxxxxx> wrote:
>
> Am 01.01.24 um 16:03 schrieb Carlo Marcelo Arenas Belón:
> > @@ -321,17 +327,22 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
> >               }
> >               options |= PCRE2_CASELESS;
> >       }
> > -     if (!opt->ignore_locale && is_utf8_locale() && !literal)
> > -             options |= (PCRE2_UTF | PCRE2_UCP | PCRE2_MATCH_INVALID_UTF);
> > +     if (!opt->ignore_locale && is_utf8_locale() && !literal) {
> > +             options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
> >
> > -#ifndef GIT_PCRE2_VERSION_10_35_OR_HIGHER
> > -     /*
> > -      * Work around a JIT bug related to invalid Unicode character handling
> > -      * fixed in 10.35:
> > -      * https://github.com/PCRE2Project/pcre2/commit/c21bd977547d
> > -      */
> > -     options &= ~PCRE2_UCP;
> > +#ifdef GIT_PCRE2_VERSION_10_35_OR_HIGHER
> > +             /*
> > +              * Work around a JIT bug related to invalid Unicode character handling
> > +              * fixed in 10.35:
> > +              * https://github.com/PCRE2Project/pcre2/commit/c21bd977547d
> > +              */
> > +             options |= PCRE2_UCP;
> > +#ifdef GIT_PCRE2_VERSION_10_43_OR_HIGHER
> > +             if (!opt->perl_digit)
> > +                     xoptions |= (PCRE2_EXTRA_ASCII_BSD | PCRE2_EXTRA_ASCII_DIGIT);
> >  #endif
> > +#endif
>
> Why do we need that extra level of indentation?

I was just trying to simplify the code by including all the logic in
one single set.

The original lack of indentation that was introduced by later fixes to
the code was IMHO also misguided since the obvious "objective" as set
in the original code that added PCRE2_UCP was that it should be used
whenever UTF was also in use as shown by
acabd2048ee0ee53728100408970ab45a6dab65e.

Of course, we soon found out that the original implementation of
PCRE2_MATCH_INVALID_UTF that came with PCRE2 10.34 was buggy and so an
exception was introduced in 14b9a044798ebb3858a1f1a1377309a3d6054ac8.

Note that the problematic code is only relevant when JIT is also
enabled, but JIT is almost always enabled.

> The old code turned PCRE2_UCP on by default and turned it off for older
> versions. The new code enables PCRE2_UCP only for newer versions.  The
> result should be the same, no?  So why change that part at all?

Because it gets us a little closer to the real reason why we need to
disable UCP for anything older than 10.35, and that is that there is a
bug there that is ONLY relevant if we are using JIT.

My hope though is that with the release of 10.43 (currently in RC1),
10.34 will become irrelevant soon enough and this whole code could be
cleaned up further.

> But the comment is now off, isn't it?  The workaround was turning
> PCRE2_UCP off for older versions (because those were broken), not
> turning it on for newer versions (because it would be required by some
> unfixed regression).

The comment was never correct, because it was turning it off, because
the combination of JIT + MATCH_INVALID_UTF (which was released in
10.34) + UCP is broken.

> Do we need to nest the checks for GIT_PCRE2_VERSION_10_35_OR_HIGHER and
> GIT_PCRE2_VERSION_10_43_OR_HIGHER?  Keeping them separate would help
> keep the #ifdef parts as short as possible and maintainable
> independently.

I thought that nesting them would make it simpler to maintain, since
there are somehow connected anyway.

The additional flags that are added in PCRE 10.43 are ONLY needed and
useful on top of PCRE2_UCP, which itself only makes sense on top of
UTF.

> > @@ -27,13 +34,13 @@ do
> >       if ! test_have_prereq PERF_GREP_ENGINES_THREADS
> >       then
> >               test_perf "grep -P '$pattern'" --prereq PCRE "
> > -                     git -P grep -f pat || :
> > +                     git grep -P -f pat || :
> >               "
> >       else
> >               for threads in $GIT_PERF_GREP_THREADS
> >               do
> >                       test_perf "grep -P '$pattern' with $threads threads" --prereq PTHREADS,PCRE "
> > -                             git -c grep.threads=$threads -P grep -f pat || :
> > +                             git -c grep.threads=$threads grep -P -f pat || :
>
> What's going on here?  You remove the -P option of git (--no-pager) and
> add the -P option of git grep (--perl-regexp).  So this perf test never
> tested PCRE despite its name?  Seems worth a separate patch.

My original code was a dud.  This would make that at least more obvious,

> Do the performance numbers in acabd2048e (grep: correctly identify utf-8
> characters with \{b,w} in -P, 2023-01-08) still hold up in that light?

FWIW the performance numbers still hold up, but just because I did the
tests independently using hyperfine, and indeed when I did the first
version of this patch, I had a nice reproducible description that
showed how to get 20% better performance while searching the git
repository itself for something like 4 digits (as used in Copyright
dates).

My idea was to reply to my own post with instructions on how to test
this, which basically require to also compile the recently released
10.43RC1 and build against that.

Indeed, AFAIK the test would fail if built with an older version of
PCRE, but wasn't sure if making a prerequisite that hardcoded the
version in test-tool might be the best approach to prevent that,
especially with all the libification work.

Carlo

PS. your reply got lost somehow in my SPAM folder, so I apologize for
the late reply