Re: [RFC PATCH] grep: default to posix digits with -P

René Scharfe <l.s.r@xxxxxx> · Thu, 4 Jan 2024 22:14:23 +0100

Am 02.01.24 um 20:02 schrieb Carlo Arenas:
> On Mon, Jan 1, 2024 at 9:18 AM René Scharfe <l.s.r@xxxxxx> wrote:
>>
>> Am 01.01.24 um 16:03 schrieb Carlo Marcelo Arenas Belón:
>>> @@ -321,17 +327,22 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
>>>               }
>>>               options |= PCRE2_CASELESS;
>>>       }
>>> -     if (!opt->ignore_locale && is_utf8_locale() && !literal)
>>> -             options |= (PCRE2_UTF | PCRE2_UCP | PCRE2_MATCH_INVALID_UTF);
>>> +     if (!opt->ignore_locale && is_utf8_locale() && !literal) {
>>> +             options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
>>>
>>> -#ifndef GIT_PCRE2_VERSION_10_35_OR_HIGHER
>>> -     /*
>>> -      * Work around a JIT bug related to invalid Unicode character handling
>>> -      * fixed in 10.35:
>>> -      * https://github.com/PCRE2Project/pcre2/commit/c21bd977547d
>>> -      */
>>> -     options &= ~PCRE2_UCP;
>>> +#ifdef GIT_PCRE2_VERSION_10_35_OR_HIGHER
>>> +             /*
>>> +              * Work around a JIT bug related to invalid Unicode character handling
>>> +              * fixed in 10.35:
>>> +              * https://github.com/PCRE2Project/pcre2/commit/c21bd977547d
>>> +              */
>>> +             options |= PCRE2_UCP;
>>> +#ifdef GIT_PCRE2_VERSION_10_43_OR_HIGHER
>>> +             if (!opt->perl_digit)
>>> +                     xoptions |= (PCRE2_EXTRA_ASCII_BSD | PCRE2_EXTRA_ASCII_DIGIT);
>>>  #endif
>>> +#endif
>>
>> Why do we need that extra level of indentation?
>
> I was just trying to simplify the code by including all the logic in
> one single set.
>
> The original lack of indentation that was introduced by later fixes to
> the code was IMHO also misguided since the obvious "objective" as set
> in the original code that added PCRE2_UCP was that it should be used
> whenever UTF was also in use as shown by
> acabd2048ee0ee53728100408970ab45a6dab65e.

One level of indentation is correct because the code in question is part
of a function and it's not nested in a loop or conditional.  And
preprocessor directives don't affect the indentation of C code.  Am I
missing something?

> Of course, we soon found out that the original implementation of
> PCRE2_MATCH_INVALID_UTF that came with PCRE2 10.34 was buggy and so an
> exception was introduced in 14b9a044798ebb3858a1f1a1377309a3d6054ac8.
>
> Note that the problematic code is only relevant when JIT is also
> enabled, but JIT is almost always enabled.
>
>> The old code turned PCRE2_UCP on by default and turned it off for older
>> versions. The new code enables PCRE2_UCP only for newer versions.  The
>> result should be the same, no?  So why change that part at all?
>
> Because it gets us a little closer to the real reason why we need to
> disable UCP for anything older than 10.35, and that is that there is a
> bug there that is ONLY relevant if we are using JIT.

How so?  If the same flags are set in the end then it seems like a
lateral movement to me.

Do you want to disable JIT compilation for older versions?

> My hope though is that with the release of 10.43 (currently in RC1),
> 10.34 will become irrelevant soon enough and this whole code could be
> cleaned up further.

Cleanup is good, but if we package the workarounds nicely we can keep
them around indefinitely without them bothering us.  Debian buster
still has a few months of support left and comes with PCRE2 10.32..

>> But the comment is now off, isn't it?  The workaround was turning
>> PCRE2_UCP off for older versions (because those were broken), not
>> turning it on for newer versions (because it would be required by some
>> unfixed regression).
>
> The comment was never correct, because it was turning it off, because
> the combination of JIT + MATCH_INVALID_UTF (which was released in
> 10.34) + UCP is broken.

The code disabled PCRE2_UCP for PCRE2 10.34 and older.  The comment
claimed that this was done as a workaround for a bug fixed in 10.35.
You seem to say the same.  What was incorrect about the comment?

>> Do we need to nest the checks for GIT_PCRE2_VERSION_10_35_OR_HIGHER and
>> GIT_PCRE2_VERSION_10_43_OR_HIGHER?  Keeping them separate would help
>> keep the #ifdef parts as short as possible and maintainable
>> independently.
>
> I thought that nesting them would make it simpler to maintain, since
> there are somehow connected anyway.
>
> The additional flags that are added in PCRE 10.43 are ONLY needed and
> useful on top of PCRE2_UCP, which itself only makes sense on top of
> UTF.

This conditional stacking is complicated.  I find the old model where
we say which features we want up front and then disable buggy ones
for specific versions easier to grasp.

>>> @@ -27,13 +34,13 @@ do
>>>       if ! test_have_prereq PERF_GREP_ENGINES_THREADS
>>>       then
>>>               test_perf "grep -P '$pattern'" --prereq PCRE "
>>> -                     git -P grep -f pat || :
>>> +                     git grep -P -f pat || :
>>>               "
>>>       else
>>>               for threads in $GIT_PERF_GREP_THREADS
>>>               do
>>>                       test_perf "grep -P '$pattern' with $threads threads" --prereq PTHREADS,PCRE "
>>> -                             git -c grep.threads=$threads -P grep -f pat || :
>>> +                             git -c grep.threads=$threads grep -P -f pat || :
>>
>> What's going on here?  You remove the -P option of git (--no-pager) and
>> add the -P option of git grep (--perl-regexp).  So this perf test never
>> tested PCRE despite its name?  Seems worth a separate patch.
>
> My original code was a dud.  This would make that at least more obvious,

The change is good, but I don't see any connection to the
grep.perl.digit feature that this patch is primarily about,
so I (still) think it deserves its own patch.

>> Do the performance numbers in acabd2048e (grep: correctly identify utf-8
>> characters with \{b,w} in -P, 2023-01-08) still hold up in that light?
>
> FWIW the performance numbers still hold up, but just because I did the
> tests independently using hyperfine, and indeed when I did the first
> version of this patch, I had a nice reproducible description that
> showed how to get 20% better performance while searching the git
> repository itself for something like 4 digits (as used in Copyright
> dates).

Great!

> My idea was to reply to my own post with instructions on how to test
> this, which basically require to also compile the recently released
> 10.43RC1 and build against that.

OK, so this is about the grep.perl.digit feature, right?

What I meant above was: Is the statement in acabd2048e (grep: correctly
identify utf-8 characters with \{b,w} in -P, 2023-01-08) about 20-40%
performance impact (in which direction, by the way) still measurable
with the fixed perf test script?

With the fix and PCRE2 10.42 and PCRE2_UCP I get:

Test                           this tree
----------------------------------------------
7822.1: grep -P '\bhow'        0.05(0.02+0.30)
7822.2: grep -P '\bÆvar'       0.05(0.02+0.29)
7822.3: grep -P '\d+ \bÆvar'   0.05(0.02+0.27)
7822.4: grep -P '\bBelón\b'    0.04(0.02+0.23)
7822.5: grep -P '\w{12}\b'     0.03(0.02+0.13)

... and without PCRE2_UCP:

Test                           this tree
----------------------------------------------
7822.1: grep -P '\bhow'        0.05(0.02+0.26)
7822.2: grep -P '\bÆvar'       0.04(0.02+0.18)
7822.3: grep -P '\d+ \bÆvar'   0.05(0.02+0.26)
7822.4: grep -P '\bBelón\b'    0.05(0.02+0.27)
7822.5: grep -P '\w{12}\b'     0.04(0.02+0.18)

Hmm, inconclusive.  Perhaps the test data is too small?

> Indeed, AFAIK the test would fail if built with an older version of
> PCRE, but wasn't sure if making a prerequisite that hardcoded the
> version in test-tool might be the best approach to prevent that,
> especially with all the libification work.

You could extend test-pcre2-config to report whether that feature is
active.  Performance tests could set prerequisites based on its output.

> PS. your reply got lost somehow in my SPAM folder, so I apologize for
> the late reply

No worries, I need days to reply without any detour through the
spam folder..

René