Re: [RFC PATCH] grep: default to posix digits with -P

Mathias Krause <minipli@xxxxxxxxxxxxxx> · Wed, 10 Jul 2024 15:49:28 +0200

On 02.01.24 20:02, Carlo Arenas wrote:
> On Mon, Jan 1, 2024 at 9:18 AM René Scharfe <l.s.r@xxxxxx> wrote:
>> Am 01.01.24 um 16:03 schrieb Carlo Marcelo Arenas Belón:
>>> @@ -321,17 +327,22 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
>>>               }
>>>               options |= PCRE2_CASELESS;
>>>       }
>>> -     if (!opt->ignore_locale && is_utf8_locale() && !literal)
>>> -             options |= (PCRE2_UTF | PCRE2_UCP | PCRE2_MATCH_INVALID_UTF);
>>> +     if (!opt->ignore_locale && is_utf8_locale() && !literal) {
>>> +             options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
>>>
>>> -#ifndef GIT_PCRE2_VERSION_10_35_OR_HIGHER
>>> -     /*
>>> -      * Work around a JIT bug related to invalid Unicode character handling
>>> -      * fixed in 10.35:
>>> -      * https://github.com/PCRE2Project/pcre2/commit/c21bd977547d
>>> -      */
>>> -     options &= ~PCRE2_UCP;
>>> +#ifdef GIT_PCRE2_VERSION_10_35_OR_HIGHER
>>> +             /*
>>> +              * Work around a JIT bug related to invalid Unicode character handling
>>> +              * fixed in 10.35:
>>> +              * https://github.com/PCRE2Project/pcre2/commit/c21bd977547d
>>> +              */
>>> +             options |= PCRE2_UCP;
>>> +#ifdef GIT_PCRE2_VERSION_10_43_OR_HIGHER
>>> +             if (!opt->perl_digit)
>>> +                     xoptions |= (PCRE2_EXTRA_ASCII_BSD | PCRE2_EXTRA_ASCII_DIGIT);
>>>  #endif
>>> +#endif
>>
>> Why do we need that extra level of indentation?
> 
> I was just trying to simplify the code by including all the logic in
> one single set.
> 
> The original lack of indentation that was introduced by later fixes to
> the code was IMHO also misguided since the obvious "objective" as set
> in the original code that added PCRE2_UCP was that it should be used
> whenever UTF was also in use as shown by
> acabd2048ee0ee53728100408970ab45a6dab65e.
> 
> Of course, we soon found out that the original implementation of
> PCRE2_MATCH_INVALID_UTF that came with PCRE2 10.34 was buggy and so an
> exception was introduced in 14b9a044798ebb3858a1f1a1377309a3d6054ac8.

My early fix attempt[1] also had it indented, but Junio argued against
it[2]. I see no reason to change that now.

[1]
https://lore.kernel.org/git/20230323144000.21146-1-minipli@xxxxxxxxxxxxxx/
[2] https://lore.kernel.org/git/xmqq355va1a2.fsf@gitster.g/

> 
> Note that the problematic code is only relevant when JIT is also
> enabled, but JIT is almost always enabled.

Right. But it doesn't hurt to mask a bit that isn't set; the compiler
will figure that out, I guess.
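
To illustrate why the unconditional mask is harmless, here is a minimal
sketch. The DEMO_* values are stand-ins made up for this mail, not the
real PCRE2_* constants from <pcre2.h>, and mask_ucp() is a hypothetical
helper, not a function in grep.c:

```c
#include <stdint.h>

/*
 * Stand-in flag values for illustration only; real code would use the
 * PCRE2_* constants from <pcre2.h>.
 */
#define DEMO_UTF               0x1u
#define DEMO_UCP               0x2u
#define DEMO_MATCH_INVALID_UTF 0x4u

/*
 * Clear the UCP bit unconditionally, the way the existing workaround
 * does. If the bit was never set, the value comes back unchanged.
 */
static uint32_t mask_ucp(uint32_t options)
{
	return options & ~DEMO_UCP;
}
```

Calling mask_ucp() on a value without DEMO_UCP returns it as-is, and on
a value with all three bits set it only drops DEMO_UCP.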

> 
>> The old code turned PCRE2_UCP on by default and turned it off for older
>> versions. The new code enables PCRE2_UCP only for newer versions.  The
>> result should be the same, no?  So why change that part at all?
> 
> Because it gets us a little closer to the real reason why we need to
> disable UCP for anything older than 10.35, and that is that there is a
> bug there that is ONLY relevant if we are using JIT.
> 
> My hope though is that with the release of 10.43 (currently in RC1),
> 10.34 will become irrelevant soon enough and this whole code could be
> cleaned up further.
> 
>> But the comment is now off, isn't it?  The workaround was turning
>> PCRE2_UCP off for older versions (because those were broken), not
>> turning it on for newer versions (because it would be required by some
>> unfixed regression).
> 
> The comment was never correct, because it was turning it off, because
> the combination of JIT + MATCH_INVALID_UTF (which was released in
> 10.34) + UCP is broken.

And what makes the comment wrong? It mentions "JIT", "invalid Unicode
character handling", "bug" and even the URL of the PCRE2 commit fixing
the bug.

Moreover, your proposed change makes the comment look wrong, as it
negates the preprocessor test and sets the PCRE2_UCP bit instead of
masking it, suggesting that *this* works around the bug, while it's
actually the opposite.

So, yes, IMHO we should leave that part as-is.
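
For reference, a rough sketch of the two shapes side by side. This is
not the code from grep.c: the DEMO_* values are made-up stand-ins for
the PCRE2_* constants, the function names are hypothetical, and the
version check is modelled as a runtime flag instead of the
GIT_PCRE2_VERSION_10_35_OR_HIGHER preprocessor test, just so both
variants can be compared in one program:

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in flag values for illustration only, not the real PCRE2_* ones. */
#define DEMO_UTF               0x1u
#define DEMO_UCP               0x2u
#define DEMO_MATCH_INVALID_UTF 0x4u

/*
 * Existing shape: enable UCP together with the other UTF flags, then
 * mask it off again for versions older than 10.35 (the workaround).
 */
static uint32_t options_current(bool utf8, bool v10_35_or_higher)
{
	uint32_t options = 0;

	if (utf8)
		options |= DEMO_UTF | DEMO_UCP | DEMO_MATCH_INVALID_UTF;
	if (!v10_35_or_higher)
		options &= ~DEMO_UCP;
	return options;
}

/*
 * Proposed shape: enable UCP only for 10.35 and newer, nested inside
 * the UTF-8 branch.
 */
static uint32_t options_proposed(bool utf8, bool v10_35_or_higher)
{
	uint32_t options = 0;

	if (utf8) {
		options |= DEMO_UTF | DEMO_MATCH_INVALID_UTF;
		if (v10_35_or_higher)
			options |= DEMO_UCP;
	}
	return options;
}
```

Both functions return the same value for every combination of the two
inputs, which is René's point: the resulting option bits don't change,
only how the workaround reads.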

> [snip]

Cheers,
Mathias