On Mon, Jan 1, 2024 at 9:18 AM René Scharfe <l.s.r@xxxxxx> wrote: > > Am 01.01.24 um 16:03 schrieb Carlo Marcelo Arenas Belón: > > @@ -321,17 +327,22 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt > > } > > options |= PCRE2_CASELESS; > > } > > - if (!opt->ignore_locale && is_utf8_locale() && !literal) > > - options |= (PCRE2_UTF | PCRE2_UCP | PCRE2_MATCH_INVALID_UTF); > > + if (!opt->ignore_locale && is_utf8_locale() && !literal) { > > + options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF); > > > > -#ifndef GIT_PCRE2_VERSION_10_35_OR_HIGHER > > - /* > > - * Work around a JIT bug related to invalid Unicode character handling > > - * fixed in 10.35: > > - * https://github.com/PCRE2Project/pcre2/commit/c21bd977547d > > - */ > > - options &= ~PCRE2_UCP; > > +#ifdef GIT_PCRE2_VERSION_10_35_OR_HIGHER > > + /* > > + * Work around a JIT bug related to invalid Unicode character handling > > + * fixed in 10.35: > > + * https://github.com/PCRE2Project/pcre2/commit/c21bd977547d > > + */ > > + options |= PCRE2_UCP; > > +#ifdef GIT_PCRE2_VERSION_10_43_OR_HIGHER > > + if (!opt->perl_digit) > > + xoptions |= (PCRE2_EXTRA_ASCII_BSD | PCRE2_EXTRA_ASCII_DIGIT); > > #endif > > +#endif > > Why do we need that extra level of indentation? I was just trying to simplify the code by including all the logic in one single set. The original lack of indentation that was introduced by later fixes to the code was IMHO also misguided since the obvious "objective" as set in the original code that added PCRE2_UCP was that it should be used whenever UTF was also in use as shown by acabd2048ee0ee53728100408970ab45a6dab65e. Of course, we soon found out that the original implementation of PCRE2_MATCH_INVALID_UTF that came with PCRE2 10.34 was buggy and so an exception was introduced in 14b9a044798ebb3858a1f1a1377309a3d6054ac8. Note that the problematic code is only relevant when JIT is also enabled, but JIT is almost always enabled. > The old code turned PCRE2_UCP on by default and turned it off for older > versions. The new code enables PCRE2_UCP only for newer versions. The > result should be the same, no? So why change that part at all? Because it gets us a little closer to the real reason why we need to disable UCP for anything older than 10.35, and that is that there is a bug there that is ONLY relevant if we are using JIT. My hope though is that with the release of 10.43 (currently in RC1), 10.34 will become irrelevant soon enough and this whole code could be cleaned up further. > But the comment is now off, isn't it? The workaround was turning > PCRE2_UCP off for older versions (because those were broken), not > turning it on for newer versions (because it would be required by some > unfixed regression). The comment was never correct, because it was turning it off, because the combination of JIT + MATCH_INVALID_UTF (which was released in 10.34) + UCP is broken. > Do we need to nest the checks for GIT_PCRE2_VERSION_10_35_OR_HIGHER and > GIT_PCRE2_VERSION_10_43_OR_HIGHER? Keeping them separate would help > keep the #ifdef parts as short as possible and maintainable > independently. I thought that nesting them would make it simpler to maintain, since there are somehow connected anyway. The additional flags that are added in PCRE 10.43 are ONLY needed and useful on top of PCRE2_UCP, which itself only makes sense on top of UTF. > > @@ -27,13 +34,13 @@ do > > if ! test_have_prereq PERF_GREP_ENGINES_THREADS > > then > > test_perf "grep -P '$pattern'" --prereq PCRE " > > - git -P grep -f pat || : > > + git grep -P -f pat || : > > " > > else > > for threads in $GIT_PERF_GREP_THREADS > > do > > test_perf "grep -P '$pattern' with $threads threads" --prereq PTHREADS,PCRE " > > - git -c grep.threads=$threads -P grep -f pat || : > > + git -c grep.threads=$threads grep -P -f pat || : > > What's going on here? You remove the -P option of git (--no-pager) and > add the -P option of git grep (--perl-regexp). So this perf test never > tested PCRE despite its name? Seems worth a separate patch. My original code was a dud. This would make that at least more obvious, > Do the performance numbers in acabd2048e (grep: correctly identify utf-8 > characters with \{b,w} in -P, 2023-01-08) still hold up in that light? FWIW the performance numbers still hold up, but just because I did the tests independently using hyperfine, and indeed when I did the first version of this patch, I had a nice reproducible description that showed how to get 20% better performance while searching the git repository itself for something like 4 digits (as used in Copyright dates). My idea was to reply to my own post with instructions on how to test this, which basically require to also compile the recently released 10.43RC1 and build against that. Indeed, AFAIK the test would fail if built with an older version of PCRE, but wasn't sure if making a prerequisite that hardcoded the version in test-tool might be the best approach to prevent that, especially with all the libification work. Carlo PS. your reply got lost somehow in my SPAM folder, so I apologize for the late reply