Am 02.01.24 um 20:02 schrieb Carlo Arenas: > On Mon, Jan 1, 2024 at 9:18 AM René Scharfe <l.s.r@xxxxxx> wrote: >> >> Am 01.01.24 um 16:03 schrieb Carlo Marcelo Arenas Belón: >>> @@ -321,17 +327,22 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt >>> } >>> options |= PCRE2_CASELESS; >>> } >>> - if (!opt->ignore_locale && is_utf8_locale() && !literal) >>> - options |= (PCRE2_UTF | PCRE2_UCP | PCRE2_MATCH_INVALID_UTF); >>> + if (!opt->ignore_locale && is_utf8_locale() && !literal) { >>> + options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF); >>> >>> -#ifndef GIT_PCRE2_VERSION_10_35_OR_HIGHER >>> - /* >>> - * Work around a JIT bug related to invalid Unicode character handling >>> - * fixed in 10.35: >>> - * https://github.com/PCRE2Project/pcre2/commit/c21bd977547d >>> - */ >>> - options &= ~PCRE2_UCP; >>> +#ifdef GIT_PCRE2_VERSION_10_35_OR_HIGHER >>> + /* >>> + * Work around a JIT bug related to invalid Unicode character handling >>> + * fixed in 10.35: >>> + * https://github.com/PCRE2Project/pcre2/commit/c21bd977547d >>> + */ >>> + options |= PCRE2_UCP; >>> +#ifdef GIT_PCRE2_VERSION_10_43_OR_HIGHER >>> + if (!opt->perl_digit) >>> + xoptions |= (PCRE2_EXTRA_ASCII_BSD | PCRE2_EXTRA_ASCII_DIGIT); >>> #endif >>> +#endif >> >> Why do we need that extra level of indentation? > > I was just trying to simplify the code by including all the logic in > one single set. > > The original lack of indentation that was introduced by later fixes to > the code was IMHO also misguided since the obvious "objective" as set > in the original code that added PCRE2_UCP was that it should be used > whenever UTF was also in use as shown by > acabd2048ee0ee53728100408970ab45a6dab65e. One level of indentation is correct because the code in question is part of a function and it's not nested in a loop or conditional. And preprocessor directives don't affect the indentation of C code. Am I missing something? > Of course, we soon found out that the original implementation of > PCRE2_MATCH_INVALID_UTF that came with PCRE2 10.34 was buggy and so an > exception was introduced in 14b9a044798ebb3858a1f1a1377309a3d6054ac8. > > Note that the problematic code is only relevant when JIT is also > enabled, but JIT is almost always enabled. > >> The old code turned PCRE2_UCP on by default and turned it off for older >> versions. The new code enables PCRE2_UCP only for newer versions. The >> result should be the same, no? So why change that part at all? > > Because it gets us a little closer to the real reason why we need to > disable UCP for anything older than 10.35, and that is that there is a > bug there that is ONLY relevant if we are using JIT. How so? If the same flags are set in the end then it seems like a lateral movement to me. Do you want to disable JIT compilation for older versions? > My hope though is that with the release of 10.43 (currently in RC1), > 10.34 will become irrelevant soon enough and this whole code could be > cleaned up further. Cleanup is good, but if we package the workarounds nicely we can keep them around indefinitely without them bothering us. Debian buster still has a few months of support left and comes with PCRE2 10.32.. >> But the comment is now off, isn't it? The workaround was turning >> PCRE2_UCP off for older versions (because those were broken), not >> turning it on for newer versions (because it would be required by some >> unfixed regression). > > The comment was never correct, because it was turning it off, because > the combination of JIT + MATCH_INVALID_UTF (which was released in > 10.34) + UCP is broken. The code disabled PCRE2_UCP for PCRE2 10.34 and older. The comment claimed that this was done as a workaround for a bug fixed in 10.35. You seem to say the same. What was incorrect about the comment? >> Do we need to nest the checks for GIT_PCRE2_VERSION_10_35_OR_HIGHER and >> GIT_PCRE2_VERSION_10_43_OR_HIGHER? Keeping them separate would help >> keep the #ifdef parts as short as possible and maintainable >> independently. > > I thought that nesting them would make it simpler to maintain, since > there are somehow connected anyway. > > The additional flags that are added in PCRE 10.43 are ONLY needed and > useful on top of PCRE2_UCP, which itself only makes sense on top of > UTF. This conditional stacking is complicated. I find the old model where we say which features we want up front and then disable buggy ones for specific versions easier to grasp. >>> @@ -27,13 +34,13 @@ do >>> if ! test_have_prereq PERF_GREP_ENGINES_THREADS >>> then >>> test_perf "grep -P '$pattern'" --prereq PCRE " >>> - git -P grep -f pat || : >>> + git grep -P -f pat || : >>> " >>> else >>> for threads in $GIT_PERF_GREP_THREADS >>> do >>> test_perf "grep -P '$pattern' with $threads threads" --prereq PTHREADS,PCRE " >>> - git -c grep.threads=$threads -P grep -f pat || : >>> + git -c grep.threads=$threads grep -P -f pat || : >> >> What's going on here? You remove the -P option of git (--no-pager) and >> add the -P option of git grep (--perl-regexp). So this perf test never >> tested PCRE despite its name? Seems worth a separate patch. > > My original code was a dud. This would make that at least more obvious, The change is good, but I don't see any connection to the grep.perl.digit feature that this patch is primarily about, so I (still) think it deserves its own patch. >> Do the performance numbers in acabd2048e (grep: correctly identify utf-8 >> characters with \{b,w} in -P, 2023-01-08) still hold up in that light? > > FWIW the performance numbers still hold up, but just because I did the > tests independently using hyperfine, and indeed when I did the first > version of this patch, I had a nice reproducible description that > showed how to get 20% better performance while searching the git > repository itself for something like 4 digits (as used in Copyright > dates). Great! > My idea was to reply to my own post with instructions on how to test > this, which basically require to also compile the recently released > 10.43RC1 and build against that. OK, so this is about the grep.perl.digit feature, right? What I meant above was: Is the statement in acabd2048e (grep: correctly identify utf-8 characters with \{b,w} in -P, 2023-01-08) about 20-40% performance impact (in which direction, by the way) still measurable with the fixed perf test script? With the fix and PCRE2 10.42 and PCRE2_UCP I get: Test this tree ---------------------------------------------- 7822.1: grep -P '\bhow' 0.05(0.02+0.30) 7822.2: grep -P '\bÆvar' 0.05(0.02+0.29) 7822.3: grep -P '\d+ \bÆvar' 0.05(0.02+0.27) 7822.4: grep -P '\bBelón\b' 0.04(0.02+0.23) 7822.5: grep -P '\w{12}\b' 0.03(0.02+0.13) ... and without PCRE2_UCP: Test this tree ---------------------------------------------- 7822.1: grep -P '\bhow' 0.05(0.02+0.26) 7822.2: grep -P '\bÆvar' 0.04(0.02+0.18) 7822.3: grep -P '\d+ \bÆvar' 0.05(0.02+0.26) 7822.4: grep -P '\bBelón\b' 0.05(0.02+0.27) 7822.5: grep -P '\w{12}\b' 0.04(0.02+0.18) Hmm, inconclusive. Perhaps the test data is too small? > Indeed, AFAIK the test would fail if built with an older version of > PCRE, but wasn't sure if making a prerequisite that hardcoded the > version in test-tool might be the best approach to prevent that, > especially with all the libification work. You could extend test-pcre2-config to report whether that feature is active. Performance tests could set prerequisites based on its output. > PS. your reply got lost somehow in my SPAM folder, so I apologize for > the late reply No worries, I need days to reply without any detour through the spam folder.. René