On Sun, Jan 30, 2022 at 08:55:02AM +0100, René Scharfe wrote:
> On 29.01.22 at 18:25, SZEDER Gábor wrote:
> > On Sat, Dec 18, 2021 at 08:50:02PM +0100, René Scharfe wrote:
> >> compile_pcre2_pattern() currently uses the option PCRE2_UTF only for
> >> patterns with non-ASCII characters. Patterns with ASCII wildcards can
> >> match non-ASCII strings, though. Without that option PCRE2 mishandles
> >> UTF-8 input, though -- it matches parts of multi-byte characters. Fix
> >> that by using PCRE2_UTF even for ASCII-only patterns.
> >>
> >> This is a remake of the reverted ae39ba431a (grep/pcre2: fix an edge
> >> case concerning ascii patterns and UTF-8 data, 2021-10-15). The change
> >> to the condition and the test are simplified and more targeted.
> >>
> >> Original-patch-by: Hamza Mahfooz <someguy@xxxxxxxxxxxxxxxxxxx>
> >> Signed-off-by: René Scharfe <l.s.r@xxxxxx>
> >> ---
> >>  grep.c                          | 2 +-
> >>  t/t7812-grep-icase-non-ascii.sh | 6 ++++++
> >>  2 files changed, 7 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/grep.c b/grep.c
> >> index fe847a0111..5badb6d851 100644
> >> --- a/grep.c
> >> +++ b/grep.c
> >> @@ -382,7 +382,7 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
> >>  		}
> >>  		options |= PCRE2_CASELESS;
> >>  	}
> >> -	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
> >> +	if (!opt->ignore_locale && is_utf8_locale() &&
> >>  	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
> >>  		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
> >>
> >
> > I tried to use 'git grep -P' for the first time ever, and it hung
> > right away, spinning all CPUs at 100%. I could narrow it down, both
> > the complexity of the pattern and the size of the input, see the test
> > below, and it bisects to this patch.
> >
> >
> >   --- >8 ---
> >
> > #!/bin/sh
> >
> > test_description='test'
> >
> > . ./test-lib.sh
> >
> > test_expect_success PCRE 'test' '
> > 	# LC_ALL=C works
> > 	LC_ALL=en_US.UTF-8 &&
> > 	cat >ascii <<-\EOF &&
> > 	foo
> > 	 bar
> > 	 baz
> > 	EOF
> > 	cat >utf8 <<-\EOF &&
> > 	foo
> > 	 bar
> > 	 báz
> > 	EOF
> > 	git add ascii utf8 &&
> >
> > 	# These all work as expected:
> > 	git grep --threads=1 -P " " ascii &&
> > 	git grep --threads=1 -P "^ " ascii &&
> > 	git grep --threads=1 -P "\s" ascii &&
> > 	git grep --threads=1 -P "^\s" ascii &&
> > 	git grep --threads=1 -P " " utf8 &&
> > 	git grep --threads=1 -P "^ " utf8 &&
> > 	git grep --threads=1 -P "\s" utf8 &&
> >
> > 	# This hangs (but it does work with basic and extended regexp):
> > 	git grep --threads=1 -P "^\s" utf8
> > '
> >
> > test_done
>
> I get the following result and no hang with PCRE2 10.39:
>
> utf8: bar
> utf8: báz
>
> e0c6029 (Fix inifinite loop when a single byte newline is searched in
> JIT., 2020-05-29) [1] sounds like it might have fixed it. It's part of
> version 10.36.

I saw this hang on two Ubuntu 20.04 based boxes, which predate that
fix you mention only by a month or two, and apparently the almost two
years since then was not enough for this fix to trickle down into
updated 20.04 pcre packages, because:

> Do you still get the error when you disable JIT, i.e. when you use the
> pattern "(*NO_JIT)^\s" instead?

No, with this pattern it works as expected.

So is there a more convenient way to disable PCRE JIT in Git?

FWIW, (non-git) 'grep -P' works with the same patterns.

> René
>
>
> [1] https://github.com/PhilipHazel/pcre2/commit/e0c6029a62db9c2161941ecdf459205382d4d379
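
Until something more convenient exists, the "(*NO_JIT)" workaround
confirmed above could be wrapped in a small shell function. This is only
a sketch: no_jit is a hypothetical helper name, not something Git
provides, and it does nothing more than prepend the verb to the pattern.

```shell
#!/bin/sh

# Hypothetical helper (not part of Git): prefix a PCRE pattern with the
# (*NO_JIT) verb so that PCRE2 does not JIT-compile it.
# printf's %s leaves backslashes in the pattern untouched.
no_jit () {
	printf '(*NO_JIT)%s\n' "$1"
}

# Usage, assuming a Git built with PCRE2 support and the 'utf8' file
# from the test above:
#   git grep --threads=1 -P "$(no_jit '^\s')" utf8
```

The verb has to stay at the very start of the pattern, which is why the
helper prepends it rather than appending it.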