On 24/01/2021 11:48, Ævar Arnfjörð Bjarmason wrote:
> Improve the support for invalid UTF-8 haystacks given a non-ASCII
> needle when using the PCREv2 backend.
>
> This is a more complete fix for a bug I started to fix in
> 870eea8166 (grep: do not enter PCRE2_UTF mode on fixed matching,
> 2019-07-26), now that PCREv2 has the PCRE2_MATCH_INVALID_UTF mode we
> can make use of it.
>
> This fixes the sort of case described in 8a5999838e (grep: stess test
> PCRE v2 on invalid UTF-8 data, 2019-07-26), i.e.:
>
>  - The subject string is non-ASCII (e.g. "ævar")
>  - We're under a is_utf8_locale(), e.g. "en_US.UTF-8", not "C"
>  - We are using --ignore-case, or we're a non-fixed pattern
>
> If those conditions were satisfied and we matched found non-valid
> UTF-8 data PCREv2 might bark on it, in practice this only happened
> under the JIT backend (turned on by default on most platforms).
>
> Ultimately this fixes a "regression" in b65abcafc7 ("grep: use PCRE v2
> for optimized fixed-string search", 2019-07-01), I'm putting that in
> scare-quotes because before then we wouldn't properly support these
> complex case-folding, locale etc. cases either, it just broke in
> different ways.
>
> There was a bug related to this the PCRE2_NO_START_OPTIMIZE flag fixed
> in PCREv2 10.36. It can be worked around by setting the
> PCRE2_NO_START_OPTIMIZE flag. Let's do that in those cases, and add
> tests for the bug.
>
> Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx>
> ---
>  Makefile                        |  1 +
>  grep.c                          |  8 +++++-
>  grep.h                          |  4 +++
>  t/helper/test-pcre2-config.c    | 12 +++++++++
>  t/helper/test-tool.c            |  1 +
>  t/helper/test-tool.h            |  1 +
>  t/t7812-grep-icase-non-ascii.sh | 46 ++++++++++++++++++++++++++++++++-
>  7 files changed, 71 insertions(+), 2 deletions(-)
>  create mode 100644 t/helper/test-pcre2-config.c
>
> diff --git a/Makefile b/Makefile
> index 4edfda3e00..42a7ed96e2 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -722,6 +722,7 @@ TEST_BUILTINS_OBJS += test-online-cpus.o
>  TEST_BUILTINS_OBJS += test-parse-options.o
>  TEST_BUILTINS_OBJS += test-parse-pathspec-file.o
>  TEST_BUILTINS_OBJS += test-path-utils.o
> +TEST_BUILTINS_OBJS += test-pcre2-config.o
>  TEST_BUILTINS_OBJS += test-pkt-line.o
>  TEST_BUILTINS_OBJS += test-prio-queue.o
>  TEST_BUILTINS_OBJS += test-proc-receive.o
> diff --git a/grep.c b/grep.c
> index efeb6dc58d..e329f19877 100644
> --- a/grep.c
> +++ b/grep.c
> @@ -492,7 +492,13 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
>          }
>          if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
>              !(!opt->ignore_case && (p->fixed || p->is_fixed)))
> -                options |= PCRE2_UTF;
> +                options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
> +
> +        if (PCRE2_MATCH_INVALID_UTF &&
> +            options & (PCRE2_UTF | PCRE2_CASELESS) &&
> +            !(PCRE2_MAJOR >= 10 && PCRE2_MAJOR >= 36))
                                      ^^^^^^^^^^^^^^^^^^
I assume that this should be s/_MAJOR/_MINOR/. ;-)

> +                /* Work around https://bugs.exim.org/show_bug.cgi?id=2642 fixed in 10.36 */
> +                options |= PCRE2_NO_START_OPTIMIZE;
>
>          p->pcre2_pattern = pcre2_compile((PCRE2_SPTR)p->pattern,
>                                           p->patternlen, options, &error, &erroffset,

ATB,
Ramsay Jones
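
For illustration only, here is a minimal sketch of the version test under discussion. As quoted, `PCRE2_MAJOR >= 10 && PCRE2_MAJOR >= 36` can only be true for a major version of 36 or higher, so the workaround would be enabled on every current release. The helper names below (pcre2_version_at_least(), needs_no_start_optimize()) are hypothetical and are not part of PCRE2 or git, and plain integers stand in for the PCRE2_MAJOR and PCRE2_MINOR macros from pcre2.h. The sketch compares the major number first, on the assumption that a later major release (e.g. 11.0) should also count as "10.36 or newer", which a literal s/_MAJOR/_MINOR/ would not do.

```c
#include <assert.h>

/*
 * Hypothetical helper (not part of PCRE2 or git): returns 1 when the
 * version major.minor is at least want_major.want_minor, else 0.
 */
static int pcre2_version_at_least(int major, int minor,
				  int want_major, int want_minor)
{
	if (major != want_major)
		return major > want_major;
	return minor >= want_minor;
}

/*
 * Sketch of the decision in the quoted hunk: the
 * PCRE2_NO_START_OPTIMIZE workaround is wanted only when the
 * library predates 10.36. In real code the arguments would be
 * PCRE2_MAJOR and PCRE2_MINOR.
 */
static int needs_no_start_optimize(int major, int minor)
{
	return !pcre2_version_at_least(major, minor, 10, 36);
}

/* Self-check of the boundary cases around 10.36. */
static void check_version_logic(void)
{
	assert(needs_no_start_optimize(10, 35));
	assert(!needs_no_start_optimize(10, 36));
	assert(!needs_no_start_optimize(11, 0));
}
```

The same comparison could equally be done in the preprocessor, since both macros are compile-time constants; the function form is used here only so the boundary cases are easy to check.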