On Tue, Jan 17 2023, Carlo Marcelo Arenas Belón wrote: > When UTF is enabled for a PCRE match, the PCRE2_UTF flag is used > by the pcre2_compile() call, but that would only allow for the > use of Unicode character properties when caseless is required > but not to include the additional UTF characters for all other > class matches. > > This would result in failed matches for expressions that rely > on those properties, for ex: > > $ git grep -P '\bÆvar' > > Add a configuration that could be used to enable the PCRE2_UCP > flag to correctly match those cases, when required. > > The use of this has an impact on performance that has been estimated > to be significant. > > Signed-off-by: Carlo Marcelo Arenas Belón <carenas@xxxxxxxxx> > --- > Changes since v2: > * make setting UCP and opt-in as suggested by Ævar To argue with myself here, I'm not so sure that just making this the default isn't the right move, especially as the GNU grep maintainer seems to be convinced that that's the right thing for grep(1). We've usually just followed GNU grep semantics, so if it's doing X and we're doing Y after this it's probably better to unify our behavior with theirs. I was mainly concerned with the behavior change sneaking in as a mere bugfix, I think it's OK if we change the behavior, as long as we're going into it with our eyes open... > * remove performance test and instead add a test ...I didn't follow the thread(s) where this may have been discussed, but I for one would like to see a perf test with this, but maybe it was removed for a good reason that I'm not aware of... > Documentation/config/grep.txt | 6 ++++++ > grep.c | 11 ++++++++++- > grep.h | 1 + > t/t7810-grep.sh | 13 +++++++++++++ > 4 files changed, 30 insertions(+), 1 deletion(-) > > diff --git a/Documentation/config/grep.txt b/Documentation/config/grep.txt > index e521f20390..8848db7311 100644 > --- a/Documentation/config/grep.txt > +++ b/Documentation/config/grep.txt > @@ -26,3 +26,9 @@ grep.fullName:: > grep.fallbackToNoIndex:: > If set to true, fall back to git grep --no-index if git grep > is executed outside of a git repository. Defaults to false. > + > +pcre.ucp:: > + If set to true, will use all Unicode Character Properties when matching > + `\w`, `\b`, `\d` or the POSIX classes (ex: `[:alnum:]`) and PCRE is used > + as the underlying engine. If PCRE is not being used it is ignored. > + Defaults to false There's a couple of exceptions to this, but we tend to stick config docs in their corresponding Documentation/config/<namespace>.txt, so this should be in a new Documentation/config/pcre.txt if we're adding this name. But I'd rather that we don't expose the implementation detail that we're using PCRE, which we haven't done so far. We just have a "grep.patternType=perl", which (and that's another issue, in any case) we should explicitly mention here, i.e. that this is for use with that config (and corresponding option(s)). I think calling this e.g.: grep.perl.Unicode=<bool> grep.patternTypePerl.Unicode=<bool> Or even: grep.patternTypePerl.Flags=u Would be better, i.e. PCRE's C API is really just mapping to the flags you can find in "perldoc perlre" (https://perldoc.perl.org/perlre). In this case the /u flag maps to the "PCRE2_UCP" API flag. That we happen to use PCRE to give ourselves "Perl" semantics is an implementation detail we should avoid exposing, so we could either give our config generic names, or literally map to the perl /flags/. For now we could just die on any "Flags" value that isn't "u". Of course all of this is predicated on us wanting to leave this as an opt-in, which I'm not so sure about. If it's opt-out we'll avoid this entire question, > diff --git a/grep.c b/grep.c > index 06eed69493..ceafb8937d 100644 > --- a/grep.c > +++ b/grep.c > @@ -102,6 +102,12 @@ int grep_config(const char *var, const char *value, void *cb) > return config_error_nonbool(var); > return color_parse(value, color); > } > + > + if (!strcmp(var, "pcre.ucp")) { > + opt->pcre_ucp = git_config_bool(var, value); > + return 0; > + } > + > return 0; > } > > @@ -292,8 +298,11 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt > } > options |= PCRE2_CASELESS; > } > - if (!opt->ignore_locale && is_utf8_locale() && !literal) > + if (!opt->ignore_locale && is_utf8_locale() && !literal) { > options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF); > + if (opt->pcre_ucp) > + options |= PCRE2_UCP; > + } This interaction with locale settings etc. is probably correct, but if we're keeping the config etc. we should really document how this interacts with those. I.e. you might expect "-c grep.patternType=perl -c <whatever_the_setting_is>=true" to give you UCP semantics, but we'll ignore it based on these other criteria. > #ifndef GIT_PCRE2_VERSION_10_36_OR_HIGHER > /* Work around https://bugs.exim.org/show_bug.cgi?id=2642 fixed in 10.36 */ > diff --git a/grep.h b/grep.h > index 6075f997e6..082bd3a0c7 100644 > --- a/grep.h > +++ b/grep.h > @@ -171,6 +171,7 @@ struct grep_opt { > int file_break; > int heading; > int max_count; > + int pcre_ucp; > void *priv; The reason for why we have some "bool"-like settings (like "int ignore_case") as an "int" as opposed to an "unsigned int <name>:1" bitfield is because we need to take their address via the parse_options() API. But in this case it's a purely internal field, so (and again, if we're keeping the option) let's use "unsigned int ...:1" here instead? Unless that is, we're expecting a corresponding command-line option. > > void (*output)(struct grep_opt *opt, const void *data, size_t size); > diff --git a/t/t7810-grep.sh b/t/t7810-grep.sh > index 8eded6ab27..a99a967060 100755 > --- a/t/t7810-grep.sh > +++ b/t/t7810-grep.sh > @@ -95,6 +95,7 @@ test_expect_success setup ' > then > echo "¿" >reverse-question-mark > fi && > + echo "émotion" >ucp && Here we carry this file through the entirety ouf our tests... > git add . && > test_tick && > git commit -m initial > @@ -1474,6 +1475,18 @@ test_expect_success PCRE 'grep -P backreferences work (the PCRE NO_AUTO_CAPTURE > test_cmp hello_world actual > ' > > +test_expect_success PCRE 'grep -c pcre.ucp -P fixes \b' ' > + cat >expected <<-\EOF && > + ucp:émotion ...only to use it in this one test, it clearly didn't harm anything (or rather, I didn't run this, but I expect you did and it passed), but how about avoiding more global state here doing: test_when_finished "rm -rf repo" && git init repo && test_commit -C repo msg ucp émotion && [...] > + EOF > + cat >pattern <<-\EOF && > + \bémotion > + EOF > + LC_ALL=en_US.UTF-8 git -c pcre.ucp=true grep -P -f pattern >actual && > + test_cmp expected actual && > + LC_ALL=en_US.UTF-8 test_must_fail git grep -P -f pattern This will break on platforms that don't have en_US.UTF-8 (and that's not hypothetical, some systems will skip installing locales for various reasons). I see we have some almost recent breakage 1819ad327b7 (grep: fix multibyte regex handling under macOS, 2022-08-26), but that one adds a new MB_REGEX prereq, which presumably fails if we don't have this locale. You can just use the existing "GETTEXT_LOCALE", which will piggy-back on our existing locale tests, or we could add a corresponding one for en_US.UTF-8 and refactor the existing MB_REGEX to be in lib-gettext.sh where it arguably belongs... > +' > + > test_expect_success 'grep -G invalidpattern properly dies ' ' > test_must_fail git grep -G "a[" > '