Re: [PATCH v3] grep: correctly identify utf-8 characters with \{b,w} in -P

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Tue, 17 Jan 2023 13:38:50 +0100

On Tue, Jan 17 2023, Carlo Marcelo Arenas Belón wrote:

> When UTF is enabled for a PCRE match, the PCRE2_UTF flag is used
> by the pcre2_compile() call, but that would only allow for the
> use of Unicode character properties when caseless is required
> but not to include the additional UTF characters for all other
> class matches.
>
> This would result in failed matches for expressions that rely
> on those properties, for ex:
>
>   $ git grep -P '\bÆvar'
>
> Add a configuration that could be used to enable the PCRE2_UCP
> flag to correctly match those cases, when required.
>
> The use of this has an impact on performance that has been estimated
> to be significant.
>
> Signed-off-by: Carlo Marcelo Arenas Belón <carenas@xxxxxxxxx>
> ---
> Changes since v2:
> * make setting UCP and opt-in as suggested by Ævar

To argue with myself here, I'm not so sure that just making this the
default isn't the right move, especially as the GNU grep maintainer
seems to be convinced that that's the right thing for grep(1).

We've usually just followed GNU grep semantics, so if it's doing X and
we're doing Y after this it's probably better to unify our behavior with
theirs.

I was mainly concerned with the behavior change sneaking in as a mere
bugfix, I think it's OK if we change the behavior, as long as we're
going into it with our eyes open...

> * remove performance test and instead add a test

...I didn't follow the thread(s) where this may have been discussed, but
I for one would like to see a perf test with this, but maybe it was
removed for a good reason that I'm not aware of...

>  Documentation/config/grep.txt |  6 ++++++
>  grep.c                        | 11 ++++++++++-
>  grep.h                        |  1 +
>  t/t7810-grep.sh               | 13 +++++++++++++
>  4 files changed, 30 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/config/grep.txt b/Documentation/config/grep.txt
> index e521f20390..8848db7311 100644
> --- a/Documentation/config/grep.txt
> +++ b/Documentation/config/grep.txt
> @@ -26,3 +26,9 @@ grep.fullName::
>  grep.fallbackToNoIndex::
>  	If set to true, fall back to git grep --no-index if git grep
>  	is executed outside of a git repository.  Defaults to false.
> +
> +pcre.ucp::
> +	If set to true, will use all Unicode Character Properties when matching
> +	`\w`, `\b`, `\d` or the POSIX classes (ex: `[:alnum:]`) and PCRE is used
> +	as the underlying engine. If PCRE is not being used it is ignored.
> +	Defaults to false

There's a couple of exceptions to this, but we tend to stick config docs
in their corresponding Documentation/config/<namespace>.txt, so this
should be in a new Documentation/config/pcre.txt if we're adding this
name.

But I'd rather that we don't expose the implementation detail that we're
using PCRE, which we haven't done so far. We just have a
"grep.patternType=perl", which (and that's another issue, in any case)
we should explicitly mention here, i.e. that this is for use with that
config (and corresponding option(s)).

I think calling this e.g.:

	grep.perl.Unicode=<bool>
	grep.patternTypePerl.Unicode=<bool>

Or even:

	grep.patternTypePerl.Flags=u

Would be better, i.e. PCRE's C API is really just mapping to the flags
you can find in "perldoc perlre" (https://perldoc.perl.org/perlre). In
this case the /u flag maps to the "PCRE2_UCP" API flag.

That we happen to use PCRE to give ourselves "Perl" semantics is an
implementation detail we should avoid exposing, so we could either give
our config generic names, or literally map to the perl /flags/.

For now we could just die on any "Flags" value that isn't "u".

Of course all of this is predicated on us wanting to leave this as an
opt-in, which I'm not so sure about. If it's opt-out we'll avoid this
entire question,

> diff --git a/grep.c b/grep.c
> index 06eed69493..ceafb8937d 100644
> --- a/grep.c
> +++ b/grep.c
> @@ -102,6 +102,12 @@ int grep_config(const char *var, const char *value, void *cb)
>  			return config_error_nonbool(var);
>  		return color_parse(value, color);
>  	}
> +
> +	if (!strcmp(var, "pcre.ucp")) {
> +		opt->pcre_ucp = git_config_bool(var, value);
> +		return 0;
> +	}
> +
>  	return 0;
>  }
>  
> @@ -292,8 +298,11 @@ static void compile_pcre2_pattern(struct grep_pat *p, const struct grep_opt *opt
>  		}
>  		options |= PCRE2_CASELESS;
>  	}
> -	if (!opt->ignore_locale && is_utf8_locale() && !literal)
> +	if (!opt->ignore_locale && is_utf8_locale() && !literal) {
>  		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
> +		if (opt->pcre_ucp)
> +			options |= PCRE2_UCP;
> +	}

This interaction with locale settings etc. is probably correct, but if
we're keeping the config etc. we should really document how this
interacts with those.

I.e. you might expect "-c grep.patternType=perl -c
<whatever_the_setting_is>=true" to give you UCP semantics, but we'll
ignore it based on these other criteria.

>  #ifndef GIT_PCRE2_VERSION_10_36_OR_HIGHER
>  	/* Work around https://bugs.exim.org/show_bug.cgi?id=2642 fixed in 10.36 */
> diff --git a/grep.h b/grep.h
> index 6075f997e6..082bd3a0c7 100644
> --- a/grep.h
> +++ b/grep.h
> @@ -171,6 +171,7 @@ struct grep_opt {
>  	int file_break;
>  	int heading;
>  	int max_count;
> +	int pcre_ucp;
>  	void *priv;

The reason for why we have some "bool"-like settings (like "int
ignore_case") as an "int" as opposed to an "unsigned int <name>:1"
bitfield is because we need to take their address via the
parse_options() API.

But in this case it's a purely internal field, so (and again, if we're
keeping the option) let's use "unsigned int ...:1" here instead?

Unless that is, we're expecting a corresponding command-line option.

>  
>  	void (*output)(struct grep_opt *opt, const void *data, size_t size);
> diff --git a/t/t7810-grep.sh b/t/t7810-grep.sh
> index 8eded6ab27..a99a967060 100755
> --- a/t/t7810-grep.sh
> +++ b/t/t7810-grep.sh
> @@ -95,6 +95,7 @@ test_expect_success setup '
>  	then
>  		echo "¿" >reverse-question-mark
>  	fi &&
> +	echo "émotion" >ucp &&

Here we carry this file through the entirety ouf our tests...

>  	git add . &&
>  	test_tick &&
>  	git commit -m initial
> @@ -1474,6 +1475,18 @@ test_expect_success PCRE 'grep -P backreferences work (the PCRE NO_AUTO_CAPTURE
>  	test_cmp hello_world actual
>  '
>  
> +test_expect_success PCRE 'grep -c pcre.ucp -P fixes \b' '
> +	cat >expected <<-\EOF &&
> +	ucp:émotion

...only to use it in this one test, it clearly didn't harm anything (or
rather, I didn't run this, but I expect you did and it passed), but how
about avoiding more global state here doing:

	test_when_finished "rm -rf repo" &&
	git init repo &&
	test_commit -C repo msg ucp émotion &&
	[...]

> +	EOF
> +	cat >pattern <<-\EOF &&
> +	\bémotion
> +	EOF
> +	LC_ALL=en_US.UTF-8 git -c pcre.ucp=true grep -P -f pattern >actual &&
> +	test_cmp expected actual &&
> +	LC_ALL=en_US.UTF-8 test_must_fail git grep -P -f pattern

This will break on platforms that don't have en_US.UTF-8 (and that's not
hypothetical, some systems will skip installing locales for various
reasons).

I see we have some almost recent breakage 1819ad327b7 (grep: fix
multibyte regex handling under macOS, 2022-08-26), but that one adds a
new MB_REGEX prereq, which presumably fails if we don't have this
locale.

You can just use the existing "GETTEXT_LOCALE", which will piggy-back on
our existing locale tests, or we could add a corresponding one for
en_US.UTF-8 and refactor the existing MB_REGEX to be in lib-gettext.sh
where it arguably belongs...

> +'
> +
>  test_expect_success 'grep -G invalidpattern properly dies ' '
>  	test_must_fail git grep -G "a["
>  '