Re: [PATCH v13 3/3] grep/pcre2: fix an edge case concerning ascii patterns and UTF-8 data

René Scharfe <l.s.r@xxxxxx> · Sat, 16 Oct 2021 18:25:20 +0200

Am 15.10.21 um 22:03 schrieb Junio C Hamano:
> Hamza Mahfooz <someguy@xxxxxxxxxxxxxxxxxxx> writes:
>
>> If we attempt to grep non-ascii log message text with an ascii pattern, we
>
> "with an ascii pattern, when Git is built with and told to use pcre2, we"
>
>> run into the following issue:
>>
>>     $ git log --color --author='.var.*Bjar' -1 origin/master | grep ^Author
>>     grep: (standard input): binary file matches

I get no error message on macOS 11.6, but this result, with the underlined
part in red:

   Author: ??var Arnfjörð Bjarmason <avarab@xxxxxxxxx>
            ^^^^^^^^^^^^^^^^^^

So the pattern matches the second byte of a two-byte character, inserts a
color code in the middle and thus produces invalid output in this case.

>>
>> So, to fix this teach the grep code to use PCRE2_UTF, as long as the log
>> output is encoded in UTF-8.
>
>> -	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
>> -	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
>> +	if ((!opt->ignore_locale && !has_non_ascii(p->pattern)) ||
>> +	    (!opt->ignore_locale && is_utf8_locale() &&
>> +	     has_non_ascii(p->pattern) && !(!opt->ignore_case &&
>> +					    (p->fixed || p->is_fixed))))
>
> That's a mouthful.  It is not obvious what new condition is being
> added.  I had to flip the order to see the only difference is, that
>
>> -	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
>> -	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
>> +	if ((!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
>> +	    !(!opt->ignore_case && (p->fixed || p->is_fixed))) ||
>> +	    (!opt->ignore_locale && !has_non_ascii(p->pattern)))
>
> ... in addition to the case where the original condition holds, if
> we do not say "ignore locale" and the pattern is ascii-only, we
> apply these two option flags.  And that matches what the proposed
> log message explained as the condition the problem appears.
>
> So,... looks good, I guess.
>
> Thanks, will queue.
>
>
> Addendum.
>
> If we were reordering pieces in the condition, I wonder if there is
> a better way to reorganize it, though.  The original is already
> barely explainable with words, and with this new condition added, I
> am not sure if anybody can phrase the condition in simple words to
> others after staring it for a few minutes.  I can't.
>
> But straightening it out is best left as a future clean-up patch,
> separate from this series.
>

It can be written as:

	literal = !opt->ignore_case && (p->fixed || p->is_fixed);
	if (!opt->ignore_locale) {
		if (!has_non_ascii(p->pattern) ||
		    (is_utf8_locale() && !literal))
			options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
	}

Literal patterns are those that don't use any wildcards or case-folding.
If the text is encoded in UTF-8 then we enable PCRE2_UTF either if the
pattern only consists of ASCII characters, or if the pattern is encoded
in UTF-8 and is not just a literal pattern.

Hmm.  Why enable PCRE2_UTF for literal patterns that consist of only
ASCII chars?

The old condition was (reformatted to better match the new one):

	if (!opt->ignore_locale) {
		if (is_utf8_locale() && has_non_ascii(p->pattern) && !literal)
			options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
	}

Intuitively I'd say the condition should be:

	if (!opt->ignore_locale && is_utf8_locale()) {
		if (has_non_ascii(p->pattern) || !literal)
			options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
	}

If both input and pattern are encoded in UTF-8, enable PCRE2_UTF if we
have to match non-ASCII characters or do more than just literal
matching.

For literal patterns that consist only of ASCII characters we don't need
the cost and complication of PCRE2_UTF.

Makes sense?

René