Re: [PATCH v13 3/3] grep/pcre2: fix an edge case concerning ascii patterns and UTF-8 data

Junio C Hamano <gitster@xxxxxxxxx> · Thu, 18 Nov 2021 10:17:59 -0800

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> writes:

>> Let's have a look at the map.  Here are the differences between the
>> versions regarding use of PCRE2_UTF:
>>
>> o: opt->ignore_locale
>> h: has_non_ascii(p->pattern)
>> i: is_utf8_locale()
>> l: !opt->ignore_case && (p->fixed || p->is_fixed)
>>
>> o h i l master hamza rene2
>> 0 0 0 0      0     1     0
>> 0 0 0 1      0     1     0
>> 0 0 1 0      0     1     1
>> 0 0 1 1      0     1     0  <== 7812.13, confirmed using fprint() debugging
>>
>> So http://public-inbox.org/git/0ea73e7a-6d43-e223-ab2e-24c684102856@xxxxxx/
>> should not have this breakage, because it doesn't enable PCRE2_UTF for
>> literal patterns.
>
> PCRE2_UTF will also matter for literal patterns. Try to peel apart the
> two bytes in "é" and match them under -i with/without PCRE_UTF.

Sorry for being late to the party, but doesn't "literal" in the
context of this thread mean the column "l" above, i.e. we are not
ignoring case and fixed or is_fixed member is set?  So "under -i"
disqualifies as an example for "will also matter for literal",
doesn't it?

In hindsight, I guess we could have pushed a bit harder when René's

-	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
+	if (!opt->ignore_locale && is_utf8_locale() &&
 	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
 		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);

in https://public-inbox.org/git/0ea73e7a-6d43-e223-ab2e-24c684102856@xxxxxx/
(is that what is called 'rene2' above?) was raised on Oct 17th to
amend/fix Hamza's [v13 3/3]; that would have prevented 'master' from
having this breakage?

Carlo, in your [PATCH v2] in <20211117102329.95456-1-carenas@xxxxxxxxx>,
I see that the #else side for older PCREv2 users essentially reverts
what Hamza's [PATCH v13 3/3] did to this area.

+#else
+	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
+	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
+		options |= PCRE2_UTF;
+#endif

I guess this is a lot of change in the amount of text involved but
the least amount of actual change in the behaviour.  For those with
newer PCREv2, the behaviour would be the same as v2.34.0, and for
others, the behaviour would be the same as v2.33.0.

Having said all that, because the consensus seems to be that the
whole "when we should match in UTF mode" may need to be rethought, I
think reverting Hamza's [v13 3/3] would be the simplest way to clean
up the mess for v2.34.1 that will give us a cleaner slate to later
build on, than applying this patch.

So, I dunno.  Comments from those involved in the discussion?

Thanks.