On 18.11.21 at 19:17, Junio C Hamano wrote:
> Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> writes:
>
>>> Let's have a look at the map.  Here are the differences between the
>>> versions regarding use of PCRE2_UTF:
>>>
>>> o: opt->ignore_locale
>>> h: has_non_ascii(p->pattern)
>>> i: is_utf8_locale()
>>> l: !opt->ignore_case && (p->fixed || p->is_fixed)
>>>
>>> o h i l master hamza rene2
>>> 0 0 0 0 0      1     0
>>> 0 0 0 1 0      1     0
>>> 0 0 1 0 0      1     1
>>> 0 0 1 1 0      1     0   <== 7812.13, confirmed using fprint() debugging
>>>
>>> So http://public-inbox.org/git/0ea73e7a-6d43-e223-ab2e-24c684102856@xxxxxx/
>>> should not have this breakage, because it doesn't enable PCRE2_UTF for
>>> literal patterns.
>>
>> PCRE2_UTF will also matter for literal patterns. Try to peel apart the
>> two bytes in "é" and match them under -i with/without PCRE_UTF.
>
> Sorry for being late to the party, but doesn't "literal" in the
> context of this thread mean the column "l" above, i.e. we are not
> ignoring case and fixed or is_fixed member is set?  So "under -i"
> disqualifies as an example for "will also matter for literal",
> doesn't it?

Correct.

> In hindsight, I guess we could have pushed a bit harder when René's
>
> -       if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
> +       if (!opt->ignore_locale && is_utf8_locale() &&
>             !(!opt->ignore_case && (p->fixed || p->is_fixed)))
>                 options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
>
> in https://public-inbox.org/git/0ea73e7a-6d43-e223-ab2e-24c684102856@xxxxxx/
> (is that what is called 'rene2' above?) was raised on Oct 17th to
> amend/fix Hamza's [v13 3/3]; that would have prevented 'master' from
> having this breakage?

Yes, that's the change I meant with "rene2".

> Carlo, in your [PATCH v2] in <20211117102329.95456-1-carenas@xxxxxxxxx>,
> I see that the #else side for older PCREv2 users essentially reverts
> what Hamza's [PATCH v13 3/3] did to this area.
>
> +#else
> +       if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
> +           !(!opt->ignore_case && (p->fixed || p->is_fixed)))
> +               options |= PCRE2_UTF;
> +#endif
>
> I guess this is a lot of change in the amount of text involved but
> the least amount of actual change in the behaviour.  For those with
> newer PCREv2, the behaviour would be the same as v2.34.0, and for
> others, the behaviour would be the same as v2.33.0.
>
> Having said all that, because the consensus seems to be that the
> whole "when we should match in UTF mode" may need to be rethought, I
> think reverting Hamza's [v13 3/3] would be the simplest way to clean
> up the mess for v2.34.1 that will give us a cleaner slate to later
> build on, than applying this patch.

Makes sense to me.  It gives a better starting point to solve the issue
afresh without getting entangled in mind-melting boolean expressions.

René
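
For reference, the two conditions that are quoted in full above can be tabulated mechanically instead of by hand. The following is only a minimal sketch: the function names are hypothetical and not part of grep.c, the four flags are passed in as plain booleans using the o/h/i/l naming from the table, and the "hamza" column is left out because its condition is not quoted in this message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Condition on the removed ("-") line of the quoted diff, which still
 * checks has_non_ascii().  Hypothetical helper, not a grep.c function.
 */
static bool utf_with_non_ascii_check(bool o, bool h, bool i, bool l)
{
	return !o && i && h && !l;
}

/* Condition on the added ("+") line, called "rene2" in the thread. */
static bool utf_rene2(bool o, bool i, bool l)
{
	return !o && i && !l;
}

/* Check the sketch against the four "rene2" rows quoted in the table. */
static void check_quoted_rows(void)
{
	assert(!utf_rene2(false, false, false));
	assert(!utf_rene2(false, false, true));
	assert(utf_rene2(false, true, false));
	assert(!utf_rene2(false, true, true));
}

/* Print one row for each of the 16 combinations of o, h, i and l. */
static void print_utf_table(void)
{
	printf("o h i l  with_h rene2\n");
	for (int bits = 0; bits < 16; bits++) {
		bool o = bits & 8, h = bits & 4, i = bits & 2, l = bits & 1;

		printf("%d %d %d %d  %d      %d\n", o, h, i, l,
		       utf_with_non_ascii_check(o, h, i, l),
		       utf_rene2(o, i, l));
	}
}
```

Calling check_quoted_rows() and then print_utf_table() from a main() shows that the two conditions differ only in rows where h is 0, i.e. for all-ASCII patterns.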