Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> writes:

>> Let's have a look at the map.  Here are the differences between the
>> versions regarding use of PCRE2_UTF:
>>
>>    o: opt->ignore_locale
>>    h: has_non_ascii(p->pattern)
>>    i: is_utf8_locale()
>>    l: !opt->ignore_case && (p->fixed || p->is_fixed)
>>
>>    o h i l master hamza rene2
>>    0 0 0 0      0     1     0
>>    0 0 0 1      0     1     0
>>    0 0 1 0      0     1     1
>>    0 0 1 1      0     1     0 <== 7812.13, confirmed using fprint() debugging
>>
>> So http://public-inbox.org/git/0ea73e7a-6d43-e223-ab2e-24c684102856@xxxxxx/
>> should not have this breakage, because it doesn't enable PCRE2_UTF for
>> literal patterns.
>
> PCRE2_UTF will also matter for literal patterns. Try to peel apart the
> two bytes in "é" and match them under -i with/without PCRE_UTF.

Sorry for being late to the party, but doesn't "literal" in the
context of this thread mean the column "l" above, i.e. we are not
ignoring case and fixed or is_fixed member is set?  So "under -i"
disqualifies as an example for "will also matter for literal",
doesn't it?

In hindsight, I guess we could have pushed a bit harder when René's

    -	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
    +	if (!opt->ignore_locale && is_utf8_locale() &&
     	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
     		options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);

in https://public-inbox.org/git/0ea73e7a-6d43-e223-ab2e-24c684102856@xxxxxx/
(is that what is called 'rene2' above?) was raised on Oct 17th to
amend/fix Hamza's [v13 3/3]; that would have prevented 'master' from
having this breakage?

Carlo, in your [PATCH v2] in
<20211117102329.95456-1-carenas@xxxxxxxxx>, I see that the #else
side for older PCREv2 users essentially reverts what Hamza's [PATCH
v13 3/3] did to this area.

    +#else
    +	if (!opt->ignore_locale && is_utf8_locale() && has_non_ascii(p->pattern) &&
    +	    !(!opt->ignore_case && (p->fixed || p->is_fixed)))
    +		options |= PCRE2_UTF;
    +#endif

I guess this is a lot of change in the amount of text involved but
the least amount of actual change in the behaviour.  For those with
newer PCREv2, the behaviour would be the same as v2.34.0, and for
others, the behaviour would be the same as v2.33.0.

Having said all that, because the consensus seems to be that the
whole "when we should match in UTF mode" may need to be rethought, I
think reverting Hamza's [v13 3/3] would be the simplest way to clean
up the mess for v2.34.1 that will give us a cleaner slate to later
build on, than applying this patch.

So, I dunno.  Comments from those involved in the discussion?

Thanks.
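For anybody who wants to double-check the o/h/i/l map by machine
rather than by eye, here is a rough standalone sketch.  It is not
code from grep.c: the helper names utf_before(), utf_after() and
print_utf_differences() are made up for illustration, and the four
conditions are reduced to plain booleans.  It only encodes the two
conditions visible on the '-' and '+' lines of the René hunk quoted
above, and lists the combinations where they disagree.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical stand-ins for the four inputs in the map:
 *   o: opt->ignore_locale
 *   h: has_non_ascii(p->pattern)
 *   i: is_utf8_locale()
 *   l: !opt->ignore_case && (p->fixed || p->is_fixed)
 */

/* Condition on the '-' line of the hunk: requires a non-ASCII pattern. */
static bool utf_before(bool o, bool h, bool i, bool l)
{
	return !o && i && h && !l;
}

/* Condition on the '+' line: the same, minus the has_non_ascii() check. */
static bool utf_after(bool o, bool h, bool i, bool l)
{
	(void)h; /* no longer consulted */
	return !o && i && !l;
}

/*
 * Walk all 16 combinations, print the ones where the two conditions
 * disagree, and return how many such rows there are.
 */
static int print_utf_differences(void)
{
	int count = 0;

	printf("o h i l before after\n");
	for (int bits = 0; bits < 16; bits++) {
		bool o = bits & 8, h = bits & 4, i = bits & 2, l = bits & 1;
		bool before = utf_before(o, h, i, l);
		bool after = utf_after(o, h, i, l);

		if (before == after)
			continue;
		printf("%d %d %d %d %6d %5d\n", o, h, i, l, before, after);
		count++;
	}
	return count;
}
```

Calling print_utf_differences() from a trivial main() should show
that the two conditions part ways only for an ASCII-only, non-literal
pattern in a UTF-8 locale, which is the "0 0 1 0" row of the map.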