On Wed, Nov 17 2021, René Scharfe wrote:

> On 16.11.21 at 10:38, Carlo Arenas wrote:
>> On Tue, Nov 16, 2021 at 1:30 AM Andreas Schwab <schwab@xxxxxxxxxxxxxx> wrote:
>>>
>>> expecting success of 7812.13 'PCRE v2: grep ASCII from invalid UTF-8 data':
>>>         git grep -h "var" invalid-0x80 >actual &&
>>>         test_cmp expected actual &&
>>>         git grep -h "(*NO_JIT)var" invalid-0x80 >actual &&
>>>         test_cmp expected actual
>>>
>>> ++ git grep -h var invalid-0x80
>>> ++ test_cmp expected actual
>>> ++ test 2 -ne 2
>>> ++ eval 'diff -u' '"$@"'
>>> +++ diff -u expected actual
>>> ++ git grep -h '(*NO_JIT)var' invalid-0x80
>>> fatal: pcre2_match failed with error code -22: UTF-8 error: isolated byte with 0x80 bit set
>>
>> That is exactly what I was worried about, this is not failing one
>> test, but making `git grep` unusable in any repository that has any
>> binary files that might be reachable by it, and it is likely affecting
>> anyone using PCRE older than 10.34
>
> Let's have a look at the map.  Here are the differences between the
> versions regarding use of PCRE2_UTF:
>
> o: opt->ignore_locale
> h: has_non_ascii(p->pattern)
> i: is_utf8_locale()
> l: !opt->ignore_case && (p->fixed || p->is_fixed)
>
> o h i l master hamza rene2
> 0 0 0 0      0     1     0
> 0 0 0 1      0     1     0
> 0 0 1 0      0     1     1
> 0 0 1 1      0     1     0 <== 7812.13, confirmed using fprint() debugging
>
> So http://public-inbox.org/git/0ea73e7a-6d43-e223-ab2e-24c684102856@xxxxxx/
> should not have this breakage, because it doesn't enable PCRE2_UTF for
> literal patterns.

PCRE2_UTF will also matter for literal patterns. Try to peel apart the
two bytes in "é" and match them under -i with/without PCRE2_UTF.

I don't know what the right behavior should be here (haven't had time to
dig), but it matters for matching.
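
To make the "peel apart the two bytes" experiment concrete, here is a
rough sketch. It uses Python's re module as a stand-in, not PCRE2 and
not git's grep.c, so it only illustrates why byte-wise and
character-wise matching can disagree for a literal pattern under -i;
it says nothing definitive about what PCRE2 itself does. The helper
names byte_mode_match() and char_mode_match() are made up for this
example.

```python
import re

# "é" (U+00E9) is the two bytes 0xC3 0xA9 in UTF-8;
# "É" (U+00C9) is the two bytes 0xC3 0x89.
LOWER = "\u00e9".encode("utf-8")
UPPER = "\u00c9".encode("utf-8")


def byte_mode_match(pattern: bytes, haystack: bytes) -> bool:
    """Case-insensitive literal search over raw bytes.

    Rough analogue of matching without a UTF mode: only ASCII
    letters are case-folded, and any byte can match on its own.
    """
    return re.search(re.escape(pattern), haystack, re.IGNORECASE) is not None


def char_mode_match(pattern: bytes, haystack: bytes) -> bool:
    """Case-insensitive literal search over decoded code points.

    Rough analogue of matching with a UTF mode: input must be valid
    UTF-8 (decoding raises UnicodeDecodeError otherwise), and case
    folding applies to non-ASCII letters too.
    """
    return re.search(
        re.escape(pattern.decode("utf-8")),
        haystack.decode("utf-8"),
        re.IGNORECASE,
    ) is not None


if __name__ == "__main__":
    # Whole character, differing only in case.
    print("bytes, é vs É:", byte_mode_match(LOWER, UPPER))
    print("chars, é vs É:", char_mode_match(LOWER, UPPER))

    # Second byte of "é" peeled off and used as the pattern.
    print("bytes, 0xA9 in é:", byte_mode_match(LOWER[1:], LOWER))
    try:
        char_mode_match(LOWER[1:], LOWER)
    except UnicodeDecodeError as err:
        print("chars, 0xA9 in é: rejected:", err.reason)
```

In this stand-in, the byte-wise search does not treat "é" and "É" as
equal under case-insensitive matching but happily matches a lone
continuation byte, while the character-wise search does the opposite
and refuses the isolated byte outright, much like the "isolated byte
with 0x80 bit set" failure quoted above.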