On Wed, Nov 17, 2021 at 10:46 AM René Scharfe <l.s.r@xxxxxx> wrote: > > Am 16.11.21 um 10:38 schrieb Carlo Arenas: > > On Tue, Nov 16, 2021 at 1:30 AM Andreas Schwab <schwab@xxxxxxxxxxxxxx> wrote: > >> > >> expecting success of 7812.13 'PCRE v2: grep ASCII from invalid UTF-8 data': > >> git grep -h "var" invalid-0x80 >actual && > >> test_cmp expected actual && > >> git grep -h "(*NO_JIT)var" invalid-0x80 >actual && > >> test_cmp expected actual > >> > >> ++ git grep -h var invalid-0x80 > >> ++ test_cmp expected actual > >> ++ test 2 -ne 2 > >> ++ eval 'diff -u' '"$@"' > >> +++ diff -u expected actual > >> ++ git grep -h '(*NO_JIT)var' invalid-0x80 > >> fatal: pcre2_match failed with error code -22: UTF-8 error: isolated byte with 0x80 bit set > > > > That is exactly what I was worried about, this is not failing one > > test, but making `git grep` unusable in any repository that has any > > binary files that might be reachable by it, and it is likely affecting > > anyone using PCRE older than 10.34 > > Let's have a look at the map. Here are the differences between the > versions regarding use of PCRE2_UTF: > > o: opt->ignore_locale > h: has_non_ascii(p->pattern) > i: is_utf8_locale() > l: !opt->ignore_case && (p->fixed || p->is_fixed) > > o h i l master hamza rene2 > 0 0 0 0 0 1 0 > 0 0 0 1 0 1 0 > 0 0 1 0 0 1 1 > 0 0 1 1 0 1 0 <== 7812.13, confirmed using fprint() debugging > > So http://public-inbox.org/git/0ea73e7a-6d43-e223-ab2e-24c684102856@xxxxxx/ > should not have this breakage, because it doesn't enable PCRE2_UTF for > literal patterns. Correct, and indeed I had your expression instead of the ugly one in my draft for v2[1] but then I found the tests were even more broken than reported originally and got worried it might introduce yet another issue because of the brittleness of the rest of the code around it. I am hoping it will be introduced in a follow up series though, and unless this code gets thrown away in the promised refactoring by Ævar which I haven't seen yet and once this is in maint. IMHO and in line with the previous warnings you raised[2] during the development of this feature, it might be worth looking at it more deeply, but at least the proposed patch keeps the feature working in practice (the only case that won't work as expected will be when the expression includes "." with the intention of matching a UTF-8 character and while using PCRE2 older than 10.34) and more importantly fixes the regression that it introduced. Carlo [1] https://lore.kernel.org/git/20211117102329.95456-1-carenas@xxxxxxxxx/ [2] https://lore.kernel.org/git/fc7eb9fc-9521-5484-b05f-9c20086fd485@xxxxxx/