Re: [PATCH v13 3/3] grep/pcre2: fix an edge case concerning ascii patterns and UTF-8 data

Carlo Arenas <carenas@xxxxxxxxx> · Wed, 17 Nov 2021 11:56:25 -0800

On Wed, Nov 17, 2021 at 10:46 AM René Scharfe <l.s.r@xxxxxx> wrote:
>
> Am 16.11.21 um 10:38 schrieb Carlo Arenas:
> > On Tue, Nov 16, 2021 at 1:30 AM Andreas Schwab <schwab@xxxxxxxxxxxxxx> wrote:
> >>
> >> expecting success of 7812.13 'PCRE v2: grep ASCII from invalid UTF-8 data':
> >>         git grep -h "var" invalid-0x80 >actual &&
> >>         test_cmp expected actual &&
> >>         git grep -h "(*NO_JIT)var" invalid-0x80 >actual &&
> >>         test_cmp expected actual
> >>
> >> ++ git grep -h var invalid-0x80
> >> ++ test_cmp expected actual
> >> ++ test 2 -ne 2
> >> ++ eval 'diff -u' '"$@"'
> >> +++ diff -u expected actual
> >> ++ git grep -h '(*NO_JIT)var' invalid-0x80
> >> fatal: pcre2_match failed with error code -22: UTF-8 error: isolated byte with 0x80 bit set
> >
> > That is exactly what I was worried about, this is not failing one
> > test, but making `git grep` unusable in any repository that has any
> > binary files that might be reachable by it, and it is likely affecting
> > anyone using PCRE older than 10.34
>
> Let's have a look at the map.  Here are the differences between the
> versions regarding use of PCRE2_UTF:
>
> o: opt->ignore_locale
> h: has_non_ascii(p->pattern)
> i: is_utf8_locale()
> l: !opt->ignore_case && (p->fixed || p->is_fixed)
>
> o h i l master hamza rene2
> 0 0 0 0      0     1     0
> 0 0 0 1      0     1     0
> 0 0 1 0      0     1     1
> 0 0 1 1      0     1     0  <== 7812.13, confirmed using fprint() debugging
>
> So http://public-inbox.org/git/0ea73e7a-6d43-e223-ab2e-24c684102856@xxxxxx/
> should not have this breakage, because it doesn't enable PCRE2_UTF for
> literal patterns.

Correct, and indeed I had your expression instead of the ugly one in
my draft for v2[1] but then I found the tests were even more broken
than reported originally and got worried it might introduce yet
another issue because of the brittleness of the rest of the code
around it.  I am hoping it will be introduced in a follow up series
though, and unless this code gets thrown away in the promised
refactoring by Ævar which I haven't seen yet and once this is in
maint.

IMHO and in line with the previous warnings you raised[2] during the
development of this feature, it might be worth looking at it more
deeply, but at least the proposed patch keeps the feature working in
practice (the only case that won't work as expected will be when the
expression includes "." with the intention of matching a UTF-8
character and while using PCRE2 older than 10.34) and more importantly
fixes the regression that it introduced.

Carlo

[1] https://lore.kernel.org/git/20211117102329.95456-1-carenas@xxxxxxxxx/
[2] https://lore.kernel.org/git/fc7eb9fc-9521-5484-b05f-9c20086fd485@xxxxxx/