Re: [RFC/PATCH 0/7] grep: move from kwset to optional PCRE v2

Johannes Schindelin <Johannes.Schindelin@xxxxxx> · Thu, 27 Jun 2019 20:21:06 +0200 (CEST)

Hi Ævar,

On Thu, 27 Jun 2019, Ævar Arnfjörð Bjarmason wrote:

>
> On Thu, Jun 27 2019, Johannes Schindelin wrote:
>
> > On Wed, 26 Jun 2019, Johannes Schindelin wrote:
> >
> >> On Wed, 26 Jun 2019, Ævar Arnfjörð Bjarmason wrote:
> >>
> >> > This speeds things up a lot, but as shown in the patches & tests
> >> > changed modifies the behavior where we have \0 in *patterns* (only
> >> > possible with 'grep -f <file>').
> >>
> >> I agree that it is not worth a lot to care about NULs in search patterns.
> >>
> >> So I am in favor of the goal of this patch series.
> >
> > There seems to be a Windows-specific test failure:
> > https://dev.azure.com/gitgitgadget/git/_build/results?buildId=11535&view=ms.vss-test-web.build-test-results-tab&runId=28232&resultId=101315&paneView=debug
> >
> > The output is this:
> >
> > -- snip --
> > not ok 5 - log --grep does not find non-reencoded values (latin1)
> >
> > expecting success:
> > 	git log --encoding=ISO-8859-1 --format=%s --grep=$utf8_e >actual
> > &&
> > 	test_must_be_empty actual
> >
> > ++ git log --encoding=ISO-8859-1 --format=%s --grep=é
> > fatal: pcre2_match failed with error code -8: UTF-8 error: byte 2 top bits
> > not 0x80
> > -- snap --
> >
> > Any quick ideas? (I _could_ imagine that it is yet another case of passing
> > non-UTF-8-encoded stuff via command-line vs via file, which does not work
> > on Windows.)
>
> This is an existing issue that my patches just happen to uncover. I'm
> working on a v2 which'll fix it.

I just found yet another problem. When using a libpcre2 _without_ JIT
support, I get this:

-- snip --
$ sh t4210-log-i18n.sh -i -V -x || tail -20
test-results/t4210-log-i18n.out
ok 1 - create commits in different encodings
ok 2 - log --grep searches in log output encoding (utf8)
ok 3 # skip log --grep searches in log output encoding (latin1) (missing !MINGW)
ok 4 # skip log --grep does not find non-reencoded values (utf8) (missing !MINGW)
not ok 5 - log --grep does not find non-reencoded values (latin1)
#
#               git log --encoding=ISO-8859-1 --format=%s --grep=$utf8_e
#               >actual &&
#               test_must_be_empty actual
#
ok 3 # skip log --grep searches in log output encoding (latin1) (missing !MINGW)

skipping test: log --grep does not find non-reencoded values (utf8)
        git log --encoding=utf8 --format=%s --grep=$latin1_e >actual &&
        test_must_be_empty actual

ok 4 # skip log --grep does not find non-reencoded values (utf8) (missing !MINGW)

expecting success:
        git log --encoding=ISO-8859-1 --format=%s --grep=$utf8_e >actual &&
        test_must_be_empty actual

++ git log --encoding=ISO-8859-1 --format=%s --grep=é
fatal: pcre2_match failed with error code -8: UTF-8 error: byte 2 top bits not 0x80
error: last command exited with $?=128
not ok 5 - log --grep does not find non-reencoded values (latin1)
#
#               git log --encoding=ISO-8859-1 --format=%s --grep=$utf8_e
#               >actual &&
#               test_must_be_empty actual
#
-- snap --

This is actually a correct error, as we specifically feed non-UTF-8 text
to PCRE2, but we do turn on the PCRE2_UTF flag.

Funnily enough, this error only occurs when `pcre2_jit_on == 0`, i.e. when
we hit the code path that calls `pcre2_match()`. When the alternative code
path is used, `pcre2_jit_match()` is called, and it does _not_ print that
error.

Whatever the bug in libpcre2 that causes the JIT code path to fail on the
Unicode validation, it points to the problem in this code in
`compile_pcre2_pattern()`:

-- snip --
        if (is_utf8_locale() && has_non_ascii(p->pattern))
                options |= PCRE2_UTF;
-- snap --

It only asks whether there is a non-ASCII character in pattern, but we
never bother to see whether the haystack is also encoded in UTF-8. In this
case, it is not...

Ciao,
Dscho

P.S.: Yes, yes, I know, we should run PCRE2 with JIT, and I am trying to
figure out why it is not enabled on Windows when I specifically asked
`./configure` to enable it... Investigating now.

-