Re: Suspected git grep regression in git 2.40.0

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 21.03.23 17:33, Junio C Hamano wrote:
> [jc: on the CC: line, summoned a few people who may know a lot more
> about pcre than I do]
> 
> Stephane Odul <stephane@xxxxxxxxxx> writes:
> 
>> We have a CI pipeline on a private repository that started failing
>> consistently while running `git grep -P` commands. The command
>> fails with an exit code of -11 and is failing pretty
>> consistently. With prior versions of git there is no issue
>> whatsoever, but with 2.40.0 it always fails on the same
>> call. Other git grep calls are fine, but that one is failing
>> consistently.
>>
>> I was not able to reproduce locally but my main machine is an M1
>> MacBook Pro, the CI pipeline runs under Kubernetes in AWS and the
>> container is based on Ubuntu 20.04 with the git client installed
>> via the PPA.
>>
>> The error is for this pattern `git grep -cP '^\w+ = json.load'`.
>>
>> As a workaround we tried to download and install the microsoft-git
>> v2.39.2 deb package since it allows us to downgrade, but then the
>> git grep commands just got stuck.
> 
> One "grep -P" related change we had recently between 2.39 and 2.40
> was
> 
>     50b6ad55 (grep: fall back to interpreter if JIT memory
>     allocation fails, 2023-01-31)
> 
> The code tries to figure out at runtime if pcre engine has
> functioning JIT by making an extra JIT compilation of a sample
> pattern and when it fails with a specific reason, fall back to
> interpreted pattern matching.  The change made the code to emit a
> bit more detailed information when it fails, but a controlled exit
> from the codepath should give $?=128, not 11.
> 
> So the above commit may or may not be related.  It could be that the
> version of pcre library linked to Git 2.40 and older Git you are
> running in your CI environment has been updated.

I don't think it's that commit. I just tried the pattern here on a system that
would require the interpreter fallback and one that wouldn't and both didn't
error out (Debian bookworm/sid; libpcre2-8: 10.42-1).

Looking at the history, though, another commit stands out:

  $ git log --oneline --no-merges v2.39.0..v2.40.0 -- grep.c
  fb2ebe72a374 grep API: plug memory leaks by freeing "header_list"
  891c9965fbc0 grep.c: refactor free_grep_patterns()
  50b6ad55b04d grep: fall back to interpreter if JIT memory allocation fails
  acabd2048ee0 grep: correctly identify utf-8 characters with \{b,w} in -P

Commit acabd2048ee0 suspiciously matches the pattern with its "\w" mention.
Looking at the diff gives:

        if (!opt->ignore_locale && is_utf8_locale() && !literal)
-               options |= (PCRE2_UTF | PCRE2_MATCH_INVALID_UTF);
+               options |= (PCRE2_UTF | PCRE2_UCP | PCRE2_MATCH_INVALID_UTF);


So maybe the PCRE2_UCP option isn't supported by the system in question or it
forces PCRE2 to do a verification it didn't do before?

The exit code of -11 could relate to PCRE2's error PCRE2_ERROR_UTF8_ERR9 (-11)
strengthening the suspicion it's related to that change. It's described as:

  PCRE2_ERROR_UTF8_ERR9   5th-byte's two top bits are not 0x80

Maybe the input data isn't valid UTF-8 encoded?

Stephane, can you pin point the file that the grep is failing on and verify it
is indeed a valid UFT-8 encoded file via:

  $ iconv -f UTF-8 your_file > /dev/null && echo OK || echo "Not valid"

If this prints "OK", we need to investigate further. However, if it prints
"Not valid", a broken file encoding is probably the cause and we should
revisit the above change and think of alternatives, as this might be seen as
a regression (even though the file isn't strictly valid UTF-8).

Thanks,
Mathias

> 
> Does it make a difference if you disable JIT by prefixing the
> pattern with (*NO_JIT)?
> 
> Thanks.



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux