Re: grep vs git grep performance?

Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> · Sat, 28 Oct 2017 09:45:55 +0200

On Fri, Oct 27 2017, Joe Perches jotted:

> On Sat, 2017-10-28 at 00:11 +0200, Ævar Arnfjörð Bjarmason wrote:
>> On Fri, Oct 27 2017, Joe Perches jotted:
> []
>> > git grep performance has already been
>> > quite successfully improved.
>>
>> ...and I have WIP patches to use the PCRE engine for patterns without -P
>> which I intend to start sending soon after the next release.
>
> One addition that would be quite nice would be
> an option to have regex matches span input lines.
>
> grep v2.54 was the last grep version that allowed
> this and I keep it around just for that.
>
> ie:
>
> $ cat hello.txt
> Hello
> World
> $ grep -P "Hello\s*World" hello.txt
> $ grep-2.5.4 -P "Hello\s*World" hello.txt
> Hello
> World

I'm unable to build 2.5.4 and can't find anything relevant in the
release notes at a quick glance around that time saying that this would
be removed, if you can still build it I'd be interested to see what this
bisects down to in grep.git.

But aside from that, a feature like this constrains the regex
implementation a lot since it's going to need to either match the entire
file as we'd need to do with PCRE, or we'd need to really deeply embed
the core logic of the regex matcher into our grep implementation.

I.e. in this case a more optimal implementation would start by parsing
this regex down:

    ((EXACT "Hello")
     (STAR (POSIXU "\s"))
     (EXACT "World"))

Then when you open the file you can start searching for the fixed-string
"Hello", if you don't find that you're done, if you do you can forward
look-ahead for the fixed "World", and only if you find that do you need
to match the more complex part in the middle.

Whereas our API for the internal regex matchers now is that we find the
boundaries of newlines and batch-match a bunch of lines with a match()
function that takes a string, and if that matches we drill down to what
specific line matches.

Which is not to say that this can't be done without a potentially
unacceptable memory trade-off (i.e. matching the entire file in all
cases), the PCRE2 engine in particular includes some I/O abstractions
that we're not using but could (but I haven't looked into it).

But right now the entire internal API we have is constrained by catering
to the lowest common denominator (a regexec that takes a char*), so
supporting more fancy multi-line matching features can be a PITA since
we'd need to maintain both codepaths.

Or we could make PCRE a hard dependency, which given the performance
advantages I'm increasingly willing to make the case for.