Re: [PATCH v2 1/7] grep: don't redundantly compile throwaway patterns under threading

On Fri, May 26, 2017 at 2:58 AM, Junio C Hamano <gitster@xxxxxxxxx> wrote:
> Ævar Arnfjörð Bjarmason <avarab@xxxxxxxxx> writes:
>
>> I think it's a pointless distraction to start speculating in this
>> commit message about what we're going to do with --debug if it ever
>> starts emitting some debugging information at pattern execution time.
>
> OK.
>
>> As an aside, I'd very much like to remove both --debug and
>> --and/--or/--all-match, given some very rough edges in the UI and how
>> easy it is to make that feature error or segfault. I suspect you might
>> be the only one using it.
>
> I agree that rewriting "grep -e A -e B" to "grep -e A|B" as an
> optimization is an interesting possibility to look into, and I can
> understand that having to support "--and" and "--not" would
> make such an optimization harder to implement. "-e A --and -e B"
> must become "-e A.*B|B.*A" and as you get more terms your unified
> pattern will grow combinatorially, at which point you would be better
> off matching N patterns and combining the result.
>
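
To make the growth concrete, here is a rough Python sketch of the
rewrite described above. The helper name and_as_alternation is made up
for illustration and is not anything in git. It joins every ordering of
the terms with ".*", so N terms give N! alternatives. It also assumes
the terms don't overlap within a line: "ab" --and "bc" matches the line
"abc", but "ab.*bc|bc.*ab" does not.

```python
import itertools
import re


def and_as_alternation(terms):
    """Build one regex that matches a line containing every term.

    Hypothetical helper: emulates `-e A --and -e B` as `A.*B|B.*A`
    by listing every ordering of the (escaped) terms.
    """
    escaped = [re.escape(t) for t in terms]
    orderings = itertools.permutations(escaped)
    return "|".join(".*".join(order) for order in orderings)


two_terms = and_as_alternation(["A", "B"])
four_terms = and_as_alternation(["a", "b", "c", "d"])
print(two_terms)
print(len(four_terms.split("|")), "alternatives for four terms")
```
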
> Ever seen a user run "ps | grep rogue | grep -v grep" to find a rogue
> process to kill?  That would not work if the rogue process's command
> line has a word "grep".  Because "git grep" is often run on files in
> order to find the location the patterns appear in, "git grep -e
> pattern | grep -v unwanted" shares the same issue--the unwanted
> pattern may appear in the filename, and the downstream "grep -v" may
> filter out a valid hit.  This is why "--not" exists [*1*].  I agree
> that emulating it within the same "concatenate patterns into one"
> optimization you are envisioning may be hard.
>
> Attempting to optimize "--all-match" would share similar difficulty
> with "--and", but your matching now must be done with the entire
> buffer and not go line-by-line.  It was meant to make it possible to
> say "find commits that avarab@ talks about both regex and log", i.e.
>
>         $ git log --author=avarab@ --all-match --grep=log --grep=regex
>
> This is not something you can emulate by piping an output of grep to
> another grep.
>
> But none of the above means you have to give up optimizing.
>
> You can choose not to combine them into a single pattern if certain
> constructions are hard, and do only the easy ones.  If you think
> that harder combinations are not used very often, the result would
> be faster for many cases while not losing useful features, which is
> what we want.

To be clear, the point of my mail was not to say "I can't think of a
way to support both of these things, help!". Obviously we can continue
to maintain two codepaths. The point was to raise the idea that we
could simply remove the more complex codepath, which is doomed to be
slow forever.

Obviously there are caveats with the likes of "grep foo | grep bar"
that don't exist with "grep -e foo --and -e bar". I'm less interested
in whether we can come up with cases that wouldn't be possible if this
were removed than in whether anyone's using them in practice.

I suspect that, to the extent anyone uses this for common things, it
could be emulated by --single-line --perl-regexp and e.g. 'foo.*bar'
instead of 'foo' --and 'bar'. I.e. we could offer to AND together your
regexes and match them over the entire content.
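
As a rough illustration of the difference, here is a Python sketch. It
uses Python's re module with re.DOTALL as a stand-in for a PCRE pattern
where "." also matches newlines. It is not git's implementation, and
both function names are made up. Note that a plain 'foo.*bar' is
order-sensitive, which a line-level --and is not.

```python
import re

text = "foo on line one\nbar on line two\n"


def lines_with_all(text, terms):
    """Line-by-line AND: keep lines on which every term matches."""
    return [line for line in text.splitlines()
            if all(re.search(t, line) for t in terms)]


def content_matches(text, pattern):
    """Whole-content match: with re.DOTALL, '.' can cross newlines."""
    return re.search(pattern, text, re.DOTALL) is not None


print(lines_with_all(text, ["foo", "bar"]))  # no single line has both
print(content_matches(text, "foo.*bar"))     # spans the two lines
print(content_matches(text, "bar.*foo"))     # wrong order for this text
```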

If someone needed something more complex we could just show an example
of piping e.g. \0-delimited commit messages into an arbitrary perl
script you provide.
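
Such a filter could look something like the sketch below, written in
Python rather than perl to keep the examples in one language. The name
messages_matching_all and the sample data are made up. It splits
NUL-delimited messages and keeps those where every pattern matches
somewhere in the message, roughly what --all-match does for --grep.

```python
import re


def messages_matching_all(raw, patterns):
    """Split NUL-delimited messages; keep those matching every pattern."""
    compiled = [re.compile(p) for p in patterns]
    messages = [m for m in raw.split("\0") if m.strip()]
    return [m for m in messages if all(c.search(m) for c in compiled)]


# Made-up sample standing in for NUL-delimited commit messages.
sample = "grep: speed up regex\n\nAlso touch log.\n\0log: fix typo\n\0"
hits = messages_matching_all(sample, ["regex", "log"])
for message in hits:
    print(message.splitlines()[0])
```

To run it on real history, the input could come from something like
"git log -z --format=%B" read via sys.stdin.read() in place of the
sample string.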

Anyway, I've only looked this over a tiny bit, and I don't know
whether it's worth it to remove this. Right now I'm just interested
in some reports of what it's used for, i.e. whether anyone uses it
for N-level deep mixed AND/OR branches, or whether it's really just a
lazy way to concat regexes and get around the current limitation of
not being able to match across lines.

> [Footnote]
>
> *1* For human consumption, lack of "--not" may not hurt in the sense
>     that there are workarounds (i.e. you can do without "| grep -v
>     unwanted" and filter irrelevant ones by eyeballing).  But it is
>     essential while scripting and trying to be precise.



