Re: git log exclude pathspec from file - supported? plans?

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Wed, Jun 30 2021, Jeff King wrote:

> On Wed, Jun 30, 2021 at 12:59:43PM -0400, Martin Langhoff wrote:
>
>> long time no see! I'm doing some complex git repo spelunking and
>> pushing the boundaries of the pathspec magic for excludes.
>> 
>> Is there a reasonable way to provide a (potentially large) set of
>> excludes? something like
>> 
>>      git log --exclude-pathspec-file paths-to-exclude.txt .
>> 
>> Has there been discussion / patches / plans related to this? I may
>> have some cycles (hopefully!)
>
> You can feed pathspecs via --stdin. So:
>
>   {
> 	echo "--"
> 	sed s/^/:^/ paths-to-exclude.txt
>   } | git log --stdin
>
> works. Obviously it's not as turn-key if you really do have a list of
> paths in a file already, but it's much more flexible.
>
> I'll caution you that the pathspec code is not well-optimized to handle
> a large number of pathspecs. E.g.:
>
>   [no pathspecs]
>   $ time git rev-list HEAD /dev/null
>   real	0m0.033s
>   user	0m0.017s
>   sys	0m0.017s
>
>   [trivial pathspec; now we have to actually open up trees]
>   $ { echo --; echo .; } >input
>   $ time git rev-list HEAD --stdin <input >/dev/null
>   real	0m1.338s
>   user	0m1.294s
>   sys	0m0.045s
>
>   [lots of pathspecs; now we spend loads of time actually matching
>    strings; the ^C is when I got bored and killed it]
>   $ { echo --; git ls-files; } >input
>   $ time git rev-list HEAD --stdin <input >/dev/null
>   ^C
>   real	1m24.406s
>   user	1m24.369s
>   sys	0m0.036s
>
> The problem is that we try to linearly match every pathspec against
> every path we consider, so it's quadratic-ish in the number of files in
> the repo. I played a long time ago with storing non-wildcard pathspecs
> in a trie that we could traverse as we talked the individual trees we
> were matching. It performed well, but IIRC the interface was hacky (I
> had to bolt it specifically onto the way the tree-walker uses
> pathspecs, and the other pathspec matchers didn't benefit at all).
>
> I can probably dig it up if anybody's interested in looking at it.

If it's not too much trouble I'd find it interesting, but I likely won't
do anything with it any time soon.

One of the PCREv2 experiments I had very early WIP work towards was to
create a search index for commit messages, contents etc. and stick it in
something similar to the --changed-paths part of the commit-graph.

The PCREv2 codebase actually has (supposedly) a bug-for-bug compatible
implementation of our wildmatch function as a translator to a PCREv2
regex, I have a brnch somewhere where we run all our wildmatch tests
against it successfully.

So couple that with regex introspection, and a search index that
e.g. creates a trie bloom filter, then as long as your --grep=<RX>,
-G<RX> or pathspec has at least 3 fixed strings among its wildcards we
can ask the bloom filter "is this commit a candidate for this regex
searching this path/commit message/diff/whatever".

So you can have indexed matches for things like '*/test-lib.sh", not
just prefixes or fixed-strings.



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux