On Wed, Jun 30 2021, Jeff King wrote:

> On Wed, Jun 30, 2021 at 12:59:43PM -0400, Martin Langhoff wrote:
>
>> long time no see! I'm doing some complex git repo spelunking and
>> pushing the boundaries of the pathspec magic for excludes.
>>
>> Is there a reasonable way to provide a (potentially large) set of
>> excludes? something like
>>
>>    git log --exclude-pathspec-file paths-to-exclude.txt .
>>
>> Has there been discussion / patches / plans related to this? I may
>> have some cycles (hopefully!)
>
> You can feed pathspecs via --stdin. So:
>
>   {
>     echo "--"
>     sed s/^/:^/ paths-to-exclude.txt
>   } | git log --stdin
>
> works. Obviously it's not as turn-key if you really do have a list of
> paths in a file already, but it's much more flexible.
>
> I'll caution you that the pathspec code is not well-optimized to handle
> a large number of pathspecs. E.g.:
>
>   [no pathspecs]
>   $ time git rev-list HEAD >/dev/null
>   real	0m0.033s
>   user	0m0.017s
>   sys	0m0.017s
>
>   [trivial pathspec; now we have to actually open up trees]
>   $ { echo --; echo .; } >input
>   $ time git rev-list HEAD --stdin <input >/dev/null
>   real	0m1.338s
>   user	0m1.294s
>   sys	0m0.045s
>
>   [lots of pathspecs; now we spend loads of time actually matching
>    strings; the ^C is when I got bored and killed it]
>   $ { echo --; git ls-files; } >input
>   $ time git rev-list HEAD --stdin <input >/dev/null
>   ^C
>   real	1m24.406s
>   user	1m24.369s
>   sys	0m0.036s
>
> The problem is that we try to linearly match every pathspec against
> every path we consider, so it's quadratic-ish in the number of files in
> the repo. I played a long time ago with storing non-wildcard pathspecs
> in a trie that we could traverse as we talked the individual trees we
> were matching. It performed well, but IIRC the interface was hacky (I
> had to bolt it specifically onto the way the tree-walker uses
> pathspecs, and the other pathspec matchers didn't benefit at all).
>
> I can probably dig it up if anybody's interested in looking at it.

If it's not too much trouble I'd find it interesting, but I likely won't
do anything with it any time soon.

One of the PCREv2 experiments I had very early WIP work towards was to
create a search index for commit messages, contents etc. and stick it in
something similar to the --changed-paths part of the commit-graph.

The PCREv2 codebase actually has (supposedly) a bug-for-bug compatible
implementation of our wildmatch function as a translator to a PCREv2
regex; I have a branch somewhere where we run all our wildmatch tests
against it successfully.

So couple that with regex introspection, and a search index that
e.g. creates a trie bloom filter, then as long as your --grep=<RX>,
-G<RX> or pathspec has at least 3 fixed strings among its wildcards we
can ask the bloom filter "is this commit a candidate for this regex
searching this path/commit message/diff/whatever".

So you can have indexed matches for things like '*/test-lib.sh', not
just prefixes or fixed-strings.
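
To make the "is this commit a candidate" idea concrete, here is a rough
Python sketch. It is only an illustration under stated assumptions, not
what my WIP branch or the commit-graph --changed-paths filter does: I'm
assuming the index stores 3-character substrings (trigrams) of each
changed path in a per-commit Bloom filter, and that the fixed parts of a
wildcard can be found by simply splitting on `*`, `?` and `[...]`. The
names BloomFilter, index_paths, required_trigrams and is_candidate are
all made up for this example.

```python
import hashlib
import re


class BloomFilter:
    """Tiny Bloom filter; sizes are illustrative, not tuned."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item):
        # Derive num_hashes bit positions by salting SHA-1 with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        # False means "definitely absent"; True means "maybe present".
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def trigrams(text):
    """All 3-character substrings of text (empty if text is shorter)."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def index_paths(paths):
    """Build the hypothetical per-commit filter from its changed paths."""
    bloom = BloomFilter()
    for path in paths:
        for gram in trigrams(path):
            bloom.add(gram)
    return bloom


def required_trigrams(glob):
    """Trigrams every match must contain (assumption: no backslash escapes)."""
    literal_runs = re.split(r"[*?]|\[[^\]]*\]", glob)
    needed = set()
    for run in literal_runs:
        needed |= trigrams(run)
    return needed


def is_candidate(bloom, glob):
    """Return False only when the commit cannot contain a matching path."""
    needed = required_trigrams(glob)
    if not needed:
        return True  # too few fixed characters; the index cannot help
    return all(bloom.might_contain(gram) for gram in needed)


commit_filter = index_paths(["t/test-lib.sh", "Makefile"])
print(is_candidate(commit_filter, "*/test-lib.sh"))
```

A False answer lets us skip the commit without opening its trees; a True
answer only means we still have to run the real wildmatch, since Bloom
filters give false positives but never false negatives. A pattern like
'*.c' has no 3-character fixed run, so it always falls back to the slow
path.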