On 6/28/2022 6:50 AM, Pavel Rappo wrote:

Hi Pavel! Welcome.

> I have a repo of the following characteristics:
>
> * 1 branch
> * 100,000 commits

This is not too large.

> * 1TB in size

This _is_ large.

> * The tip of the branch has 55,000 files

And again, this is not large. Taken together, this means you have some
very large files in your repo, perhaps even binary files that you
don't intend to search.

> * No new commits are expected: the repo is abandoned and kept for
>   archaeological purposes.
>
> Typically, a `git log -S/-G` lookup takes around a minute to complete.
> I would like to significantly reduce that time. How can I do that? I
> can spend up to 10x more disk space, if required. The machine has 10
> cores and 32GB of RAM.

You are using -S<string> or -G<regex> to see which commits change the
number of matches of that <string> or <regex>. If you don't provide a
pathspec, then Git will search every changed file, including those
very large binary files. Perhaps you'd like to start by providing a
pathspec that limits the search to only the meaningful code files? (A
sketch follows below my signature.)

As far as I know, Git doesn't have any data structures that can speed
up content-based matches like this. The commit-graph's changed-path
Bloom filters only help Git with questions like "did this specific
file change?", which is not going to be a critical code path in what
you're describing.

I'm not sure what you're actually trying to ask with -S or -G, so
maybe it is worth considering other types of queries, such as
-L<n>,<m>:<file> or something. This is just a shot in the dark, as you
might be doing the only thing you _can_ do to solve your problem.

Thanks,
-Stolee
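
P.S. A few command sketches to make the above concrete. The search
string, function name, and paths in them are hypothetical
placeholders; substitute your own.

First, the pathspec-limited pickaxe. If the code you care about lives
under src/ (a made-up path) and you're tracking a symbol like
my_function (also made up), restricting the walk to that pathspec
keeps Git from diffing the huge binaries:

  # Only inspect changes to files under src/; "my_function" and
  # "src/" are placeholders for your own string and paths.
  git log -S'my_function' -- src/

  # Same idea with -G, which matches a regex against the diff text:
  git log -G'my_function\(' -- src/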
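For what it's worth, if you do experiment with pathspec-limited walks,
this is how you would build a commit-graph with the changed-path Bloom
filters mentioned above. They only help the "which commits touched
this path?" part of the walk, not the string matching itself:

  # One-time, offline: write the commit-graph for all reachable
  # commits, including changed-path Bloom filters.
  git commit-graph write --reachable --changed-paths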
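And the -L form, which follows the history of a line range or a
function within a single file (the file and function names here are
again hypothetical):

  # History of lines 10-20 of src/main.c:
  git log -L 10,20:src/main.c

  # History of the function "my_function" in that file, using the
  # funcname form of -L:
  git log -L :my_function:src/main.c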