Re: [PATCH v6 0/7] Changed path filter hash fix and version bump

Taylor Blau <me@xxxxxxxxxxxx> · Thu, 3 Aug 2023 14:45:54 -0400

On Thu, Aug 03, 2023 at 09:18:11AM -0400, Derrick Stolee wrote:
> > So I think the revised condition is something like: a commit's Bloom
> > filter is reusable when there are no paths with characters >= 0x80 in
> > a tree-diff against its first parent. I think that ensuring that there
> > are no such paths in both a commit's root tree, as well as its first
> > parent's root tree is equivalent, since that removes the possibility of
> > such a path showing up in its tree-diff.
>
> This condition is exactly "we computed the diff to know which paths were
> input to the filter" which is as difficult as recomputing the Bloom filter
> from scratch. I don't think there is much room to gain a performance
> improvement here.

I think that's true in the worst case, and certainly for repositories
with many tree entries which have characters >= 0x80.

But note that it's a heuristic in one direction only. If we know that a
commit's root tree (and that of it's first parent, if it has one) is
free of any such paths, then it's impossible for the first parent
tree-diff to contain such an entry, and therefore we can reuse any
existing filter.

Of course, a commit's root tree (and its parent) may both have a path
whose characters >= 0x80 while still not seeing a corresponding entry
show up in the tree-diff if that path is unchanged between a commit and
its first parent.

I think if we were looking at every tree every time only to realize that
we have to go back and compute its changed-path Bloom filter, this would
be a non-starter. But since we "cache" the results of our walk via the
tree object's flags bits, we can skip looking at many trees.

In my testing, this showed a significant improvement on linux.git and
git.git. My setup for testing is something like:

    $ git clone git@xxxxxxxxxx:torvalds/linux.git
    $ cd linux
    $ git commit-graph write --reachable --changed-paths
    $ graph=".git/objects/info/commit-graph"
    $ mv $graph{,.bak}

so that .git/objects/info/commit-graph.bak is a commit-graph with v1
changed-path Bloom filters for all commits in generation order.

With that in place, I can do:

    $ git config commitGraph.changedPathsVersion 2
    $ hyperfine -p 'cp -f $graph.bak $graph' -L v 0,1 \
        'GIT_TEST_UPGRADE_BLOOM_FILTERS={v} git.compile commit-graph write --reachable --changed-paths'

, producing the following results on linux.git:

    Benchmark 1: GIT_TEST_UPGRADE_BLOOM_FILTERS=0 git.compile commit-graph write --reachable --changed-paths
      Time (mean ± σ):     124.873 s ±  0.316 s    [User: 124.081 s, System: 0.643 s]
      Range (min … max):   124.621 s … 125.227 s    3 runs

    Benchmark 2: GIT_TEST_UPGRADE_BLOOM_FILTERS=1 git.compile commit-graph write --reachable --changed-paths
      Time (mean ± σ):     79.271 s ±  0.163 s    [User: 74.611 s, System: 4.521 s]
      Range (min … max):   79.112 s … 79.437 s    3 runs

    Summary
      'GIT_TEST_UPGRADE_BLOOM_FILTERS=1 git.compile commit-graph write --reachable --changed-paths' ran
        1.58 ± 0.01 times faster than 'GIT_TEST_UPGRADE_BLOOM_FILTERS=0 git.compile commit-graph write --reachable --changed-paths'

On git.git, writing a new commit-graph after upgrading from v1 to v2
filters went from taking 4.163 seconds to 3.348 seconds, for a more
modest 1.24x speed-up.

Of course, all of this depends on how much of the tree meets the above
conditions, so we'd expect worse results on repositories with paths that
contain characters >= 0x80. I think we'd want some kind of mechanism
(probably via config, not a GIT_TEST environment variable) to control
whether or not to upgrade existing filters.

Thanks,
Taylor