On Thu, Aug 03, 2023 at 09:18:11AM -0400, Derrick Stolee wrote: > > So I think the revised condition is something like: a commit's Bloom > > filter is reusable when there are no paths with characters >= 0x80 in > > a tree-diff against its first parent. I think that ensuring that there > > are no such paths in both a commit's root tree, as well as its first > > parent's root tree is equivalent, since that removes the possibility of > > such a path showing up in its tree-diff. > > This condition is exactly "we computed the diff to know which paths were > input to the filter" which is as difficult as recomputing the Bloom filter > from scratch. I don't think there is much room to gain a performance > improvement here. I think that's true in the worst case, and certainly for repositories with many tree entries which have characters >= 0x80. But note that it's a heuristic in one direction only. If we know that a commit's root tree (and that of it's first parent, if it has one) is free of any such paths, then it's impossible for the first parent tree-diff to contain such an entry, and therefore we can reuse any existing filter. Of course, a commit's root tree (and its parent) may both have a path whose characters >= 0x80 while still not seeing a corresponding entry show up in the tree-diff if that path is unchanged between a commit and its first parent. I think if we were looking at every tree every time only to realize that we have to go back and compute its changed-path Bloom filter, this would be a non-starter. But since we "cache" the results of our walk via the tree object's flags bits, we can skip looking at many trees. In my testing, this showed a significant improvement on linux.git and git.git. My setup for testing is something like: $ git clone git@xxxxxxxxxx:torvalds/linux.git $ cd linux $ git commit-graph write --reachable --changed-paths $ graph=".git/objects/info/commit-graph" $ mv $graph{,.bak} so that .git/objects/info/commit-graph.bak is a commit-graph with v1 changed-path Bloom filters for all commits in generation order. With that in place, I can do: $ git config commitGraph.changedPathsVersion 2 $ hyperfine -p 'cp -f $graph.bak $graph' -L v 0,1 \ 'GIT_TEST_UPGRADE_BLOOM_FILTERS={v} git.compile commit-graph write --reachable --changed-paths' , producing the following results on linux.git: Benchmark 1: GIT_TEST_UPGRADE_BLOOM_FILTERS=0 git.compile commit-graph write --reachable --changed-paths Time (mean ± σ): 124.873 s ± 0.316 s [User: 124.081 s, System: 0.643 s] Range (min … max): 124.621 s … 125.227 s 3 runs Benchmark 2: GIT_TEST_UPGRADE_BLOOM_FILTERS=1 git.compile commit-graph write --reachable --changed-paths Time (mean ± σ): 79.271 s ± 0.163 s [User: 74.611 s, System: 4.521 s] Range (min … max): 79.112 s … 79.437 s 3 runs Summary 'GIT_TEST_UPGRADE_BLOOM_FILTERS=1 git.compile commit-graph write --reachable --changed-paths' ran 1.58 ± 0.01 times faster than 'GIT_TEST_UPGRADE_BLOOM_FILTERS=0 git.compile commit-graph write --reachable --changed-paths' On git.git, writing a new commit-graph after upgrading from v1 to v2 filters went from taking 4.163 seconds to 3.348 seconds, for a more modest 1.24x speed-up. Of course, all of this depends on how much of the tree meets the above conditions, so we'd expect worse results on repositories with paths that contain characters >= 0x80. I think we'd want some kind of mechanism (probably via config, not a GIT_TEST environment variable) to control whether or not to upgrade existing filters. Thanks, Taylor