The file 'dir/subdir/file' can only be modified if its leading directories 'dir' and 'dir/subdir' are modified as well. So when checking modified path Bloom filters looking for commits modifying a path with multiple path components, then check not only the full path in the Bloom filters, but all its leading directories as well. Take care to check these paths in "deepest first" order, because it's the full path that is least likely to be modified, and the Bloom filter queries can short circuit sooner. This can significantly reduce the average false positive rate, by about an order of magnitude or three(!), and can further speed up pathspec-limited revision walks. The table below compares the average false positive rate and runtime of git -c core.modifiedPathBloomFilters=1 rev-list HEAD -- "$path" before and after this change for 5000+ randomly selected paths from each repository: Average false Average Average positive rate runtime runtime before after before after difference ------------------------------------------------------------------ android-base 0.526% 0.00416% 0.1727s 0.1503s -13.0% cmssw 0.494% 0.00395% 0.0426s 0.0332s -22.0% cpython 0.213% 0.01661% 0.0940s 0.0858s -8.8% elasticsearch 0.679% 0.00325% 0.0191s 0.0147s -23.2% gcc 0.398% 0.00892% 0.3423s 0.2192s -36.0% gecko-dev 0.472% 0.00073% 0.6243s 0.5134s -17.8% git 0.191% 0.06992% 0.0384s 0.0319s -17.1% glibc 0.392% 0.01639% 0.0425s 0.0296s -30.5% go 0.453% 0.01262% 0.0515s 0.0425s -17.5% jdk 0.662% 0.00643% 0.0083s 0.0070s -15.0% linux 0.434% 0.00749% 0.1102s 0.0911s -17.3% llvm-project 0.511% 0.00391% 0.4865s 0.4290s -11.8% rails 0.476% 0.01313% 0.0464s 0.0410s -11.6% rust 0.457% 0.02551% 0.0590s 0.0462s -21.7% tensorflow 0.511% 0.00824% 0.0642s 0.0517s -19.5% webkit 0.661% 0.00101% 0.3315s 0.2576s -22.3% The improvements in runtime are much smaller than the improvements in average false positive rate, as we are clearly reaching diminishing returns here. However, all these timings depend on that accessing tree objects is reasonably fast (warm caches). If we had a partial clone and the tree objects had to be fetched from a promisor remote, e.g.: $ git clone --filter=tree:0 --bare file://.../webkit.git webkit.notrees.git $ git -C webkit.git -c core.modifiedPathBloomFilters=1 \ commit-graph write --reachable $ cp webkit.git/objects/info/commit-graph webkit.notrees.git/objects/info/ $ git -C webkit.notrees.git -c core.modifiedPathBloomFilters=1 \ rev-list HEAD -- "$path" then checking all leading path component can reduce the runtime from over an hour to a few seconds (and this is with the clone and the promisor on the same machine). Signed-off-by: SZEDER Gábor <szeder.dev@xxxxxxxxx> --- commit-graph.c | 27 ++++++++++++++++++++++----- 1 file changed, 22 insertions(+), 5 deletions(-) diff --git a/commit-graph.c b/commit-graph.c index f9a21ecdfb..1fd1b4f8dd 100644 --- a/commit-graph.c +++ b/commit-graph.c @@ -977,8 +977,10 @@ void init_pathspec_bloom_fields(struct repository *r, for (i = 0; i < pathspec->nr; i++) { struct pathspec_item *pi = &pathspec->items[i]; - const char *path = pi->match; + const char *path = pi->match, *p; size_t len = pi->len; + int path_component_nr = 0, j; + uint32_t *hashes; /* * Pathspec parsing has normalized away any consecutive @@ -988,13 +990,28 @@ void init_pathspec_bloom_fields(struct repository *r, if (path[len - 1] == '/') len--; - pi->modified_path_bloom_hashes_nr = graph->num_modified_path_bloom_hashes; + p = path; + do { + p = strchrnul(p + 1, '/'); + path_component_nr++; + } while (p - path < len); + + pi->modified_path_bloom_hashes_nr = path_component_nr * graph->num_modified_path_bloom_hashes; ALLOC_ARRAY(pi->modified_path_bloom_hashes, pi->modified_path_bloom_hashes_nr); - compute_modified_path_bloom_hashes_for_path(path, len, - graph->num_modified_path_bloom_hashes, - pi->modified_path_bloom_hashes); + p = path; + hashes = pi->modified_path_bloom_hashes + + pi->modified_path_bloom_hashes_nr; + for (j = 0; j < path_component_nr; j++) { + p = strchrnul(p + 1, '/'); + + hashes -= graph->num_modified_path_bloom_hashes; + compute_modified_path_bloom_hashes_for_path(path, + p - path, + graph->num_modified_path_bloom_hashes, + hashes); + } } pathspec->can_use_modified_path_bloom_filters = 1; -- 2.27.0.rc1.431.g5c813f95dc