Hi Stolee, On Mon, Jun 15, 2020 at 08:14:50PM +0000, SZEDER Gábor via GitGitGadget wrote: > From: =?UTF-8?q?SZEDER=20G=C3=A1bor?= <szeder.dev@xxxxxxxxx> > > The file 'dir/subdir/file' can only be modified if its leading > directories 'dir' and 'dir/subdir' are modified as well. > > So when checking modified path Bloom filters looking for commits > modifying a path with multiple path components, then check not only > the full path in the Bloom filters, but all its leading directories as > well. Take care to check these paths in "deepest first" order, > because it's the full path that is least likely to be modified, and > the Bloom filter queries can short circuit sooner. > > This can significantly reduce the average false positive rate, by > about an order of magnitude or three(!), and can further speed up > pathspec-limited revision walks. The table below compares the average > false positive rate and runtime of > > git rev-list HEAD -- "$path" > > before and after this change for 5000+ randomly* selected paths from > each repository: > > Average false Average Average > positive rate runtime runtime > before after before after difference > ------------------------------------------------------------------ > git 3.220% 0.7853% 0.0558s 0.0387s -30.6% > linux 2.453% 0.0296% 0.1046s 0.0766s -26.8% > tensorflow 2.536% 0.6977% 0.0594s 0.0420s -29.2% > > *Path selection was done with the following pipeline: > > git ls-tree -r --name-only HEAD | sort -R | head -n 5000 > > The improvements in runtime are much smaller than the improvements in > average false positive rate, as we are clearly reaching diminishing > returns here. However, all these timings depend on that accessing > tree objects is reasonably fast (warm caches). If we had a partial > clone and the tree objects had to be fetched from a promisor remote, > e.g.: > > $ git clone --filter=tree:0 --bare file://.../webkit.git webkit.notrees.git > $ git -C webkit.git -c core.modifiedPathBloomFilters=1 \ > commit-graph write --reachable > $ cp webkit.git/objects/info/commit-graph webkit.notrees.git/objects/info/ > $ git -C webkit.notrees.git -c core.modifiedPathBloomFilters=1 \ > rev-list HEAD -- "$path" > > then checking all leading path component can reduce the runtime from > over an hour to a few seconds (and this is with the clone and the > promisor on the same machine). > > This adjusts the tracing values in t4216-log-bloom.sh, which provides a > concrete way to notice the improvement. > > Signed-off-by: SZEDER Gábor <szeder.dev@xxxxxxxxx> > Signed-off-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx> > --- > revision.c | 35 ++++++++++++++++++++++++++--------- > revision.h | 6 ++++-- > t/t4216-log-bloom.sh | 2 +- > 3 files changed, 31 insertions(+), 12 deletions(-) > > diff --git a/revision.c b/revision.c > index c644c660917..027ae3982b4 100644 > --- a/revision.c > +++ b/revision.c > @@ -670,9 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs) > { > struct pathspec_item *pi; > char *path_alloc = NULL; > - const char *path; > + const char *path, *p; > int last_index; > - int len; > + size_t len; > + int path_component_nr = 0, j; > > if (!revs->commits) > return; > @@ -705,8 +706,22 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs) > > len = strlen(path); > > - revs->bloom_key = xmalloc(sizeof(struct bloom_key)); > - fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings); > + p = path; > + do { > + p = strchrnul(p + 1, '/'); > + path_component_nr++; > + } while (p - path < len); > + > + revs->bloom_keys_nr = path_component_nr; > + ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr); > + > + p = path; > + for (j = 0; j < revs->bloom_keys_nr; j++) { > + p = strchrnul(p + 1, '/'); > + > + fill_bloom_key(path, p - path, &revs->bloom_keys[j], > + revs->bloom_filter_settings); > + } > Somewhat related to our off-list discussion yesterday, there is a bug in both 2.27 and this patch which produces incorrect results when (1) Bloom filters are enabled, and (2) we are doing a revision walk from root with the pathspec '.'. What appears to be going on is that our normalization takes '.' -> '', and then we form a Bloom key based on the empty string, which will return 'definitely not' when querying the Bloom filter some of the time, which should never happen. This is a consequence of never inserting the empty key into the Bloom filter upon generation. As a result, I have patched this in GitHub's fork (which is currently based on 2.27 and doesn't have these patches yet) by doing an early return when 'strlen(path) == 0'. Since it looks like these patches are going to land, here is some clean-up and a fix for the bug that you should feel free to test with and apply on top: --- >8 --- diff --git a/revision.c b/revision.c index 8bd383b1dd..123e72698d 100644 --- a/revision.c +++ b/revision.c @@ -670,10 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs) { struct pathspec_item *pi; char *path_alloc = NULL; - const char *path, *p; + char *path, *p; int last_index; size_t len; - int path_component_nr = 0, j; + int path_component_nr = 1, j; if (!revs->commits) return; @@ -698,29 +698,33 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs) /* remove single trailing slash from path, if needed */ if (pi->match[last_index] == '/') { - path_alloc = xstrdup(pi->match); - path_alloc[last_index] = '\0'; - path = path_alloc; - } else - path = pi->match; + path_alloc = xstrdup(pi->match); + path_alloc[last_index] = '\0'; + path = path_alloc; + } else { + path = pi->match; + len = pi->len; + } - len = strlen(path); + if (!len) + return; - p = path; do { - p = strchrnul(p + 1, '/'); - path_component_nr++; - } while (p - path < len); + if (is_dir_sep(*p)) { + *p = '\0'; + path_component_nr++; + } + } while (*p++); revs->bloom_keys_nr = path_component_nr; ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr); p = path; for (j = 0; j < revs->bloom_keys_nr; j++) { - p = strchrnul(p + 1, '/'); - - fill_bloom_key(path, p - path, &revs->bloom_keys[j], + size_t plen = strlen(p); + fill_bloom_key(p, plen, &revs->bloom_keys[j], revs->bloom_filter_settings); + p += plen; } if (trace2_is_enabled() && !bloom_filter_atexit_registered) { > if (trace2_is_enabled() && !bloom_filter_atexit_registered) { > atexit(trace2_bloom_filter_statistics_atexit); > @@ -720,7 +735,7 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs, > struct commit *commit) > { > struct bloom_filter *filter; > - int result; > + int result = 1, j; > > if (!revs->repo->objects->commit_graph) > return -1; > @@ -740,9 +755,11 @@ static int check_maybe_different_in_bloom_filter(struct rev_info *revs, > return -1; > } > > - result = bloom_filter_contains(filter, > - revs->bloom_key, > - revs->bloom_filter_settings); > + for (j = 0; result && j < revs->bloom_keys_nr; j++) { > + result = bloom_filter_contains(filter, > + &revs->bloom_keys[j], > + revs->bloom_filter_settings); > + } > > if (result) > count_bloom_filter_maybe++; > @@ -782,7 +799,7 @@ static int rev_compare_tree(struct rev_info *revs, > return REV_TREE_SAME; > } > > - if (revs->bloom_key && !nth_parent) { > + if (revs->bloom_keys_nr && !nth_parent) { > bloom_ret = check_maybe_different_in_bloom_filter(revs, commit); > > if (bloom_ret == 0) > diff --git a/revision.h b/revision.h > index 7c026fe41fc..abbfb4ab59a 100644 > --- a/revision.h > +++ b/revision.h > @@ -295,8 +295,10 @@ struct rev_info { > struct topo_walk_info *topo_walk_info; > > /* Commit graph bloom filter fields */ > - /* The bloom filter key for the pathspec */ > - struct bloom_key *bloom_key; > + /* The bloom filter key(s) for the pathspec */ > + struct bloom_key *bloom_keys; > + int bloom_keys_nr; > + > /* > * The bloom filter settings used to generate the key. > * This is loaded from the commit-graph being used. > diff --git a/t/t4216-log-bloom.sh b/t/t4216-log-bloom.sh > index c7011f33e2c..c13b97d3bda 100755 > --- a/t/t4216-log-bloom.sh > +++ b/t/t4216-log-bloom.sh > @@ -142,7 +142,7 @@ test_expect_success 'setup - add commit-graph to the chain with Bloom filters' ' > > test_bloom_filters_used_when_some_filters_are_missing () { > log_args=$1 > - bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":8,\"definitely_not\":6" > + bloom_trace_prefix="statistics:{\"filter_not_present\":3,\"zero_length_filter\":0,\"maybe\":6,\"definitely_not\":8" > setup "$log_args" && > grep -q "$bloom_trace_prefix" "$TRASH_DIRECTORY/trace.perf" && > test_cmp log_wo_bloom log_w_bloom > -- > gitgitgadget > Thanks, Taylor