On 6/25/2020 3:25 AM, René Scharfe wrote: > Am 23.06.20 um 19:47 schrieb SZEDER Gábor via GitGitGadget: >> From: =?UTF-8?q?SZEDER=20G=C3=A1bor?= <szeder.dev@xxxxxxxxx> >> >> The file 'dir/subdir/file' can only be modified if its leading >> directories 'dir' and 'dir/subdir' are modified as well. >> >> So when checking modified path Bloom filters looking for commits >> modifying a path with multiple path components, then check not only >> the full path in the Bloom filters, but all its leading directories as >> well. Take care to check these paths in "deepest first" order, >> because it's the full path that is least likely to be modified, and >> the Bloom filter queries can short circuit sooner. >> >> This can significantly reduce the average false positive rate, by >> about an order of magnitude or three(!), and can further speed up >> pathspec-limited revision walks. The table below compares the average >> false positive rate and runtime of >> >> git rev-list HEAD -- "$path" >> >> before and after this change for 5000+ randomly* selected paths from >> each repository: >> >> Average false Average Average >> positive rate runtime runtime >> before after before after difference >> ------------------------------------------------------------------ >> git 3.220% 0.7853% 0.0558s 0.0387s -30.6% >> linux 2.453% 0.0296% 0.1046s 0.0766s -26.8% >> tensorflow 2.536% 0.6977% 0.0594s 0.0420s -29.2% >> >> *Path selection was done with the following pipeline: >> >> git ls-tree -r --name-only HEAD | sort -R | head -n 5000 >> >> The improvements in runtime are much smaller than the improvements in >> average false positive rate, as we are clearly reaching diminishing >> returns here. However, all these timings depend on that accessing >> tree objects is reasonably fast (warm caches). If we had a partial >> clone and the tree objects had to be fetched from a promisor remote, >> e.g.: >> >> $ git clone --filter=tree:0 --bare file://.../webkit.git webkit.notrees.git >> $ git -C webkit.git -c core.modifiedPathBloomFilters=1 \ >> commit-graph write --reachable >> $ cp webkit.git/objects/info/commit-graph webkit.notrees.git/objects/info/ >> $ git -C webkit.notrees.git -c core.modifiedPathBloomFilters=1 \ >> rev-list HEAD -- "$path" >> >> then checking all leading path component can reduce the runtime from >> over an hour to a few seconds (and this is with the clone and the >> promisor on the same machine). >> >> This adjusts the tracing values in t4216-log-bloom.sh, which provides a >> concrete way to notice the improvement. >> >> Helped-by: Taylor Blau <me@xxxxxxxxxxxx> >> Helped-by: René Scharfe <l.s.r@xxxxxx> >> Signed-off-by: SZEDER Gábor <szeder.dev@xxxxxxxxx> >> Signed-off-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx> >> --- >> revision.c | 41 ++++++++++++++++++++++++++++++++--------- >> revision.h | 6 ++++-- >> t/t4216-log-bloom.sh | 2 +- >> 3 files changed, 37 insertions(+), 12 deletions(-) >> >> diff --git a/revision.c b/revision.c >> index b53377cd52..077888ee51 100644 >> --- a/revision.c >> +++ b/revision.c >> @@ -670,9 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs) >> { >> struct pathspec_item *pi; >> char *path_alloc = NULL; >> - const char *path; >> + const char *path, *p; >> int last_index; >> - int len; >> + size_t len; >> + int path_component_nr = 1; >> >> if (!revs->commits) >> return; >> @@ -709,8 +710,28 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs) >> return; >> } >> >> - revs->bloom_key = xmalloc(sizeof(struct bloom_key)); >> - fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings); >> + p = path; >> + while (*p) { >> + if (is_dir_sep(*p)) >> + path_component_nr++; >> + p++; >> + } >> + >> + revs->bloom_keys_nr = path_component_nr; >> + ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr); >> + >> + fill_bloom_key(path, len, &revs->bloom_keys[0], >> + revs->bloom_filter_settings); >> + path_component_nr = 1; >> + >> + p = path + len - 1; > > len cannot be 0 at this point, as patch 9 made sure, so this is safe. > Good. > >> + while (p > path) { >> + if (is_dir_sep(*p)) >> + fill_bloom_key(path, p - path, >> + &revs->bloom_keys[path_component_nr++], >> + revs->bloom_filter_settings); >> + p--; >> + } > > This walks the directory hierarchy upwards and adds bloom filters for > shorter and shorter paths, ("deepest first"). Good. > > And it supports all directory separators. On Windows that would be > slash (/) and backslash (\). I assume paths are normalized to use > only slashes when bloom filters are written, correct? Then the lookup > side needs to normalize a given path to only use slashes as well, > otherwise paths with backslashes cannot be found. This part seems to > be missing. Yes, that's a good point. We _require_ the paths to be normalized here to be Unix-style paths or else the Bloom filter keys are incorrect. Thankfully, they are. Let's make that clear in-code by using '/' instead of is_dir_sep(). Thanks, -Stolee