Re: [PATCH v2 10/11] commit-graph: check all leading directories in changed path Bloom filters

Derrick Stolee <stolee@xxxxxxxxx> · Thu, 25 Jun 2020 11:05:04 -0400

On 6/25/2020 3:25 AM, René Scharfe wrote:
> Am 23.06.20 um 19:47 schrieb SZEDER Gábor via GitGitGadget:
>> From: =?UTF-8?q?SZEDER=20G=C3=A1bor?= <szeder.dev@xxxxxxxxx>
>>
>> The file 'dir/subdir/file' can only be modified if its leading
>> directories 'dir' and 'dir/subdir' are modified as well.
>>
>> So when checking modified path Bloom filters looking for commits
>> modifying a path with multiple path components, then check not only
>> the full path in the Bloom filters, but all its leading directories as
>> well.  Take care to check these paths in "deepest first" order,
>> because it's the full path that is least likely to be modified, and
>> the Bloom filter queries can short circuit sooner.
>>
>> This can significantly reduce the average false positive rate, by
>> about an order of magnitude or three(!), and can further speed up
>> pathspec-limited revision walks.  The table below compares the average
>> false positive rate and runtime of
>>
>>   git rev-list HEAD -- "$path"
>>
>> before and after this change for 5000+ randomly* selected paths from
>> each repository:
>>
>>                     Average false           Average        Average
>>                     positive rate           runtime        runtime
>>                   before     after     before     after   difference
>>   ------------------------------------------------------------------
>>   git             3.220%   0.7853%     0.0558s   0.0387s   -30.6%
>>   linux           2.453%   0.0296%     0.1046s   0.0766s   -26.8%
>>   tensorflow      2.536%   0.6977%     0.0594s   0.0420s   -29.2%
>>
>> *Path selection was done with the following pipeline:
>>
>> 	git ls-tree -r --name-only HEAD | sort -R | head -n 5000
>>
>> The improvements in runtime are much smaller than the improvements in
>> average false positive rate, as we are clearly reaching diminishing
>> returns here.  However, all these timings depend on that accessing
>> tree objects is reasonably fast (warm caches).  If we had a partial
>> clone and the tree objects had to be fetched from a promisor remote,
>> e.g.:
>>
>>   $ git clone --filter=tree:0 --bare file://.../webkit.git webkit.notrees.git
>>   $ git -C webkit.git -c core.modifiedPathBloomFilters=1 \
>>         commit-graph write --reachable
>>   $ cp webkit.git/objects/info/commit-graph webkit.notrees.git/objects/info/
>>   $ git -C webkit.notrees.git -c core.modifiedPathBloomFilters=1 \
>>         rev-list HEAD -- "$path"
>>
>> then checking all leading path component can reduce the runtime from
>> over an hour to a few seconds (and this is with the clone and the
>> promisor on the same machine).
>>
>> This adjusts the tracing values in t4216-log-bloom.sh, which provides a
>> concrete way to notice the improvement.
>>
>> Helped-by: Taylor Blau <me@xxxxxxxxxxxx>
>> Helped-by: René Scharfe <l.s.r@xxxxxx>
>> Signed-off-by: SZEDER Gábor <szeder.dev@xxxxxxxxx>
>> Signed-off-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx>
>> ---
>>  revision.c           | 41 ++++++++++++++++++++++++++++++++---------
>>  revision.h           |  6 ++++--
>>  t/t4216-log-bloom.sh |  2 +-
>>  3 files changed, 37 insertions(+), 12 deletions(-)
>>
>> diff --git a/revision.c b/revision.c
>> index b53377cd52..077888ee51 100644
>> --- a/revision.c
>> +++ b/revision.c
>> @@ -670,9 +670,10 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>>  {
>>  	struct pathspec_item *pi;
>>  	char *path_alloc = NULL;
>> -	const char *path;
>> +	const char *path, *p;
>>  	int last_index;
>> -	int len;
>> +	size_t len;
>> +	int path_component_nr = 1;
>>
>>  	if (!revs->commits)
>>  		return;
>> @@ -709,8 +710,28 @@ static void prepare_to_use_bloom_filter(struct rev_info *revs)
>>  		return;
>>  	}
>>
>> -	revs->bloom_key = xmalloc(sizeof(struct bloom_key));
>> -	fill_bloom_key(path, len, revs->bloom_key, revs->bloom_filter_settings);
>> +	p = path;
>> +	while (*p) {
>> +		if (is_dir_sep(*p))
>> +			path_component_nr++;
>> +		p++;
>> +	}
>> +
>> +	revs->bloom_keys_nr = path_component_nr;
>> +	ALLOC_ARRAY(revs->bloom_keys, revs->bloom_keys_nr);
>> +
>> +	fill_bloom_key(path, len, &revs->bloom_keys[0],
>> +		       revs->bloom_filter_settings);
>> +	path_component_nr = 1;
>> +
>> +	p = path + len - 1;
> 
> len cannot be 0 at this point, as patch 9 made sure, so this is safe.
> Good.
> 
>> +	while (p > path) {
>> +		if (is_dir_sep(*p))
>> +			fill_bloom_key(path, p - path,
>> +				       &revs->bloom_keys[path_component_nr++],
>> +				       revs->bloom_filter_settings);
>> +		p--;
>> +	}
> 
> This walks the directory hierarchy upwards and adds bloom filters for
> shorter and shorter paths, ("deepest first").  Good.
> 
> And it supports all directory separators.  On Windows that would be
> slash (/) and backslash (\).  I assume paths are normalized to use
> only slashes when bloom filters are written, correct?  Then the lookup
> side needs to normalize a given path to only use slashes as well,
> otherwise paths with backslashes cannot be found.  This part seems to
> be missing.

Yes, that's a good point. We _require_ the paths to be normalized
here to be Unix-style paths or else the Bloom filter keys are
incorrect. Thankfully, they are. Let's make that clear in-code by
using '/' instead of is_dir_sep().

Thanks,
-Stolee