Re: [PATCH 0/9] [RFC] Changed Paths Bloom Filters

Jakub Narebski <jnareb@xxxxxxxxx> · Mon, 20 Jan 2020 14:48:19 +0100

Garima Singh <garimasigit@xxxxxxxxx> writes:
> On 12/31/2019 11:45 AM, Jakub Narebski wrote:
>> "Garima Singh via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:
>>>
>>> Performance Gains: We tested the performance of 'git log -- <path>' on the git
>>> repo, the linux repo and some internal large repos, with a variety of paths
>>> of varying depths.
>>>
>>> On the git and linux repos: We observed a 2x to 5x speed up.
>>>
>>> On a large internal repo with files seated 6-10 levels deep in the tree: We
>>> observed 10x to 20x speed ups, with some paths going up to 28 times faster.
>> 
>> Could you provide some more statistics about this internal repository,
>> such as number of files, number of commits, perhaps also number of all
>> objects?  Thanks in advance.
>> 
>> I wonder why such large difference in performance 2-5x vs 10-20x.  Is it
>> about the depth of the file hierarchy?  How would the numbers look for
>> files seated closer to the root in the same large repository, like 3-5
>> levels deep in the tree?
>
> The internal repository we saw these massive gains on has:
> - 413579 commits. 
> - 183303 files distributed across 34482 folders
> The size on disk is about 17 GiB. 

Thank you for the data.  Such information is an important
consideration when deciding whether enabling Bloom filters in a
given repository would be worth it.

> And yes, the difference in performance gains is mostly because of how 
> deep the files were in the hierarchy.

Right, this is understandable.  If files are deep in the hierarchy, then we
have to unpack more tree objects to find out if the file was changed in
a given commit (provided that finding differences does not terminate early
thanks to the hierarchical structure of tree objects).
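
To illustrate the point about depth, here is a toy model (not Git's
actual tree-diff code).  It assumes trees are nested dicts, blobs are
strings standing in for object ids, and that comparing two entries for
equality stands in for comparing their object ids.  The helper name
trees_unpacked() is made up for this sketch.

```python
def trees_unpacked(old_tree, new_tree, path):
    """Return (number of tree pairs opened, whether `path` differs).

    Toy model: dict equality stands in for "same object id".
    """
    if old_tree == new_tree:
        return 0, False            # same root tree id: nothing to open
    opened = 0
    for name in path.split("/"):
        opened += 1                # open this pair of trees to look up `name`
        old_entry = old_tree.get(name)
        new_entry = new_tree.get(name)
        if old_entry == new_entry:
            return opened, False   # identical entry: the walk stops early
        if not isinstance(old_entry, dict) or not isinstance(new_entry, dict):
            return opened, True    # blob changed, or entry added/removed
        old_tree, new_tree = old_entry, new_entry
    return opened, True            # `path` names a directory that differs


# A commit that touches only a sibling of the file we are interested in.
old = {"README": "r1",
       "a": {"b": {"c": {"file.txt": "v1", "other.txt": "x1"}}}}
new = {"README": "r1",
       "a": {"b": {"c": {"file.txt": "v1", "other.txt": "x2"}}}}

deep = trees_unpacked(old, new, "a/b/c/file.txt")
shallow = trees_unpacked(old, new, "README")
```

In this model the deep path costs four opened tree pairs just to learn
that the file was not changed, while the path at the root costs one.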

>                                      How often a file has been touched
> also makes a difference. The performance gains are less dramatic if the 
> file has a very sparse history even if it is a deep file.

This looks a bit strange (or maybe I don't understand something).

A Bloom filter can answer "no" or "maybe" to a subset inclusion query.
This means that if a file was *not* changed, with high probability the
answer from the Bloom filter would be "no", and we would skip diffing
trees (which may terminate early, though).

On the other hand, if the file was changed by the commit, and the answer
from the Bloom filter is "maybe", then we have to perform diffing to make
sure.
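
The "no" / "maybe" behaviour can be sketched as below.  This is only a
minimal illustration, not the implementation from the patch series: the
class and function names, the use of SHA-256 for hashing, the number of
hash functions, the bits per entry, and the choice to also add leading
directories are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=64, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0                      # bit set kept in a Python int

    def _positions(self, key):
        # Double hashing: derive all bit positions from two base hashes.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.num_bits
                for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def maybe_contains(self, key):
        # False means "definitely not added"; True means only "maybe".
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def commit_filter(changed_paths, bits_per_entry=10):
    """Build one filter per commit from the paths that commit changed."""
    entries = set()
    for path in changed_paths:
        parts = path.split("/")
        for i in range(1, len(parts) + 1):
            entries.add("/".join(parts[:i]))   # the path and its directories
    bloom = BloomFilter(num_bits=max(8, bits_per_entry * len(entries)))
    for entry in entries:
        bloom.add(entry)
    return bloom


def needs_tree_diff(bloom, path):
    # "no" lets us skip the commit; "maybe" still requires the real diff.
    return bloom.maybe_contains(path)


bloom = commit_filter(["a/b/c/file.txt"])
changed = needs_tree_diff(bloom, "a/b/c/file.txt")
```

A path that was added always gets "maybe" (there are no false negatives),
so the filter can only save work for commits that did not touch the path.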

>
> The numbers from the git and linux repos for instance, are for files 
> closer to the root, hence 2x to 5x. 

That is quite a nice speedup, anyway (the git repository cannot even be
considered large; medium -- maybe).

P.S. I wonder if it would be worth creating a synthetic repository
to test the performance gains of Bloom filters, perhaps in t/perf...

Best,
-- 
Jakub Narębski