Re: [PATCH 0/9] [RFC] Changed Paths Bloom Filters

Jakub Narebski <jnareb@xxxxxxxxx> · Sun, 02 Feb 2020 19:43:47 +0100

Garima Singh <garimasigit@xxxxxxxxx> writes:
> On 1/20/2020 8:48 AM, Jakub Narebski wrote:

>>>                                      How often a file has been touched
>>> also makes a difference. The performance gains are less dramatic if the 
>>> file has a very sparse history even if it is a deep file.
>> 
>> This looks a bit strange (or maybe I don't understand something).
>> 
>> Bloom filter can answer "no" and "maybe" to subset inclusion query.
>> This means that if file was *not* changed, with great probability the
>> answer from Bloom filter would be "no", and we would skip diff-ing
>> trees (which may terminate early, though).
>> 
>> On the other hand if file was changed by the commit, and the answer from
>> a Bloom filter is "maybe", then we have to perform diffing to make sure.
>
> Yes. What I meant by statement however is that the performance gain i.e. 
> difference in performance between using and not using bloom filters, is not 
> always as dramatic if the history is sparse and the trees aren't touched 
> as often. So it is largely dependent on the shape of the repo and the shape
> of the commit graph. 

It probably depends on the depth of changes in a typical skipped commit,
I think.

If we are getting history for the
core/java/com/android/ims/internal/uce/presence/IPresenceListener.aidl
and the change is contained in libs/input/ directory, we have to unpack
only two trees, independent on the depth of the file we are asking
about.

It could be also possible, if Git is smart enough about it, to halt
early if we are checking only if given path changed rather than
calculating a full difftree.  Say, for example, that the change was in
core/java/org/apache/http/conn/ssl/SSLSocketFactory.java file, while
we were getting the history for the following file
core/java/com/android/ims/internal/uce/presence/IPresenceListener.aidl
After unpacking two or three threes we know that the second file was not
changed.  But if we compute full diff, we have to unpack 8 trees.
Quite a difference.

All example paths above came from AOSP repository that was used for
testing different proposed generation numbers v2, see
https://github.com/derrickstolee/gen-test/blob/master/clone-repos.sh

>>> The numbers from the git and linux repos for instance, are for files 
>>> closer to the root, hence 2x to 5x. 
>> 
>> That is quite nice speedup, anyway (git repository cannot be even
>> considered large; medium -- maybe).
>
> Yeah. Git and Linux served as nice initial test beds. If you have any 
> suggestions for interesting repos it would be worth running performanc 
> investigations on, do let me know! 

If we want repositories with deep path hierarchy, Java projects with
mandated directory structures might be a good choice, for example
Android (AOSP):

  git clone https://android.googlesource.com/platform/frameworks/base/ android-base

It is also quite large repository; in 2019 it had around 874000 commits,
around the same as the Linux kernel repository.

Another large repository is Chromium -- though I don't know if it has
deep filesystem hierarchy.

You can use the list of different large and large-ish repositories from
https://github.com/derrickstolee/gen-test/blob/master/clone-repos.sh
Other repositories with large number of commmits not on that list are
LLVM Compiler, GCC (GNU Compiler Collection) -- just converted to Git,
Homebrew, and Ruby on Rails.

>> P.S. I wonder if it would be worth to create some synthetical repository
>> to test performance gains of Bloom filters, perhaps in t/perf...
>> 
>
> I will look into this after I get v1 out on the mailing list. 
> Thanks! 

It would be nice to have, but it can wait.

Keep up the good work!
-- 
Jakub Narębski