Garima Singh <garimasigit@xxxxxxxxx> writes: > On 1/20/2020 8:48 AM, Jakub Narebski wrote: >>> How often a file has been touched >>> also makes a difference. The performance gains are less dramatic if the >>> file has a very sparse history even if it is a deep file. >> >> This looks a bit strange (or maybe I don't understand something). >> >> Bloom filter can answer "no" and "maybe" to subset inclusion query. >> This means that if file was *not* changed, with great probability the >> answer from Bloom filter would be "no", and we would skip diff-ing >> trees (which may terminate early, though). >> >> On the other hand if file was changed by the commit, and the answer from >> a Bloom filter is "maybe", then we have to perform diffing to make sure. > > Yes. What I meant by statement however is that the performance gain i.e. > difference in performance between using and not using bloom filters, is not > always as dramatic if the history is sparse and the trees aren't touched > as often. So it is largely dependent on the shape of the repo and the shape > of the commit graph. It probably depends on the depth of changes in a typical skipped commit, I think. If we are getting history for the core/java/com/android/ims/internal/uce/presence/IPresenceListener.aidl and the change is contained in libs/input/ directory, we have to unpack only two trees, independent on the depth of the file we are asking about. It could be also possible, if Git is smart enough about it, to halt early if we are checking only if given path changed rather than calculating a full difftree. Say, for example, that the change was in core/java/org/apache/http/conn/ssl/SSLSocketFactory.java file, while we were getting the history for the following file core/java/com/android/ims/internal/uce/presence/IPresenceListener.aidl After unpacking two or three threes we know that the second file was not changed. But if we compute full diff, we have to unpack 8 trees. Quite a difference. All example paths above came from AOSP repository that was used for testing different proposed generation numbers v2, see https://github.com/derrickstolee/gen-test/blob/master/clone-repos.sh >>> The numbers from the git and linux repos for instance, are for files >>> closer to the root, hence 2x to 5x. >> >> That is quite nice speedup, anyway (git repository cannot be even >> considered large; medium -- maybe). > > Yeah. Git and Linux served as nice initial test beds. If you have any > suggestions for interesting repos it would be worth running performanc > investigations on, do let me know! If we want repositories with deep path hierarchy, Java projects with mandated directory structures might be a good choice, for example Android (AOSP): git clone https://android.googlesource.com/platform/frameworks/base/ android-base It is also quite large repository; in 2019 it had around 874000 commits, around the same as the Linux kernel repository. Another large repository is Chromium -- though I don't know if it has deep filesystem hierarchy. You can use the list of different large and large-ish repositories from https://github.com/derrickstolee/gen-test/blob/master/clone-repos.sh Other repositories with large number of commmits not on that list are LLVM Compiler, GCC (GNU Compiler Collection) -- just converted to Git, Homebrew, and Ruby on Rails. >> P.S. I wonder if it would be worth to create some synthetical repository >> to test performance gains of Bloom filters, perhaps in t/perf... >> > > I will look into this after I get v1 out on the mailing list. > Thanks! It would be nice to have, but it can wait. Keep up the good work! -- Jakub Narębski