On 1/20/2020 8:48 AM, Jakub Narebski wrote:
>> How often a file has been touched
>> also makes a difference. The performance gains are less dramatic if the
>> file has a very sparse history even if it is a deep file.
>
> This looks a bit strange (or maybe I don't understand something).
>
> Bloom filter can answer "no" and "maybe" to subset inclusion query.
> This means that if file was *not* changed, with great probability the
> answer from Bloom filter would be "no", and we would skip diff-ing
> trees (which may terminate early, though).
>
> On the other hand if file was changed by the commit, and the answer from
> a Bloom filter is "maybe", then we have to perform diffing to make sure.
>

Yes. What I meant by that statement, however, is that the performance gain,
i.e. the difference in performance between using and not using Bloom
filters, is not always as dramatic if the history is sparse and the trees
aren't touched as often. So it is largely dependent on the shape of the
repo and the shape of the commit graph.

>>
>> The numbers from the git and linux repos for instance, are for files
>> closer to the root, hence 2x to 5x.
>
> That is quite nice speedup, anyway (git repository cannot be even
> considered large; medium -- maybe).
>

Yeah. Git and Linux served as nice initial test beds. If you have any
suggestions for interesting repos that would be worth running performance
investigations on, do let me know!

>
> P.S. I wonder if it would be worth to create some synthetical repository
> to test performance gains of Bloom filters, perhaps in t/perf...
>

I will look into this after I get v1 out on the mailing list. Thanks!

Cheers
Garima Singh