On Fri, Sep 18, 2020 at 12:13:02AM +0200, SZEDER Gábor wrote: > > Clients will invoke the Bloom machinery in more cases than before, but > > this can be addressed by returning a NULL filter when all bits are set > > high. This can be addressed in a future patch. > > OTOH, clients will invoke the tree-diff machinery in fewer cases than > before, because querying the Bloom filter of commits not modifying any > files will now return "definitely not". Absolutely right. > > Finally, note that this does increase the size of on-disk commit-graphs, > > but far less than other proposals. In particular, this is generally more > > efficient than storing a bitmap for which commits haven't computed their > > Bloom filters. Storing a bitmap incurs a penalty of one bit per commit, > > whereas storing explicit filters as above incurs a penalty of one byte > > per too-large or too-small commit. > > s/too-small/empty/ Fair enough, although I'm not planning to alter this or any other patch now that it's picked up unless there's a real show-stopper (this doesn't seem like one). > > In practice, these boundary commits likely occupy a small proportion of > > the overall number of commits, and so the size penalty is likely smaller > > than storing a bitmap for all commits. > > | Percentage of > | commits modifying > | 0 path | >= 512 paths > ---------------+------------+---------------- > android-base | 13.20% | 0.13% > cmssw | 0.15% | 0.23% > cpython | 3.07% | 0.01% > elasticsearch | 0.70% | 1.00% > gcc | 0.00% | 0.08% > gecko-dev | 0.14% | 0.64% > git | 0.11% | 0.02% > glibc | 0.02% | 0.10% > go | 0.00% | 0.07% > homebrew-cask | 0.40% | 0.02% > homebrew-core | 0.01% | 0.01% > jdk | 0.26% | 5.64% > linux | 0.01% | 0.51% > llvm-project | 0.12% | 0.03% > rails | 0.10% | 0.10% > rust | 0.07% | 0.17% > tensorflow | 0.09% | 1.02% > webkit | 0.05% | 0.31% This is very useful information to have! Without the total number of commits, it's impossible to know whether or not this is a win over the BFXL chunk. But, since the number of commits is probably "large" versus the percentage of boundary commits which is "small", it's almost certainly an advantage. > > A test to exercise filters which contain too many changed path entries > > will be introduced in a subsequent patch. > > > > diff --git a/bloom.h b/bloom.h > > index c6d77e8393..70a8840896 100644 > > --- a/bloom.h > > +++ b/bloom.h > > @@ -93,6 +93,7 @@ enum bloom_filter_computed { > > BLOOM_NOT_COMPUTED = (1 << 0), > > BLOOM_COMPUTED = (1 << 1), > > BLOOM_TRUNC_LARGE = (1 << 2), > > + BLOOM_TRUNC_SMALL = (1 << 3), > > s/SMALL/EMPTY/ > > This "small" suffix in the constant, variable, and trace2 key names is > misleading, because we only mean empty commits. I could buy that it might be misleading; I only picked this since it was the opposite of "large". You could imagine that BLOOM_TRUNC_X means "truncated in the direction of 'x'", but to be honest I don't think that this matters. I understand the churn of coming back to this after the topic has already been merged creates more hassle, but frankly this series has already gone on for quite a while, and it has been holding up important bug fixes that are unrelated to the main feature. So, I think that if it's truly misleading, we could revisit this after the topic is merged. But, I'm not planning on changing anything at this point. Thanks, Taylor