(Finally getting back to this review after working on some other
topics for a while; sorry for the late response.)

On Wed, Aug 19, 2020 at 03:35:26PM +0200, SZEDER Gábor wrote:
> On Tue, Aug 11, 2020 at 04:52:07PM -0400, Taylor Blau wrote:
> > When a commit has more than a certain number of changed paths
> > (commonly 512), the commit-graph machinery represents it as a
> > zero-length filter. This is done since having many entries in the
> > Bloom filter has undesirable effects on the false positivity rate.
>
> This is not the case, the false positive probability depends on the
> ratio of the Bloom filter's size and the number of elements it
> contains, and we size the filters proportional to the number of
> elements they contain, so the number of elements shouldn't affect
> the false positive rate.

I'm not sure that I understand. I agree that the FPR depends on the
ratio between the number of elements in the filter and the filter's
"size". But consider a Bloom filter that is too small to faithfully
represent all of its elements. Such a filter would likely have all of
its bits set, in which case every query would return "maybe", and the
FPR would go up.

> On the contrary, it's the small filters, up to around 30-35 bytes,
> that tend to have larger than expected false positive rate when
> using double hashing.

I agree that small filters suffer from the same problem, but I think
this is an "in addition", not an "on the contrary".

In either case, I don't think that this is an important detail for
the commit message. What matters is the representation (that we
truncate filters with >= 512 elements to length zero), not why (that
can be found in another commit). I'd have expected to find the
rationale in ed591febb4 (bloom.c: core Bloom filter implementation
for changed paths., 2020-03-30), but I couldn't find anything there.
So I'll drop this sentence entirely to avoid an unimportant detail.

Thanks,
Taylor
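
P.S.: To convince myself of the "ratio" point, I threw together a
quick back-of-the-envelope program. This is not git code; it just
evaluates the standard approximation fpr = (1 - e^(-k*n/m))^k,
assuming our defaults of 10 bits and 7 hashes per path:

  #include <math.h>
  #include <stdio.h>

  int main(void)
  {
          const double bits_per_entry = 10.0; /* git's default */
          const double k = 7.0; /* git's default number of hashes */
          unsigned n;

          /*
           * Double the number of changed paths each iteration; the
           * filter is sized proportionally, so the predicted false
           * positive rate stays flat.
           */
          for (n = 8; n <= 512; n *= 2) {
                  double m = bits_per_entry * n;
                  double fpr = pow(1.0 - exp(-k * n / m), k);
                  printf("n = %3u paths, m = %4.0f bits, fpr = %.4f\n",
                         n, m, fpr);
          }
          return 0;
  }

Every row prints the same fpr (~0.0082), which matches your point that
proportional sizing keeps the rate constant, with the caveat that this
approximation ignores the saturation and double-hashing effects we
discussed above.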