Derrick Stolee <stolee@xxxxxxxxx> writes: > 2. The filters are sized according to the number of changes in each > commit, with a minimum of one 64-bit word. > ... > 6. When we compute the Bloom filters, we don't store a filter for > commits whose first-parent diff has more than 512 paths. Just being curious but was 512 taken out of thin air or is there some math behind it, e.g. to limit false positive rate down to certain threshold? With a wide-enough bitset, you could store arbitrary large number of paths with low enough false positive, I guess, but is there a point where there is too many paths in the change that gives us diminishing returns and not worth having a filter in the first place? In a normal source-code-control context, the set of paths modified by any single commit ought to be a small subset of the entire paths, and whole-tree changes ought to be fairly rare. In a project for which that assumption does not hold, it might help to have a negative bloom filter (i.e. "git log -- A" asks "does the commit modify A?" and the filter would say "we know it does not, because we threw all the paths that are not touched to the bloom filter"), but I think that would optimize for a wrong case.