On 10/16/2018 12:45 AM, Junio C Hamano wrote:
Derrick Stolee <stolee@xxxxxxxxx> writes:
2. The filters are sized according to the number of changes in each
commit, with a minimum of one 64-bit word.
...
6. When we compute the Bloom filters, we don't store a filter for
commits whose first-parent diff has more than 512 paths.
Just being curious but was 512 taken out of thin air or is there
some math behind it, e.g. to limit false positive rate down to
certain threshold? With a wide-enough bitset, you could store
arbitrary large number of paths with low enough false positive, I
guess, but is there a point where there is too many paths in the
change that gives us diminishing returns and not worth having a
filter in the first place?
512 is somewhat arbitrary, but having a maximum size is not.
In a normal source-code-control context, the set of paths modified
by any single commit ought to be a small subset of the entire paths,
and whole-tree changes ought to be fairly rare. In a project for
which that assumption does not hold, it might help to have a
negative bloom filter (i.e. "git log -- A" asks "does the commit
modify A?" and the filter would say "we know it does not, because we
threw all the paths that are not touched to the bloom filter"), but
I think that would optimize for a wrong case.
A commit with many changed paths is very rare. The 512 I picked above is
enough to cover 99% of commits in each of the repos I sampled when first
investigating Bloom filters.
When a Bloom filter response says "maybe yes" (in our case, "maybe not
TREESAME"), then we need to verify that it is correct. In the extreme
case that every path is changed, then the Bloom filter does nothing but
add extra work.
These extreme cases are also not unprecedented: in our Azure Repos
codebase, we were using core.autocrlf to smudge CRLFs to LFs, but when
it was time to dogfood VFS for Git, we needed to turn off the smudge
filter. So, there is one commit that converts every LF to a CRLF in
every text file. Storing a Bloom filter for those ~250,000 entries would
take ~256KB for essentially no value. By not storing a filter for this
commit, we go immediately to the regular TREESAME check, which would
happen for most pathspecs.
This is all to say: having a maximum size is good. 512 is big enough to
cover _most_ commits, but not so big that we may store _really_ big filters.
Thanks,
-Stolee