Re: [PATCH 0/4] Bloom filter experiment

On 10/16/2018 12:45 AM, Junio C Hamano wrote:
> Derrick Stolee <stolee@xxxxxxxxx> writes:
>
>> 2. The filters are sized according to the number of changes in each
>> commit, with a minimum of one 64-bit word.
>> ...
>> 6. When we compute the Bloom filters, we don't store a filter for
>> commits whose first-parent diff has more than 512 paths.
>
> Just being curious but was 512 taken out of thin air or is there
> some math behind it, e.g. to limit false positive rate down to
> certain threshold?  With a wide-enough bitset, you could store
> arbitrary large number of paths with low enough false positive, I
> guess, but is there a point where there is too many paths in the
> change that gives us diminishing returns and not worth having a
> filter in the first place?
>
> 512 is somewhat arbitrary, but having a maximum size is not.
> In a normal source-code-control context, the set of paths modified
> by any single commit ought to be a small subset of the entire paths,
> and whole-tree changes ought to be fairly rare.  In a project for
> which that assumption does not hold, it might help to have a
> negative bloom filter (i.e. "git log -- A" asks "does the commit
> modify A?" and the filter would say "we know it does not, because we
> threw all the paths that are not touched to the bloom filter"), but
> I think that would optimize for a wrong case.

A commit with many changed paths is very rare. The 512 I picked above is enough to cover 99% of commits in each of the repos I sampled when first investigating Bloom filters.

When a Bloom filter responds "maybe yes" (in our case, "maybe not TREESAME"), we still need to verify that answer with the real diff. In the extreme case that every path is changed, the Bloom filter does nothing but add extra work.
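To make that concrete, here is a toy, self-contained sketch of the query flow (hypothetical names, a single 64-bit filter word, and a stub standing in for the real first-parent diff; none of this is the code from the series). The point is only the control flow: a "no" answer is trusted and lets us skip the diff, while any "maybe" answer, true positive or false positive, falls through to the real check:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Toy filter: one 64-bit word, two hash probes per path. */
struct tiny_bloom {
	uint64_t bits;
};

static uint64_t hash_str(const char *s, uint64_t seed)
{
	uint64_t h = seed ^ 14695981039346656037ULL;	/* FNV-1a style */
	while (*s)
		h = (h ^ (uint64_t)(unsigned char)*s++) * 1099511628211ULL;
	return h;
}

static void bloom_add(struct tiny_bloom *bf, const char *path)
{
	bf->bits |= 1ULL << (hash_str(path, 1) & 63);
	bf->bits |= 1ULL << (hash_str(path, 2) & 63);
}

/* 0 means "definitely not changed"; 1 means "maybe changed, go verify". */
static int bloom_maybe_contains(const struct tiny_bloom *bf, const char *path)
{
	return (bf->bits & (1ULL << (hash_str(path, 1) & 63))) &&
	       (bf->bits & (1ULL << (hash_str(path, 2) & 63)));
}

/* Stand-in for the real first-parent diff / TREESAME computation. */
static int diff_says_changed(const char *path)
{
	return !strcmp(path, "Makefile");
}

int main(void)
{
	struct tiny_bloom bf = { 0 };
	const char *queries[] = { "Makefile", "README", "bloom.c" };
	int i;

	bloom_add(&bf, "Makefile");	/* the only path this commit changes */

	for (i = 0; i < 3; i++) {
		const char *p = queries[i];
		if (!bloom_maybe_contains(&bf, p))
			printf("%-10s definitely unchanged, diff skipped\n", p);
		else if (diff_says_changed(p))
			printf("%-10s changed, confirmed by the real diff\n", p);
		else
			printf("%-10s false positive, real diff ran for nothing\n", p);
	}
	return 0;
}

In the pathological case where every path is changed, almost every query lands in one of the last two branches, so the filter only adds work.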

These extreme cases are also not unprecedented: in our Azure Repos codebase, we were using core.autocrlf to smudge CRLFs to LFs, but when it was time to dogfood VFS for Git, we needed to turn off the smudge filter. So there is one commit that converts every LF to a CRLF in every text file. Storing a Bloom filter for those ~250,000 entries would take ~256KB for essentially no value. By not storing a filter for this commit, we go immediately to the regular TREESAME check, which is what would happen for most pathspecs anyway.
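To put a rough number on it (back-of-the-envelope; the exact bits-per-path constant isn't restated here, so take ~8 bits per changed path as an assumption): 250,000 paths * 8 bits = 2,000,000 bits, or about 244KB, in the same ballpark as the ~256KB above; at 10 bits per path it would be closer to 305KB. Either way, that is a quarter megabyte spent on a filter that would answer "maybe" for nearly every pathspec, sending us to the real diff anyway.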

This is all to say: having a maximum size is good. 512 is big enough to cover _most_ commits, but small enough that we never store _really_ big filters.

Thanks,
-Stolee



