Re: [PATCH 0/4] Bloom filter experiment

On 10/16/2018 8:57 AM, Ævar Arnfjörð Bjarmason wrote:
> On Tue, Oct 16 2018, Derrick Stolee wrote:
>
>> On 10/16/2018 12:45 AM, Junio C Hamano wrote:
>>> Derrick Stolee <stolee@xxxxxxxxx> writes:
>>>
>>>> 2. The filters are sized according to the number of changes in each
>>>> commit, with a minimum of one 64-bit word.
>>>> ...
>>>> 6. When we compute the Bloom filters, we don't store a filter for
>>>> commits whose first-parent diff has more than 512 paths.
>>> Just being curious, but was 512 taken out of thin air, or is there
>>> some math behind it, e.g. to limit the false-positive rate below a
>>> certain threshold?  With a wide-enough bitset, you could store an
>>> arbitrarily large number of paths with a low enough false-positive
>>> rate, I guess, but is there a point where there are so many paths in
>>> the change that we get diminishing returns and it is not worth having
>>> a filter in the first place?
>> 512 is somewhat arbitrary, but having a maximum size is not.
>>> In a normal source-code-control context, the set of paths modified
>>> by any single commit ought to be a small subset of the entire set of
>>> paths, and whole-tree changes ought to be fairly rare.  In a project
>>> for which that assumption does not hold, it might help to have a
>>> negative Bloom filter (i.e. "git log -- A" asks "does the commit
>>> modify A?" and the filter would say "we know it does not, because we
>>> threw all the paths that are not touched into the Bloom filter"), but
>>> I think that would optimize for a wrong case.
>> A commit with many changed paths is very rare. The 512 I picked above
>> is enough to cover 99% of commits in each of the repos I sampled when
>> first investigating Bloom filters.
>>
>> When a Bloom filter response says "maybe yes" (in our case, "maybe not
>> TREESAME"), we need to verify that it is correct. In the extreme
>> case where every path is changed, the Bloom filter does nothing
>> but add extra work.

>> These extreme cases are also not unprecedented: in our Azure Repos
>> codebase, we were using core.autocrlf to smudge CRLFs to LFs, but
>> when it was time to dogfood VFS for Git, we needed to turn off the
>> smudge filter. So, there is one commit that converts every LF to a
>> CRLF in every text file. Storing a Bloom filter for those ~250,000
>> entries would take ~256KB for essentially no value. By not storing a
>> filter for this commit, we go immediately to the regular TREESAME
>> check, which would happen for most pathspecs.
>>
>> This is all to say: having a maximum size is good. 512 is big enough
>> to cover _most_ commits, but not so big that we may store _really_ big
>> filters.
> Makes sense. 512 is good enough to hardcode initially, but I couldn't
> tell from briefly skimming the patches whether it is possible to make
> this size dynamic per-repo when the graph/filter is written.
My proof-of-concept has it as a constant, but part of my plan is to make
all of these config options, as in this item from my message:

>>> 2. We need config values for writing and consuming bloom filters, but also to override the default settings.

> I.e. we might later add some discovery step where we look at N
> commits at random, until we're satisfied that we've come up with some
> average/median number of total (recursive) tree entries & how many tend
> to be changed per commit.
>
> I.e. I can imagine repositories (with some automated changes) where we
> have 10k files and tend to change 1k per commit, or ones with 10k files
> where we tend to change just 1-10 per commit, which would mean a
> larger or smaller filter would be needed.
I'm not sure a dynamic approach would be worth the effort, but I'm open to hearing the results of an experiment.

Thanks,
-Stolee


