Re: [PATCH 0/4] Bloom filter experiment

Junio C Hamano <gitster@xxxxxxxxx> · Tue, 16 Oct 2018 13:45:06 +0900

Derrick Stolee <stolee@xxxxxxxxx> writes:

> 2. The filters are sized according to the number of changes in each
> commit, with a minimum of one 64-bit word.
> ...
> 6. When we compute the Bloom filters, we don't store a filter for
> commits whose first-parent diff has more than 512 paths.

Just curious: was 512 taken out of thin air, or is there some math
behind it, e.g. to keep the false positive rate below a certain
threshold?  With a wide enough bitset, you could store an
arbitrarily large number of paths with a low enough false positive
rate, I guess, but is there a point where a change touches so many
paths that the filter gives diminishing returns and is not worth
having in the first place?
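
For reference, the trade-off being asked about can be sketched with the
textbook Bloom filter approximations.  This is only an illustration:
the function names, the 1% target rate and the choice of an optimal
number of hash functions are assumptions, not values taken from the
patch series.

```python
import math

def false_positive_rate(n_paths, m_bits, k_hashes):
    # Textbook approximation: p = (1 - exp(-k*n/m))**k
    return (1.0 - math.exp(-k_hashes * n_paths / m_bits)) ** k_hashes

def bits_needed(n_paths, target_p):
    # Bits required to hold n entries at rate target_p, assuming an
    # optimal number of hash functions: m = -n * ln(p) / (ln 2)**2
    return math.ceil(-n_paths * math.log(target_p) / (math.log(2) ** 2))

def words_needed(n_paths, target_p):
    # Round up to whole 64-bit words, with a minimum of one word.
    return max(1, math.ceil(bits_needed(n_paths, target_p) / 64))

if __name__ == "__main__":
    # Hypothetical 1% target; the patch does not state its target rate.
    for n in (1, 8, 64, 512, 4096):
        print(n, bits_needed(n, 0.01), words_needed(n, 0.01))
```

Under these assumptions the cost per path is constant (a little under
10 bits at 1%), so the filter grows linearly with the number of changed
paths.  A cutoff such as 512 therefore caps the storage per commit
rather than following from the false positive rate alone.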

In a normal source-code-control context, the set of paths modified
by any single commit ought to be a small subset of all the paths in
the tree, and whole-tree changes ought to be fairly rare.  In a
project for which that assumption does not hold, it might help to
have a negative Bloom filter (i.e. "git log -- A" asks "does the
commit modify A?" and the filter would say "we know it does not,
because we threw all the paths that are not touched into the Bloom
filter"), but I think that would optimize for the wrong case.
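
To make the query concrete, here is a minimal sketch of an ordinary
(positive) changed-path filter.  The class name, the SHA-1 based
hashing and the default of 7 hash functions are all hypothetical
choices for illustration and are not the hashing used by the patches.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter over path names (illustrative only)."""

    def __init__(self, n_words, k_hashes=7):
        # Sized in 64-bit words, with a minimum of one word.
        self.n_bits = 64 * max(1, n_words)
        self.k = k_hashes
        self.bits = 0  # a Python int used as the bitset

    def _positions(self, path):
        # Derive k bit positions by salting the path with the hash index.
        for i in range(self.k):
            digest = hashlib.sha1(f"{i}:{path}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, path):
        for pos in self._positions(path):
            self.bits |= 1 << pos

    def might_contain(self, path):
        # False means "definitely not added"; True means "maybe added".
        return all((self.bits >> pos) & 1 for pos in self._positions(path))

def commit_may_touch(changed_paths_filter, path):
    # "git log -- path" can skip the tree diff when this returns False.
    return changed_paths_filter.might_contain(path)
```

The filter is populated from the paths a commit changes, so its size
tracks the size of the change.  A negative variant would add the
untouched paths instead, and its size would scale with the rest of the
tree, which is the larger set in the normal case.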