Re: [PATCH v4 09/15] commit-graph: write Bloom filters to commit graph file

SZEDER Gábor <szeder.dev@xxxxxxxxx> · Sun, 31 May 2020 19:23:49 +0200

On Fri, May 29, 2020 at 09:35:17AM -0400, Derrick Stolee wrote:
> >> +  Bloom Filter Index (ID: {'B', 'I', 'D', 'X'}) (N * 4 bytes) [Optional]
> >> +    * The ith entry, BIDX[i], stores the number of 8-byte word blocks in all
> > 
> > This is inconsistent with the implementation: according to the code in
> > one of the previous patches these entries are simple byte offsets, not
> > 8-byte word offsets, i.e. the combined size of all modified path
> > Bloom filters can be at most 2^32 bytes.
> 
> The documentation was fixed in 88093289cdc (Documentation: changed-path Bloom
> filters use byte words, 2020-05-11).

Oh, good.  I'm waaay behind the curve and haven't seen this fix.  Even
better, now I also noticed that two bugs I was about to report have
been fixed already (though both fixes have minor flaws).

Ok, so at least the specs are consistent with the implementation.  I'm
not sure this was done in the right direction, though, because too
small Bloom filters do hurt performance.

> > Clearly, using 4 byte index entries significantly lowers the max
> > number of commits that can be stored with modified path Bloom filters.
> 
> This is a good point, and certainly the reason for 8-byte multiples.

Note that Bloom filters with power-of-two number of bits have higher
false positive probabilities when using some form of double hashing.
When going for 8 byte blocks all commits modifying <= 12 paths
(assuming 7 hashes per path) will have power-of-2 sized Bloom filters
(64 or 128 bits), and that is a lot of commits.

> The incremental commit-graph can actually save us here

Oh, I haven't thought of that

> >> +      Bloom filters from commit 0 to commit i (inclusive) in lexicographic
> >> +      order. The Bloom filter for the i-th commit spans from BIDX[i-1] to
> >> +      BIDX[i] (plus header length), where BIDX[-1] is 0.
> >> +    * The BIDX chunk is ignored if the BDAT chunk is not present.
> >> +
> >> +  Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional]
> >> +    * It starts with header consisting of three unsigned 32-bit integers:
> >> +      - Version of the hash algorithm being used. We currently only support
> >> +	value 1 which corresponds to the 32-bit version of the murmur3 hash
> >> +	implemented exactly as described in
> >> +	https://en.wikipedia.org/wiki/MurmurHash#Algorithm and the double
> >> +	hashing technique using seed values 0x293ae76f and 0x7e646e2 as
> >> +	described in https://doi.org/10.1007/978-3-540-30494-4_26 "Bloom Filters
> >> +	in Probabilistic Verification"
> > 
> > How should double hashing compute the k hashes, i.e. using 64 bit or
> > 32 bit unsigned integer arithmetic?

Note that this should be clarified in the specs.

> >> +      - The number of times a path is hashed and hence the number of bit positions
> >> +	      that cumulatively determine whether a file is present in the commit.
> >> +      - The minimum number of bits 'b' per entry in the Bloom filter. If the filter
> >> +	      contains 'n' entries, then the filter size is the minimum number of 64-bit
> >> +	      words that contain n*b bits.
> > 
> > Since the ideal number of bits per element depends only on the number
> > of hashes per path (k / ln(2) ≈ k * 10 / 7), why is this value stored
> > in the commit-graph?
> 
> The ideal number depends also on what false-positive rate you want.

Well, yes, but indirectly:  according to Wikipedia :) the optimal
number of hashes per element depends only on the desired false
probability, and the optimal number of bits per element depends only
on the number of hashes per element.

So storing the min number of bits per entry seems to be redundant.

> In a
> hypothetical future where we want to allow customization here, we want
> the filters to be consistently sized across all filters.

Wouldn't customizing through the number of hashes be sufficient?

> >> +    * Note: Commits with no changes or more than 512 changes have Bloom filters
> >> +      of length zero.
> > 
> > What does this "Note:" prefix mean in the file format specification?
> > 
> > Can an implementation use a one byte Bloom filter with no bits set for
> > a commit with no changes?  Can an implementation still store a Bloom
> > filter for commits that modify more than 512 paths?
> 
> This is currently due to a hard-coded value in the implementation. It's not a
> requirement of the file format.

Should an implementation detail like that be part of the specs?  It
sure caused a bit of confusion here.