Re: [PATCH 4/9] commit-graph: document bloom filter format

Jakub Narebski <jnareb@xxxxxxxxx> · Tue, 07 Jan 2020 15:46:49 +0100

"Garima Singh via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:

> From: Garima Singh <garima.singh@xxxxxxxxxxxxx>
>
> Update the technical documentation for commit-graph-format with BIDX
> and BDAT chunk information.
>
> RFC Notes:
> 1. [Call for advice] We specifically mention that we are using Bloom
>    filters in this technical document. Should this document also be
>    made open to other data structures in the future, with versioning
>    information?

I'm not sure.  In theory we might want to switch to another
probabilistic set inclusion query structure, like xor filters or cuckoo
hashing.

On one hand side we could use separate chunks (e.g. XIDX, XDAT for xor
filters), on the other hand we need only one such structure.  On the
gripping hand this can be left for the future, if needed.

Sidenote: using Bloom filters is somewhat encoded in the name of chunk
(B from Bloom filter).  I don't have a better poposal for 4-char name
(XIDX / XDAT for cXange?  CHDX / CHDT for CHange?  FIDX / FDAT for
changed Files?... I don't know).

>
> 2. [Call for advice] We are also not describing the explicit nature
>    of how we store the bloom filter binary data. Would it be useful
>    to document details about the hash algorithm, the number of hashes
>    and the specific seed values we are using in a separate document,
>    or perhaps in a separate section in this document?

I think it would be best to keep description of the commit graph format
concise.  The details about Bloom filter implementation would be better
put in Documentation/technical/commit-graph.txt in my opinion, together
with reasoning behind it (perhaps borrowing from Derrick Stolee blog
post).

This could be done as a separate patch.

>
> Helped-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx>
> Signed-off-by: Garima Singh <garima.singh@xxxxxxxxxxxxx>
> ---
>  Documentation/technical/commit-graph-format.txt | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
>
> diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt
> index a4f17441ae..6497f19f08 100644
> --- a/Documentation/technical/commit-graph-format.txt
> +++ b/Documentation/technical/commit-graph-format.txt
> @@ -17,6 +17,9 @@ metadata, including:
>  - The parents of the commit, stored using positional references within
>    the graph file.
>  
> +- The bloom filter of the commit carrying the paths that were changed between
> +  the commit and it's first parent.

s/bloom/Bloom/ and s/it's/its/

I am not sure about exact wording, but I could at this time think of a
better but concise way of stating it.

> +
>  These positional references are stored as unsigned 32-bit integers
>  corresponding to the array position within the list of commit OIDs. Due
>  to some special constants we use to track parents, we can store at most
> @@ -93,6 +96,20 @@ CHUNK DATA:
>        positions for the parents until reaching a value with the most-significant
>        bit on. The other bits correspond to the position of the last parent.
>  
> +  Bloom Filter Index (ID: {'B', 'I', 'D', 'X'}) [Optional]
> +      For each commit we store the offset of its bloom filter in the BDAT chunk
> +      as follows:
> +      BIDX[i] = number of 8-byte words in all the bloom filters from commit 0 to
> +		commit i (inclusive)

I think it would be better for consistency and ease of reading to follow
the example of OID Fanout (OIDF) chunk description:

 +  Bloom Filter Index (ID: {'B', 'I', 'D', 'X'}) (N * 4 bytes) [Optional]
 +      The ith entry, BIDX[i], stores the number of 8-byte word blocks
 +      in all Bloom filters from commit 0 up to commit i (inclusive)
 +      in lexicographical order.

Maybe even add the following to make implementing it easier:

 +      Data for Bloom filter for i-th commit spans from BIDX[i-1] to
 +      BIDX[i] (plus header length), where we take BIDX[-1] to be 0.

Is it possible for (BIDX[i] - BIDX[i-1]) to be zero (no Bloom filter),
for example for commits with more than 512 changes?  Or is this case
handled by 1 8-byte word Bloom filter of all bits sets to '1', i.e.
0xffffffffffffffff?

How the case of too many changes is distingushed from the case of no
changes (`git commit --allow-empty`, or `git merge --ours`)?  Is the
case of no changes uninteresting, i.e. Bloom filter consisting of zero,
that is with all bits set to '0'?

> +
> +  Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional]
> +      * It starts with three 32 bit integers for the

I would say "It starts with the header consisting of three unsigned
32-bit integers:" (but current version is not bad).

I wonder if this metadata should perhaps be put in a separate chunk,
BMET... (Bloom filter METadata).

> +	    - version of the hash algorithm being used

This number not only encodes that the base hash algorithm being used is
32-bit Murmur3 hash, but that 'k' hashes used in the computation are
created out of Murmur3 hash using double hashing technique, and
specifies two specific seed values for this double hashing technique.

[Maybe we should store those two seed values here too?]

It might be important to say that the currently supported version is
'1', and if Git encounters unknown hashing algorithm version it should
not use Bloom filter data.

Unless we store encoded _name_ of the hash algorithm, e.g. bytes
'm','u','r','3' for MurmurHash3_32... though it is about more than
a base hash.

Do we need whole 4 bytes for hash version, or is it for ease of use and
alignment?

> +	    - the number of hashes used in the computation

All right.  Perhaps we should test in the future patches that the value
different from the default of 7 would also work.

Also 8-bits / 1 byte for number of hashes (hash functions) should be
enough: as I have written in prevous reply there is no need for k > 32.

> +	    - the number of bits per entry

This is important for construction of Bloom filter, but I think it is
not necessary to use it -- so it may not be necessary to store it.

Would also fit in a single byte: we don't need exceedingly low false
positive probability.

We could use it to estimate the false positive probability, and...

> +	  * The rest of the chunk is the concatenation of all the computed bloom 
> +	  filters for the commits in lexicographic order.

  +	 * The rest of the chunk is the concatenation of all the computed Bloom 
  +	   filters for the commits in lexicographic order.

It would be, I think, a good idea to make it explicit that BDAT is
present iff BIDX is present (iff == if and only if), i.e. that either
both or neither of those chunks should be present.

> +
>    Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional]
>        This list of H-byte hashes describe a set of B commit-graph files that
>        form a commit-graph chain. The graph position for the ith commit in this

Best,
-- 
Jakub Narębski