"Garima Singh via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes: > From: Garima Singh <garima.singh@xxxxxxxxxxxxx> > > Update the technical documentation for commit-graph-format with BIDX > and BDAT chunk information. > > RFC Notes: > 1. [Call for advice] We specifically mention that we are using Bloom > filters in this technical document. Should this document also be > made open to other data structures in the future, with versioning > information? I'm not sure. In theory we might want to switch to another probabilistic set inclusion query structure, like xor filters or cuckoo hashing. On one hand side we could use separate chunks (e.g. XIDX, XDAT for xor filters), on the other hand we need only one such structure. On the gripping hand this can be left for the future, if needed. Sidenote: using Bloom filters is somewhat encoded in the name of chunk (B from Bloom filter). I don't have a better poposal for 4-char name (XIDX / XDAT for cXange? CHDX / CHDT for CHange? FIDX / FDAT for changed Files?... I don't know). > > 2. [Call for advice] We are also not describing the explicit nature > of how we store the bloom filter binary data. Would it be useful > to document details about the hash algorithm, the number of hashes > and the specific seed values we are using in a separate document, > or perhaps in a separate section in this document? I think it would be best to keep description of the commit graph format concise. The details about Bloom filter implementation would be better put in Documentation/technical/commit-graph.txt in my opinion, together with reasoning behind it (perhaps borrowing from Derrick Stolee blog post). This could be done as a separate patch. > > Helped-by: Derrick Stolee <dstolee@xxxxxxxxxxxxx> > Signed-off-by: Garima Singh <garima.singh@xxxxxxxxxxxxx> > --- > Documentation/technical/commit-graph-format.txt | 17 +++++++++++++++++ > 1 file changed, 17 insertions(+) > > diff --git a/Documentation/technical/commit-graph-format.txt b/Documentation/technical/commit-graph-format.txt > index a4f17441ae..6497f19f08 100644 > --- a/Documentation/technical/commit-graph-format.txt > +++ b/Documentation/technical/commit-graph-format.txt > @@ -17,6 +17,9 @@ metadata, including: > - The parents of the commit, stored using positional references within > the graph file. > > +- The bloom filter of the commit carrying the paths that were changed between > + the commit and it's first parent. s/bloom/Bloom/ and s/it's/its/ I am not sure about exact wording, but I could at this time think of a better but concise way of stating it. > + > These positional references are stored as unsigned 32-bit integers > corresponding to the array position within the list of commit OIDs. Due > to some special constants we use to track parents, we can store at most > @@ -93,6 +96,20 @@ CHUNK DATA: > positions for the parents until reaching a value with the most-significant > bit on. The other bits correspond to the position of the last parent. > > + Bloom Filter Index (ID: {'B', 'I', 'D', 'X'}) [Optional] > + For each commit we store the offset of its bloom filter in the BDAT chunk > + as follows: > + BIDX[i] = number of 8-byte words in all the bloom filters from commit 0 to > + commit i (inclusive) I think it would be better for consistency and ease of reading to follow the example of OID Fanout (OIDF) chunk description: + Bloom Filter Index (ID: {'B', 'I', 'D', 'X'}) (N * 4 bytes) [Optional] + The ith entry, BIDX[i], stores the number of 8-byte word blocks + in all Bloom filters from commit 0 up to commit i (inclusive) + in lexicographical order. Maybe even add the following to make implementing it easier: + Data for Bloom filter for i-th commit spans from BIDX[i-1] to + BIDX[i] (plus header length), where we take BIDX[-1] to be 0. Is it possible for (BIDX[i] - BIDX[i-1]) to be zero (no Bloom filter), for example for commits with more than 512 changes? Or is this case handled by 1 8-byte word Bloom filter of all bits sets to '1', i.e. 0xffffffffffffffff? How the case of too many changes is distingushed from the case of no changes (`git commit --allow-empty`, or `git merge --ours`)? Is the case of no changes uninteresting, i.e. Bloom filter consisting of zero, that is with all bits set to '0'? > + > + Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional] > + * It starts with three 32 bit integers for the I would say "It starts with the header consisting of three unsigned 32-bit integers:" (but current version is not bad). I wonder if this metadata should perhaps be put in a separate chunk, BMET... (Bloom filter METadata). > + - version of the hash algorithm being used This number not only encodes that the base hash algorithm being used is 32-bit Murmur3 hash, but that 'k' hashes used in the computation are created out of Murmur3 hash using double hashing technique, and specifies two specific seed values for this double hashing technique. [Maybe we should store those two seed values here too?] It might be important to say that the currently supported version is '1', and if Git encounters unknown hashing algorithm version it should not use Bloom filter data. Unless we store encoded _name_ of the hash algorithm, e.g. bytes 'm','u','r','3' for MurmurHash3_32... though it is about more than a base hash. Do we need whole 4 bytes for hash version, or is it for ease of use and alignment? > + - the number of hashes used in the computation All right. Perhaps we should test in the future patches that the value different from the default of 7 would also work. Also 8-bits / 1 byte for number of hashes (hash functions) should be enough: as I have written in prevous reply there is no need for k > 32. > + - the number of bits per entry This is important for construction of Bloom filter, but I think it is not necessary to use it -- so it may not be necessary to store it. Would also fit in a single byte: we don't need exceedingly low false positive probability. We could use it to estimate the false positive probability, and... > + * The rest of the chunk is the concatenation of all the computed bloom > + filters for the commits in lexicographic order. + * The rest of the chunk is the concatenation of all the computed Bloom + filters for the commits in lexicographic order. It would be, I think, a good idea to make it explicit that BDAT is present iff BIDX is present (iff == if and only if), i.e. that either both or neither of those chunks should be present. > + > Base Graphs List (ID: {'B', 'A', 'S', 'E'}) [Optional] > This list of H-byte hashes describe a set of B commit-graph files that > form a commit-graph chain. The graph position for the ith commit in this Best, -- Jakub Narębski