Re: Making split commit graphs pick up new options (namely --changed-paths)

Derrick Stolee <stolee@xxxxxxxxx> · Thu, 10 Jun 2021 14:21:16 -0400

On 6/10/2021 1:22 PM, Taylor Blau wrote:
> On Thu, Jun 10, 2021 at 12:40:33PM +0200, Ævar Arnfjörð Bjarmason wrote:
...
>> Reading the code there seems to be no way to do that, and we have the
>> "chunk_bloom_data" in the graph, as well as "bloom_filter_settings".
>>
>> I'd expect some way to combine the "max_new_filters" and --split with
>> some eventual-consistency logic so that graphs not matching our current
>> settings are replaced, or replaced some <limit> at a time.
> 
> This is asking about something slightly different, Bloom filter
> settings rather than the existence of chagned-path Bloom filters
> themselves. The Bloom settings aren't written to the commit-graph
> although there has been some discussion about doing this in the past.

Some of the settings are included, but not the "maximum size" of a
filter. Thus, if that maximum size changes we do not have a way to
determine if a missing filter is larger than that limit or not.
Further, the existing filters might be larger than the new maximum
which means we would need to check if some of those filters should
be dropped.

Here is the spec of the BDAT chunk:

Bloom Filter Data (ID: {'B', 'D', 'A', 'T'}) [Optional]
    * It starts with header consisting of three unsigned 32-bit integers:
      - Version of the hash algorithm being used. We currently only support
	value 1 which corresponds to the 32-bit version of the murmur3 hash
	implemented exactly as described in
	https://en.wikipedia.org/wiki/MurmurHash#Algorithm and the double
	hashing technique using seed values 0x293ae76f and 0x7e646e2 as
	described in https://doi.org/10.1007/978-3-540-30494-4_26 "Bloom Filters
	in Probabilistic Verification"
      - The number of times a path is hashed and hence the number of bit positions
	      that cumulatively determine whether a file is present in the commit.
      - The minimum number of bits 'b' per entry in the Bloom filter. If the filter
	      contains 'n' entries, then the filter size is the minimum number of 64-bit
	      words that contain n*b bits.
    * The rest of the chunk is the concatenation of all the computed Bloom
      filters for the commits in lexicographic order.

> If we ever did encode the Bloom settings, I imagine that accomplishing a
> sort of "eventually replace all changed-path Bloom filters with these
> new settings" would be as simple as considering all filters computed
> under different settings to be "uncomputed".
> 
>> Also, am I reading the expire_commit_graphs() logic correctly that we
>> first write the split graph, and then unlink() things that are too old?
>> I.e. if you rely on the commit-graph to optimize things this will make
>> things slower until the next run of writing the graph?
> 
> Before expire_commit_graphs() is called, we call mark_commit_graphs()
> which freshens the mtimes of all surviving commit-graph layers, and then
> expire_commit_graphs() removes the stale layers. I'm not sure what
> things getting slower is referring to since the resulting commit-graph
> has at least as many commits as the commit-graph that existed prior to
> the write.
> 
>> I expected to find something more gentle there [...]
> 
> FWIW, I also find this "expire based on mtimes" thing a little odd for
> writing split commit-graphs because we know exactly which layers we want
> to get rid of. I suspect that the reuse comes from wanting to unify the
> logic for handling '--expire-time' with the expiration that happens
> after writing a split commit-graph that merged two or more previous
> layers.
> 
> I would probably change mark_commit_graphs() to remove those merged
> layers explicitly (but still run expire_commit_graphs() to handle
> --expire-time). But, come to think of it... if merging >2 layers already
> causes the merged layers to be removed, then why would you ever set an
> --expire-time yourself?

The --expire-time is intended to leave the old layers around a while
so concurrent processes who already parsed the commit-graph-chain file
can discover the layers it referenced. It's not a perfect mechanism, so
there is room for improvement here.

Thanks,
-Stolee