On 5/22/2019 8:43 PM, Ævar Arnfjörð Bjarmason wrote:
>
> On Wed, May 22 2019, Derrick Stolee via GitGitGadget wrote:
>
>> To keep lookups fast, but also keep most incremental writes fast, create
>> a strategy for merging levels of the commit-graph chain. The strategy is
>> detailed in the commit-graph design document, but is summarized by these
>> two conditions:
>>
>>   1. If the number of commits we are adding is more than half the number
>>      of commits in the graph below, then merge with that graph.
>>
>>   2. If we are writing more than 64,000 commits into a single graph,
>>      then merge with all lower graphs.
>>
>> The numeric values in the conditions above are currently constant, but
>> can become config options in a future update.
>> [...]
>> +## Merge Strategy
>> +
>> +When writing a set of commits that do not exist in the commit-graph stack of
>> +height N, we default to creating a new file at level N + 1. We then decide to
>> +merge with the Nth level if one of two conditions holds:
>> +
>> +  1. The expected file size for level N + 1 is at least half the file size
>> +     for level N.
>> +
>> +  2. Level N + 1 contains more than MAX_SPLIT_COMMITS commits (64,000
>> +     commits).
>> +
>> +This decision cascades down the levels: when we merge a level we create a new
>> +set of commits that then compares to the next level.
>> +
>> +The first condition bounds the number of levels to be logarithmic in the total
>> +number of commits. The second condition bounds the total number of commits in
>> +a `graph-{hashN}` file and not in the `commit-graph` file, preventing
>> +significant performance issues when the stack merges and another process only
>> +partially reads the previous stack.
>> +
>> +The merge strategy values (2 for the size multiple, 64,000 for the maximum
>> +number of commits) could be extracted into config settings for full
>> +flexibility.
>
> As noted this can become configurable, so it's no big deal. But is there
> any reason for this 64K limit anymore?
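For illustration, the cascading merge decision quoted above could be
sketched roughly as below. This is not the actual implementation (that
lives in commit-graph.c); the names (`levels_to_merge`, `SIZE_MULTIPLE`,
`MAX_COMMITS`) are hypothetical, and commit counts stand in for the
expected file sizes the design doc mentions:

```c
#include <assert.h>
#include <stddef.h>

#define SIZE_MULTIPLE 2	/* condition 1: new file >= half the level below */
#define MAX_COMMITS 64000	/* condition 2: cap on a non-base graph file */

/*
 * level_commits[0] is the base graph, level_commits[n - 1] the top of
 * the stack. num_new is the number of commits about to be written as a
 * new top level. Returns how many existing levels get absorbed into the
 * new file; the decision cascades downward as the merged set grows.
 */
static size_t levels_to_merge(const size_t *level_commits, size_t n,
			      size_t num_new)
{
	size_t merged = num_new;
	size_t i = n;

	while (i > 0) {
		size_t below = level_commits[i - 1];

		if (SIZE_MULTIPLE * merged >= below || merged > MAX_COMMITS) {
			merged += below;	/* absorb the level below */
			i--;
		} else {
			break;	/* both conditions fail: stop cascading */
		}
	}

	return n - i;
}
```

Once `merged` exceeds `MAX_COMMITS`, the condition stays true at every
remaining iteration, which is how the second rule ends up merging with
all lower graphs.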
There may not be an important reason to include it by default. Whatever
config option we use could have special values:

  * -1: No maximum commit limit (default)
  *  0: Never write more than one level.

I would personally set a limit somewhere around 64,000 to prevent long
chains and to keep some processes from hitting a concurrency issue where
their run-time commit-graph is effectively 100,000 commits behind.

> While with the default expiry of 0sec we can still get that race, it
> seems unlikely in practice, as the "commit-graph write" process would
> write a new manifest at the end, then go and unlink() the old files.

The process reading an old manifest can still only succeed partially, as
it reads the chain from bottom to top. Succeeding gracefully in this
case is built in, but should have a test! (noted)

But I don't think the concurrency window is any more lenient than
before, and we now have the flexibility to set a non-zero expiry window.
That was impossible in the other model.

> So maybe at this point we could make this even dumber with something
> that behaves like gc.autoPackLimit? I.e. keep writing new graphs, and
> then coalesce them all (or maybe not the "base" graph, like
> gc.bigPackThreshold)?

I will look into these existing settings and try to include similar
options here. Better to stay consistent when we can.

> Also: These docs refer to MAX_SPLIT_COMMITS, but in v2 it's now a
> "split_strategy_max_commits" variable instead.

Thanks for catching my out-of-date documentation.

-Stolee