Re: [PATCH v3 11/11] doc: add corrected commit date info

jnareb@xxxxxxxxx (Jakub Narębski) · Thu, 27 Aug 2020 14:43:53 +0200

Hello,

Abhishek Kumar <abhishekkumar8222@xxxxxxxxx> writes:
> On Sun, Aug 23, 2020 at 12:20:57AM +0200, Jakub Narębski wrote:
>> "Abhishek Kumar via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:
[...]

>>> +    * This list of 4-byte values store corrected commit date offsets for the
>>> +      commits, arranged in the same order as commit data chunk.
>>
>> I have just realized purely theoretical, but possible, problem with
>> storing non-monotinic generation number related values like corrected
>> commit date offset in constrained space.  There are problems with
>> clamping them.
>>
>> Say that somewhere in the ancestry chain there is a commit A with commit
>> date far in the future by mistake, for example 2120-08-22; it is
>> important for that date to be not able to be represented using uint32_t.
>> Say that a later descendant commit B is malformed, and has committer
>> date of 0, that is 1970-01-01. This means that the corrected commit date
>> for B must be larger than 2120-08-22 - which for this commit means that
>> corrected commit date offset do not fit in 32 bits, and must be clamped
>> (replaced) with GENERATION_NUMBER_V2_OFFSET_MAX.
>>
>> Say that we have commit C that is child of B, and it has correct commit
>> date.  Because of mistake in commit A, it has corrected commit date of
>> more than 2120-08-22 (corrected commit date degenerated into topological
>> level plus constant).
>>
>> Now C can reach B, and B can reach A.  However, if we recover corrected
>> commit date of B out of its date=0 and offset=GENERATION_NUMBER_V2_OFFSET_MAX
>> we get a number that is smaller than correct corrected commit date.  We
>> will have
>>
>>    gen(A) > date(B) + offset(B) < gen(C)
>>
>> Which breaks reachability condition guarantee.
>>
>> If instead we use GENERATION_NUMBER_V2_MAX for commits with clamped
>> corrected commit date, that is offset=GENERATION_NUMBER_V2_OFFSET_MAX,
>> we would get
>>
>>   gen(A) < GENERATION_NUMBER_V2_MAX > gen(C)
>>
>> And again reachability condition is broken.
>>
>> This is a very contrived but possible example.  This shouldn't happen,
>> but ufortunately it can happen.
>>
>
> Yes, that's very unfortunate.
>
> Here's a much simpler example:
>
> A commit P has an reasonable commit date (i.e. after release of Git to
> present) D and has a child commit C with committer date 0. Now, the
> corrected commiter date of C would D + 1 and the offset would be same too,
> as the committer date is zero. This overflows as reasonable dates are of
> the order 2 ^ 34.

No, we need the value of date D that doesn't fit in 2^32 _unsigned_ value,
so it needs to be even more in the future than Y2k38 (2038-01-19 03:14:07),
which is related to storing date as a _signed_ 32-bit integer

The current-ish Unix epoch time is 1598524281 - let's use it for value
of D.  Then the offset for commit C would be 1598524282.  The current
proposal uses 32 bits to store commit date offsets (as unsigned value).
The maximum value of offset that we can store is therefore 2^32 - 1,
which is 4294967295.

   corrected commit date offset(C) = 1,598,524,282
   GENERATION_NUMBER_V2_MAX        = 4,294,967,295

As you can see there is no overflow in the simplified example.

>>
>> The question is how to deal with this issue.  Ignore it as unlikely?
>> Switch to storing corrected commit date, which is monotonic, so if there
>> is commit with GENERATION_NUMBER_V2_MAX, then subsequent descendant
>> commits will also have GENERATION_NUMBER_V2_MAX -- and pay with up to 7%
>> larger commit-graph file?
>>
>
> To be honest, I would prefer storing corrected committer dates over
> storing offsets.
>
> While it is 7% of the size of commit-graph file, it is also *only* around
> ~3.5 MB for a repository of the size of linux kernel (and IIRC
> correctly, the Windows repo has ~2M commits, it amounts to ~8 MB).

It is up to 7% of per-commit data, and it doesn't take into account EDGE
chunk (for octopus merges), and it doesn't also take into account the
size of changed-paths Bloom filters data take in the commit-graph.

> Minimizing space and memory requirements are a top priority, but
> shouldn't making sure our program is correct and efficient to be a
> greater priority?

On the other hand the case where we would encounter offsets that do not
fit in uint32_t is extremply unlikely in sane repositories.

I can think of three solutions:

1. use 64-bit corrected commit dates
   - advantages:
     * simplest code,
     * no need for overflow handling, as we can store all possible values
       of timestamp_t
   - disadvantages:
     * commit-graph size increased by up to 7%

2. use 32-bit corrected commit date offsets,
   but simply do not store GDAT chunk if there is offset that would not
   fit in 32-bit wide field
   - advantages:
     * commit-graph is smaller
     * relatively simple overflow handling
   - disadvantages:
     * performance penalty (generation number v1 vs v2) for abnormal
       repositories (with overflow not fitting in uint32_t)
     * tests would be needed to exercise the overflow code

3. use 32-bit for corrected commit date offset,
   with oveflow handling, for example using most significant bit
   to denote that other bits store position into offset overflow
   with 64-bits for those offsets that do not fit in 31-bits
   - advantages:
     * commit-graph is smaller, increasing for abnormal repos
   - disadvantages:
     * most complex code of all proposed solutions
     * smaller overflow limit of 2^31 - 1
     * tests would be needed to exercise the overflow code

I think because the situation where we encounter overflow in 32-bit
corrected commit date offset is rare, we should go with either 1 or 2
solution.

> I would love to hear your and Dr. Stolee's opinions on this.

I have CC-ed Junio C Hamano to ask for his opinion.

>>> +    * This list can be later modified to store future generation number related
>>> +      data.
>>
>> How can it be later modified?  There is no header, no version number.
>> How would we add another generation number data?
>>
>
> We could modify the graph version in future. Here's how I think it would
> work:
>
> Graph Version 1, No GDAT -> Topological level
> Graph Version 2, GDAT    -> Corrected committer dates
> Graph Version 3, GDAT    -> Generation number v3
>
> and so on.
>
> Of course, we do not have to update generation number definition for
> each graph version.

So it was about generic mechanism, not something specific to the GDAT chunk.

> However, my statement could still be wrong for things that we do not
> foresee (similar to how we missed the hard die on different graph version),
> so I am removing the statement.

Good.

[...]
>>> +We also merge commit-graph chains when we try to write a commit graph with
>>> +two different generation number definitions as they cannot be compared directly.
>>> +We overwrite the existing chain and create a commit-graph with the newer or more
>>> +efficient defintion. For example, overwriting topological levels commit graph
>>> +chain to create a corrected commit dates commit graph chain.
>>> +
>>
>> This is more complicated than that.
>>
>> I think we should explicitly state that Git ensures that in split
>> commit-graph chain, if there are layers without the GDAT chunk (that
>> force Git to use topological levels for generation numbers), then they
>> are top layers.  So if there is commit-graph file created by "Old" Git,
>> then when addig new layer it would also be GDAT-less.
>>
>> Now how to write this...
>
> Thinking about this, I feel creating a new section called "Handling
> Mixed Generation Number Chains" made more sense:
>
>   ## Handling Mixed Generation Number Chains
>
>   With the introduction of generation number v2 and generation data chunk,
>   the following scenario is possible:
>
>   1. "New" Git writes a commit-graph with a GDAT chunk.
>   2. "Old" Git writes a split commit-graph on top without a GDAT chunk.
>
>   The commits in the lower layer will be interpreted as having very large
>   generation values (commit date plus offset) compared to the generation
>   numbers in the top layer (toplogical level). This violates the
>   expectation that the generation of a parent is strictly smaller than the
>   generation of a child. In such cases, we revert to using topological
>   levels for all layers to maintain backwards compatability.
>
>   When writing a new layer in split commit-graph, we write a GDAT chunk
>   only if the topmost layer has a GDAT chunk. This guarantees that if a
>   lyer has GDAT chunk, all lower layers must have a GDAT chunk as well.
>
>   Rewriting layers follows similar approach: if the topmost layer below
>   set of layers being rewriteen (in the split commit-graph chain) exists,
>   and it does not contain GDAT chunk, then the result of rewrite does not
>   have GDAT chunks either.

Good idea, and nice writeup.

Best,
-- 
Jakub Narębski