Hello,

Jakub Narębski <jnareb@xxxxxxxxx> writes:
> Abhishek Kumar <abhishekkumar8222@xxxxxxxxx> writes:
>> On Thu, Sep 03, 2020 at 03:42:43PM +0200, Jakub Narębski wrote:
>>> Abhishek Kumar <abhishekkumar8222@xxxxxxxxx> writes:
>>>> On Tue, Aug 25, 2020 at 02:18:24PM +0200, Jakub Narębski wrote:
>>>>> Abhishek Kumar <abhishekkumar8222@xxxxxxxxx> writes:
>>>>>
>>>>> ...
>>>>>
>>>>> However, as I wrote, handling GENERATION_NUMBER_V2_OFFSET_MAX is
>>>>> difficult. As far as I can see, we can choose one of the *three*
>>>>> solutions (the third one is _new_):
>>>>>
>>>>> a. store the 64-bit corrected commit date in the GDAT chunk;
>>>>>    all possible values can be stored, with no need for
>>>>>    GENERATION_NUMBER_V2_MAX,
>>>>>
>>>>> b. store the 32-bit corrected commit date offset in the GDAT chunk;
>>>>>    if its value is larger than GENERATION_NUMBER_V2_OFFSET_MAX,
>>>>>    do not write the GDAT chunk at all (as for backward compatibility
>>>>>    with mixed-version chains of split commit-graph layers),
>>>>>
>>>>> c. store the 32-bit corrected commit date offset in the GDAT chunk,
>>>>>    using some kind of overflow handling scheme; for example, if
>>>>>    the most significant bit of the 32-bit value is 1, then the
>>>>>    remaining 31 bits are a position in the GDOV chunk, which uses
>>>>>    64 bits to store those corrected commit date offsets that do
>>>>>    not fit in 32 bits.
>>>
>>> Note that I have posted a more detailed analysis of the advantages and
>>> disadvantages of each of the above solutions in response to 11/11:
>>> https://public-inbox.org/git/85o8mwb6nq.fsf@xxxxxxxxx/
>>>
>>> I can think of yet another solution, a variant of approach 'c' [...]
>>
>> The primary benefit of c') over c) seems to be the range of valid
>> offsets: c') can range over [0, 0xFFFFFFFF), whereas offsets for c)
>> can range between [0, 0x7FFFFFFF].
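[Editor's note: the MSB-flag scheme of variant c) can be sketched in a few
lines of C. Everything here -- the helper names, the constants, and the
in-memory GDOV array -- is illustrative only, not git's actual commit-graph
implementation:]

```c
#include <stdint.h>

/* Illustrative constants: the most significant bit of a 32-bit GDAT
 * entry marks "this value is a position in the GDOV chunk"; the low
 * 31 bits carry either the offset itself or that GDOV position.
 */
#define GDAT_OVERFLOW_BIT 0x80000000u
#define GDAT_OFFSET_MASK  0x7fffffffu

/* Encode a 64-bit corrected commit date offset as a 32-bit GDAT entry.
 * Offsets up to 0x7FFFFFFF are stored inline; larger ones are appended
 * to a (hypothetical) GDOV array of 64-bit values and referenced by
 * their position there.
 */
static uint32_t gdat_encode(uint64_t offset, uint64_t *gdov,
			    uint32_t *gdov_len)
{
	if (offset <= GDAT_OFFSET_MASK)
		return (uint32_t)offset;
	gdov[*gdov_len] = offset;
	return GDAT_OVERFLOW_BIT | (*gdov_len)++;
}

/* Decode a GDAT entry, following the overflow bit into GDOV if set. */
static uint64_t gdat_decode(uint32_t entry, const uint64_t *gdov)
{
	if (!(entry & GDAT_OVERFLOW_BIT))
		return entry;
	return gdov[entry & GDAT_OFFSET_MASK];
}
```

[This is the same pattern the EDGE chunk uses for octopus merges: a flag
bit plus a position into a secondary chunk.]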
>>
>> In other words, we should prefer c') over c) only if the offsets are
>> usually in the range [0x7FFFFFFF + 1, 0xFFFFFFFF).
>>
>> Commits whose corrected committer date offsets overflow are rare, and
>> offsets in that particular range are doubly rare. To fall within that
>> range, an offset would have to be between 68 and 136 years, which is
>> possible only if the corrupted timestamp is in the future (it has been
>> 50.68 years since Unix epoch 0 so far).
>
> Right, the c) variant has the same limitation as if corrected commit
> date offsets were stored as signed 32-bit integers (int32_t), so to
> overflow we would need a date past Y2k38 followed by a date at Unix
> epoch 0. Very unlikely.
>
>> Thinking back to the linux repository, the largest offset was of the
>> order of 2^25 (an offset of 1.06 years), and I would assume that holds.
>>
>> Overall, I don't think the added complexity (compared to the c)
>> approach) is made up for by the greater versatility.
>>
>> [1]: https://lore.kernel.org/git/20200703082842.GA28027@Abhishek-Arch/
>
> All right.
>
> With variant c) we have the additional advantage that we can pattern the
> code on the code for EDGE chunk handling, as you said.
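[Editor's note: a reader for such GDAT entries, patterned on the EDGE chunk
handling but with explicit bounds checking on the overflow chunk, might look
roughly like the sketch below. The names and layout are hypothetical, not
git's actual API; the point is that a missing or too-small GDOV chunk is
detected rather than read past:]

```c
#include <stddef.h>
#include <stdint.h>

/* Look up a corrected commit date offset for position `pos`.
 * `gdat` holds 32-bit entries; entries with the high bit set point
 * into `gdov`, an array of `gdov_count` 64-bit values.
 * Returns 0 on success, -1 if the GDOV reference is out of bounds
 * (e.g. the GDOV chunk is missing or too small).
 */
static int read_corrected_offset(const uint32_t *gdat,
				 const uint64_t *gdov, size_t gdov_count,
				 size_t pos, uint64_t *out)
{
	uint32_t entry = gdat[pos];
	uint32_t gdov_pos;

	if (!(entry & 0x80000000u)) {
		*out = entry;	/* offset fits in 31 bits, stored inline */
		return 0;
	}

	gdov_pos = entry & 0x7fffffffu;
	if (!gdov || gdov_pos >= gdov_count)
		return -1;	/* sanity check the EDGE code skips */

	*out = gdov[gdov_pos];
	return 0;
}
```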
>
> I wanted to warn about the need for sanity checking, like ensuring that
> we have a GDOV chunk and that it is large enough -- but it turns out
> that we skip this bounds checking for extra edges / the EDGE chunk:
>
> 	if (!(edge_value & GRAPH_EXTRA_EDGES_NEEDED)) {
> 		pptr = insert_parent_or_die(r, g, edge_value, pptr);
> 		return 1;
> 	}
>
> 	parent_data_ptr = (uint32_t*)(g->chunk_extra_edges +
> 			4 * (uint64_t)(edge_value & GRAPH_EDGE_LAST_MASK));
> 	do {
> 		edge_value = get_be32(parent_data_ptr);
> 		pptr = insert_parent_or_die(r, g,
> 					    edge_value & GRAPH_EDGE_LAST_MASK,
> 					    pptr);
> 		parent_data_ptr++;
> 	} while (!(edge_value & GRAPH_LAST_EDGE));

Both Taylor Blau and Junio C Hamano agree that it is better to store
corrected commit date offsets and have overflow handling, halving (that
is, reducing by 50%) the size of the GDAT chunk. I have not heard from
Derrick Stolee.

It looks then that this is the way to go; as I said, you have convinced
me that variant 'c' (EDGE-like) is the best solution for overflow
handling.

Best,
-- 
Jakub Narębski