Hello,

Abhishek Kumar <abhishekkumar8222@xxxxxxxxx> writes:
> On Thu, Sep 03, 2020 at 03:42:43PM +0200, Jakub Narębski wrote:
>> Abhishek Kumar <abhishekkumar8222@xxxxxxxxx> writes:
>>> On Tue, Aug 25, 2020 at 02:18:24PM +0200, Jakub Narębski wrote:
>>>> Abhishek Kumar <abhishekkumar8222@xxxxxxxxx> writes:
>>>>
>>>> ...
>>>>
>>>> However, as I wrote, handling GENERATION_NUMBER_V2_OFFSET_MAX is
>>>> difficult. As far as I can see, we can choose one of the *three*
>>>> solutions (the third one is _new_):
>>>>
>>>> a. store 64-bit corrected commit date in the GDAT chunk;
>>>>    all possible values are able to be stored, so there is no
>>>>    need for GENERATION_NUMBER_V2_MAX,
>>>>
>>>> b. store 32-bit corrected commit date offset in the GDAT chunk;
>>>>    if its value is larger than GENERATION_NUMBER_V2_OFFSET_MAX,
>>>>    do not write the GDAT chunk at all (as for backward compatibility
>>>>    with mixed-version chains of split commit-graph layers),
>>>>
>>>> c. store 32-bit corrected commit date offset in the GDAT chunk,
>>>>    using some kind of overflow handling scheme; for example, if
>>>>    the most significant bit of the 32-bit value is 1, then the
>>>>    remaining 31 bits are a position in the GDOV chunk, which uses
>>>>    64 bits to store those corrected commit date offsets that do
>>>>    not fit in 32 bits.
>>
>> Note that I have posted a more detailed analysis of the advantages and
>> disadvantages of each of the above solutions in response to 11/11:
>> https://public-inbox.org/git/85o8mwb6nq.fsf@xxxxxxxxx/
>>
>> I can think of yet another solution, a variant of approach 'c' with a
>> different overflow handling scheme:
>>
>> c'. Store 32-bit corrected commit date offset in the GDAT chunk,
>>     using the following overflow handling scheme: if the value
>>     is 0xFFFFFFFF (all bits set to 1, the maximum possible value
>>     for uint32_t), then the corrected commit date or corrected
>>     commit date offset can be found in the GDOV chunk (Generation
>>     Data OVerflow handling).
>>
>>     The GDOV chunk is composed of:
>>     - H bytes of commit OID, or 4 bytes (32 bits) of commit pos
>>     - 8 bytes (64 bits) of corrected commit date or its offset
>>
>>     Commits in the GDOV chunk are sorted; as we expect the number
>>     of commits that require GDOV to be zero or a very small number,
>>     there is no need for a GDO Fanout chunk.
>>
>>     - advantages:
>>       * commit-graph is smaller, increasing only for abnormal repos
>>       * overflow limit reduced only by 1 (a single value)
>>     - disadvantages:
>>       * most complex code of all proposed solutions,
>>         even more complicated than for solution 'c',
>>         and different from EDGE chunk handling
>>       * tests would be needed to exercise the overflow code
>>
>> Or we can split overflow handling into two chunks: GDOI (Generation Data
>> Overflow Index) and GDOV, where GDOI would be composed of H bytes of
>> commit OID or 4 bytes of commit-graph position (sorted), and GDOV would
>> be composed only of 8 bytes (64 bits) of corrected commit date data.
>>
>> This c'' variant has the same advantages and disadvantages as c'), with
>> a negligibly larger disk size and possibly slightly better performance
>> because of better data locality.
>>
>
> The primary benefit of c') over c) seems to be the range of valid
> offsets - offsets for c') can range over [0, 0xFFFFFFFF), whereas
> offsets for c) can range over [0, 0x7FFFFFFF].
>
> In other words, we should prefer c') over c) only if the offsets are
> usually in the range [0x7FFFFFFF + 1, 0xFFFFFFFF).
>
> Commits with overflowing corrected committer date offsets are rare, and
> offsets in that particular range are doubly rare. An offset in that
> range corresponds to 68 to 136 years, so that's possible only if the
> corrupted timestamp is in the future (it's been 50.68 years since Unix
> epoch 0 so far).
Right, the c) variant has the same limitation as if corrected commit date
offsets were stored as a signed 32-bit integer (int32_t), so to have an
overflow we would need a date past Y2k38 followed by a date at Unix
epoch 0. Very unlikely.

>
> Thinking back to the linux repository, the largest offset was of the
> order of 2^25 (an offset of 1.06 years) [1], and I would assume that
> holds.
>
> Overall, I don't think the added complexity (compared to the c)
> approach) is made up for by greater versatility.
>
> [1]: https://lore.kernel.org/git/20200703082842.GA28027@Abhishek-Arch/

All right. With variant c) we have the additional advantage that we can
pattern the code on the code for EDGE chunk handling, as you said.

I wanted to warn about the need for sanity checking, like ensuring that
we have the GDOV chunk and that it is large enough -- but it turns out
that we skip this bounds checking for extra edges / the EDGE chunk:

	if (!(edge_value & GRAPH_EXTRA_EDGES_NEEDED)) {
		pptr = insert_parent_or_die(r, g, edge_value, pptr);
		return 1;
	}

	parent_data_ptr = (uint32_t*)(g->chunk_extra_edges +
			  4 * (uint64_t)(edge_value & GRAPH_EDGE_LAST_MASK));
	do {
		edge_value = get_be32(parent_data_ptr);
		pptr = insert_parent_or_die(r, g,
					    edge_value & GRAPH_EDGE_LAST_MASK,
					    pptr);
		parent_data_ptr++;
	} while (!(edge_value & GRAPH_LAST_EDGE));

Best,
-- 
Jakub Narębski