Re: [PATCH v3 05/11] commit-graph: return 64-bit generation number

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hi,

Abhishek Kumar <abhishekkumar8222@xxxxxxxxx> writes:
> On Tue, Aug 25, 2020 at 02:18:24PM +0200, Jakub Narębski wrote:
>> Abhishek Kumar <abhishekkumar8222@xxxxxxxxx> writes:
>> 
>> ...
>> 
>> However, as I wrote, handling GENERATION_NUMBER_V2_OFFSET_MAX is
>> difficult.  As far as I can see, we can choose one of the *three*
>> solutions (the third one is _new_):
>> 
>> a. store 64-bit corrected commit date in the GDAT chunk
>>    all possible values are able to be stored, no need for
>>    GENERATION_NUMBER_V2_MAX,
>> 
>> b. store 32-bit corrected commit date offset in the GDAT chunk,
>>    if its value is larger than GENERATION_NUMBER_V2_OFFSET_MAX,
>>    do not write GDAT chunk at all (like for backward compatibility
>>    with mixed-version chains of split commit-graph layers),
>> 
>> c. store 32-bit corrected commit date offset in the GDAT chunk,
>>    using some kind of overflow handling scheme; for example if
>>    the most significant bit of 32-bit value is 1, then the
>>    rest 31-bits are position in GDOV chunk, which uses 64-bit
>>    to store those corrected commit date offsets that do not
>>    fit in 32 bits.

Note that I have posted more detailed analysis of advantages and
disadvantages of each of the above solutions in response to 11/11
https://public-inbox.org/git/85o8mwb6nq.fsf@xxxxxxxxx/

I can think of yet another solution, a variant of approach 'c' with
different overflow handling scheme:

c'.  Store 32-bit corrected commit date offset in the GDAT chunk,
     using the following overflow handling scheme: if the value
     is 0xFFFFFFFF (all bits set to 1, the maximum possible value
     for uint32_t), then the corrected commit date or corrected
     commit date offset can be found in GDOV chunk (Generation
     Data OVerflow handling).

     The GDOV chunk is composed of:
     - H bytes of commit OID, or 4 bytes (32 bits) of commit pos
     - 8 bytes (64 bits) of corrected commit date or its offset
     
     Commits in GDOV chunk are sorted; as we expect for the number
     of commits that require GDOV to be zero or a very small number
     there is no need for GDO Fanout chunk.
     
   - advantages:
     * commit-graph is smaller, increasing for abnormal repos
     * overflow limit reduced only by 1 (a single value)
   - disadvantages:
     * most complex code of all proposed solutions
       even more complicated than for solution 'c',
       different from EDGE chunk handling
     * tests would be needed to exercise the overflow code

Or we can split overflow handling into two chunks: GDOI (Generation Data
Overflow Index) and GDOV, where GDOI would be composed of H bytes of
commit OID or 4 bytes of commit graph position (sorted), and GDOV would
be composed oly of 8 bytes (64 bits) of corrected commit date data.

This c'') variant has the same advantages and disadvantages as c'), with
negligibly slightly larger disk size and possibly slightly better
performance because of better data locality.

>
> Alright, so the third solution leverages the fact that in practice,
> very few offsets would overflow the 32-bit limit. Using 64-bits for all
> offsets would be wasteful, we can trade off a miniscule amount of
> computation to save large amounts of disk space.

On the other hand we can say that we can trade negligible increase of
commit-graph disk space size (less than 7% in worst case: no octopus
merges, no changed-path Bloom filter data, using SHA-1 for object ids,
large repository so header size + OIFD is negligible) for simpler code
with no need for overflow handling at all (and a minuscule amount of
less computations).

>>
>> This type of schema is used in other places in Git code, if I remember
>> it correctly.
>> 
>
> Yes, it's a similar idea to the extra edge list chunk, where the most
> significant bit of second parent indicates whether they are more than
> two parents.

Yes and no.  Yes, the solution 'c' uses exactly the same mechanism as
the pointer from Commid Data chunk into Extra Edges List chunk:

      [...]  If there are more than two parents, the second value
      has its most-significant bit on and the other bits store an array
      position into the Extra Edge List chunk.

On the other hand we need to have some kind of overflow handling for the
list of parents, as the number of parents is not limited in Git (there
is no technical upper limit on the number of parents a commits can
have), as opposed to for example Mercurial.  This is not the case for
storing corrected commit date (or corrected commit date offset), as 64
bits is all we would ever need.

> It's definitely feasible, albeit a little complex.
>
> What's the overall consensus on the third solution?

Still waiting for others to weight in.

Best,
-- 
Jakub Narębski




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux