Re: [PATCH v3 05/11] commit-graph: return 64-bit generation number

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Thu, Sep 03, 2020 at 03:42:43PM +0200, Jakub Narębski wrote:
> Hi,
> 
> Abhishek Kumar <abhishekkumar8222@xxxxxxxxx> writes:
> > On Tue, Aug 25, 2020 at 02:18:24PM +0200, Jakub Narębski wrote:
> >> Abhishek Kumar <abhishekkumar8222@xxxxxxxxx> writes:
> >> 
> >> ...
> >> 
> >> However, as I wrote, handling GENERATION_NUMBER_V2_OFFSET_MAX is
> >> difficult.  As far as I can see, we can choose one of the *three*
> >> solutions (the third one is _new_):
> >> 
> >> a. store 64-bit corrected commit date in the GDAT chunk
> >>    all possible values are able to be stored, no need for
> >>    GENERATION_NUMBER_V2_MAX,
> >> 
> >> b. store 32-bit corrected commit date offset in the GDAT chunk,
> >>    if its value is larger than GENERATION_NUMBER_V2_OFFSET_MAX,
> >>    do not write GDAT chunk at all (like for backward compatibility
> >>    with mixed-version chains of split commit-graph layers),
> >> 
> >> c. store 32-bit corrected commit date offset in the GDAT chunk,
> >>    using some kind of overflow handling scheme; for example if
> >>    the most significant bit of 32-bit value is 1, then the
> >>    rest 31-bits are position in GDOV chunk, which uses 64-bit
> >>    to store those corrected commit date offsets that do not
> >>    fit in 32 bits.
> 
> Note that I have posted more detailed analysis of advantages and
> disadvantages of each of the above solutions in response to 11/11
> https://public-inbox.org/git/85o8mwb6nq.fsf@xxxxxxxxx/
> 
> I can think of yet another solution, a variant of approach 'c' with
> different overflow handling scheme:
> 
> c'.  Store 32-bit corrected commit date offset in the GDAT chunk,
>      using the following overflow handling scheme: if the value
>      is 0xFFFFFFFF (all bits set to 1, the maximum possible value
>      for uint32_t), then the corrected commit date or corrected
>      commit date offset can be found in GDOV chunk (Generation
>      Data OVerflow handling).
> 
>      The GDOV chunk is composed of:
>      - H bytes of commit OID, or 4 bytes (32 bits) of commit pos
>      - 8 bytes (64 bits) of corrected commit date or its offset
>      
>      Commits in GDOV chunk are sorted; as we expect for the number
>      of commits that require GDOV to be zero or a very small number
>      there is no need for GDO Fanout chunk.
>      
>    - advantages:
>      * commit-graph is smaller, increasing for abnormal repos
>      * overflow limit reduced only by 1 (a single value)
>    - disadvantages:
>      * most complex code of all proposed solutions
>        even more complicated than for solution 'c',
>        different from EDGE chunk handling
>      * tests would be needed to exercise the overflow code
> 
> Or we can split overflow handling into two chunks: GDOI (Generation Data
> Overflow Index) and GDOV, where GDOI would be composed of H bytes of
> commit OID or 4 bytes of commit graph position (sorted), and GDOV would
> be composed oly of 8 bytes (64 bits) of corrected commit date data.
> 
> This c'') variant has the same advantages and disadvantages as c'), with
> negligibly slightly larger disk size and possibly slightly better
> performance because of better data locality.
> 

The primary benefit of c') over c) seems to the range of valid offsets -
c') can range from [0, 0xFFFFFFFF) whereas offsets for c) can range
betwen [0, 0x7FFFFFF].

In other words, we should prefer c') over c) only if the offsets are
usually in the range [0x7FFFFFFF + 1, 0xFFFFFFFF)

Commits were overflowing corrected committer date offsets are rare, and
offsets in that particular range are doubly rare. To be wrong within
the range would be have an offset of 68 to 136 years, so that's possible
only if the corrupted timestamp is in future (it's been 50.68 years since
Unix epoch 0 so far).

Thikning back to the linux repository, the largest offset was around of
the order of 2 ^ 25 (offset of 1.06 years) and I would assume that holds

Overall, I don't think the added complexity (compared to c) approach)
makes up for by greater versatility.

[1]: https://lore.kernel.org/git/20200703082842.GA28027@Abhishek-Arch/

Thanks
- Abhishek



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux