Hello,

Jakub Narębski <jnareb@xxxxxxxxx> writes:
> Abhishek Kumar <abhishekkumar8222@xxxxxxxxx> writes:
>> On Thu, Sep 03, 2020 at 03:42:43PM +0200, Jakub Narębski wrote:
>>> Abhishek Kumar <abhishekkumar8222@xxxxxxxxx> writes:
>>>> On Tue, Aug 25, 2020 at 02:18:24PM +0200, Jakub Narębski wrote:
>>>>> Abhishek Kumar <abhishekkumar8222@xxxxxxxxx> writes:
>>>>>
>>>>> ...
>>>>>
>>>>> However, as I wrote, handling GENERATION_NUMBER_V2_OFFSET_MAX is
>>>>> difficult. As far as I can see, we can choose one of the *three*
>>>>> solutions (the third one is _new_):
>>>>>
>>>>> a. store the 64-bit corrected commit date in the GDAT chunk;
>>>>>    all possible values can be stored, with no need for
>>>>>    GENERATION_NUMBER_V2_MAX,
>>>>>
>>>>> b. store the 32-bit corrected commit date offset in the GDAT chunk;
>>>>>    if its value is larger than GENERATION_NUMBER_V2_OFFSET_MAX,
>>>>>    do not write the GDAT chunk at all (as for backward compatibility
>>>>>    with mixed-version chains of split commit-graph layers),
>>>>>
>>>>> c. store the 32-bit corrected commit date offset in the GDAT chunk,
>>>>>    using some kind of overflow handling scheme; for example, if
>>>>>    the most significant bit of the 32-bit value is 1, then the
>>>>>    remaining 31 bits are a position in the GDOV chunk, which uses
>>>>>    64 bits to store those corrected commit date offsets that do
>>>>>    not fit in 32 bits.
>>>
>>> Note that I have posted a more detailed analysis of the advantages and
>>> disadvantages of each of the above solutions in response to 11/11:
>>> https://public-inbox.org/git/85o8mwb6nq.fsf@xxxxxxxxx/
>>>
>>> I can think of yet another solution, a variant of approach 'c' [...]
>>
>> The primary benefit of c') over c) seems to be the range of valid
>> offsets: c') can range over [0, 0xFFFFFFFF), whereas offsets for c)
>> can range between [0, 0x7FFFFFFF].
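[Editor's note: the MSB-flag scheme of variant c) can be sketched in a few
lines of C. Everything here -- the helper names, the constants, and the
in-memory GDOV array -- is illustrative only, not git's actual commit-graph
implementation:]

```c
#include <stdint.h>

/* Illustrative constants: the most significant bit of a 32-bit GDAT
 * entry marks "this value is a position in the GDOV chunk"; the low
 * 31 bits carry either the offset itself or that GDOV position.
 */
#define GDAT_OVERFLOW_BIT 0x80000000u
#define GDAT_OFFSET_MASK  0x7fffffffu

/* Encode a 64-bit corrected commit date offset as a 32-bit GDAT entry.
 * Offsets up to 0x7FFFFFFF are stored inline; larger ones are appended
 * to a (hypothetical) GDOV array of 64-bit values and referenced by
 * their position there.
 */
static uint32_t gdat_encode(uint64_t offset, uint64_t *gdov,
			    uint32_t *gdov_len)
{
	if (offset <= GDAT_OFFSET_MASK)
		return (uint32_t)offset;
	gdov[*gdov_len] = offset;
	return GDAT_OVERFLOW_BIT | (*gdov_len)++;
}

/* Decode a GDAT entry, following the overflow bit into GDOV if set. */
static uint64_t gdat_decode(uint32_t entry, const uint64_t *gdov)
{
	if (!(entry & GDAT_OVERFLOW_BIT))
		return entry;
	return gdov[entry & GDAT_OFFSET_MASK];
}
```

[This is the same pattern the EDGE chunk uses for octopus merges: a flag
bit plus a position into a secondary chunk.]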
>>
>> In other words, we should prefer c') over c) only if the offsets are
>> usually in the range [0x7FFFFFFF + 1, 0xFFFFFFFF).
>>
>> Commits whose corrected committer date offsets overflow are rare, and
>> offsets in that particular range are doubly rare. To fall within that
>> range, an offset would have to be between 68 and 136 years, which is
>> possible only if the corrupted timestamp is in the future (it has been
>> 50.68 years since Unix epoch 0 so far).
>
> Right, the c) variant has the same limitation as if corrected commit
> date offsets were stored as signed 32-bit integers (int32_t), so to
> overflow we would need a date past Y2k38 followed by a date at Unix
> epoch 0. Very unlikely.
>
>> Thinking back to the linux repository, the largest offset was of the
>> order of 2^25 (an offset of 1.06 years), and I would assume that holds.
>>
>> Overall, I don't think the added complexity (compared to the c)
>> approach) is made up for by the greater versatility.
>>
>> [1]: https://lore.kernel.org/git/20200703082842.GA28027@Abhishek-Arch/
>
> All right.
>
> With variant c) we have the additional advantage that we can pattern the
> code on the code for EDGE chunk handling, as you said.
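[Editor's note: a reader for such GDAT entries, patterned on the EDGE chunk
handling but with explicit bounds checking on the overflow chunk, might look
roughly like the sketch below. The names and layout are hypothetical, not
git's actual API; the point is that a missing or too-small GDOV chunk is
detected rather than read past:]

```c
#include <stddef.h>
#include <stdint.h>

/* Look up a corrected commit date offset for position `pos`.
 * `gdat` holds 32-bit entries; entries with the high bit set point
 * into `gdov`, an array of `gdov_count` 64-bit values.
 * Returns 0 on success, -1 if the GDOV reference is out of bounds
 * (e.g. the GDOV chunk is missing or too small).
 */
static int read_corrected_offset(const uint32_t *gdat,
				 const uint64_t *gdov, size_t gdov_count,
				 size_t pos, uint64_t *out)
{
	uint32_t entry = gdat[pos];
	uint32_t gdov_pos;

	if (!(entry & 0x80000000u)) {
		*out = entry;	/* offset fits in 31 bits, stored inline */
		return 0;
	}

	gdov_pos = entry & 0x7fffffffu;
	if (!gdov || gdov_pos >= gdov_count)
		return -1;	/* sanity check the EDGE code skips */

	*out = gdov[gdov_pos];
	return 0;
}
```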
>
> I wanted to warn about the need for sanity checking, like ensuring that
> we have a GDOV chunk and that it is large enough -- but it turns out
> that we skip this bounds checking for extra edges / the EDGE chunk:
>
> 	if (!(edge_value & GRAPH_EXTRA_EDGES_NEEDED)) {
> 		pptr = insert_parent_or_die(r, g, edge_value, pptr);
> 		return 1;
> 	}
>
> 	parent_data_ptr = (uint32_t*)(g->chunk_extra_edges +
> 			4 * (uint64_t)(edge_value & GRAPH_EDGE_LAST_MASK));
> 	do {
> 		edge_value = get_be32(parent_data_ptr);
> 		pptr = insert_parent_or_die(r, g,
> 					    edge_value & GRAPH_EDGE_LAST_MASK,
> 					    pptr);
> 		parent_data_ptr++;
> 	} while (!(edge_value & GRAPH_LAST_EDGE));

Both Taylor Blau and Junio C Hamano agree that it is better to store
corrected commit date offsets and have overflow handling, halving (that
is, reducing by 50%) the size of the GDAT chunk. I have not heard from
Derrick Stolee.

It looks then that this is the way to go; as I said, you have convinced
me that variant 'c' (EDGE-like) is the best solution for overflow
handling.

Best,
-- 
Jakub Narębski