Hello,

Abhishek Kumar <abhishekkumar8222@xxxxxxxxx> writes:
> On Thu, Sep 03, 2020 at 03:42:43PM +0200, Jakub Narębski wrote:
>> Abhishek Kumar <abhishekkumar8222@xxxxxxxxx> writes:
>>> On Tue, Aug 25, 2020 at 02:18:24PM +0200, Jakub Narębski wrote:
>>>> Abhishek Kumar <abhishekkumar8222@xxxxxxxxx> writes:
>>>>
>>>> ...
>>>>
>>>> However, as I wrote, handling GENERATION_NUMBER_V2_OFFSET_MAX is
>>>> difficult. As far as I can see, we can choose one of the *three*
>>>> solutions (the third one is _new_):
>>>>
>>>> a. store 64-bit corrected commit date in the GDAT chunk;
>>>>    all possible values are able to be stored, so there is no
>>>>    need for GENERATION_NUMBER_V2_MAX,
>>>>
>>>> b. store 32-bit corrected commit date offset in the GDAT chunk;
>>>>    if its value is larger than GENERATION_NUMBER_V2_OFFSET_MAX,
>>>>    do not write the GDAT chunk at all (as for backward compatibility
>>>>    with mixed-version chains of split commit-graph layers),
>>>>
>>>> c. store 32-bit corrected commit date offset in the GDAT chunk,
>>>>    using some kind of overflow handling scheme; for example, if
>>>>    the most significant bit of the 32-bit value is 1, then the
>>>>    remaining 31 bits are a position in the GDOV chunk, which uses
>>>>    64 bits to store those corrected commit date offsets that do
>>>>    not fit in 32 bits.
>>
>> Note that I have posted a more detailed analysis of the advantages and
>> disadvantages of each of the above solutions in response to 11/11:
>> https://public-inbox.org/git/85o8mwb6nq.fsf@xxxxxxxxx/
>>
>> I can think of yet another solution, a variant of approach 'c' with a
>> different overflow handling scheme:
>>
>> c'. Store 32-bit corrected commit date offset in the GDAT chunk,
>>     using the following overflow handling scheme: if the value
>>     is 0xFFFFFFFF (all bits set to 1, the maximum possible value
>>     for uint32_t), then the corrected commit date or corrected
>>     commit date offset can be found in the GDOV chunk (Generation
>>     Data OVerflow handling).
>>
>>     The GDOV chunk is composed of:
>>     - H bytes of commit OID, or 4 bytes (32 bits) of commit pos
>>     - 8 bytes (64 bits) of corrected commit date or its offset
>>
>>     Commits in the GDOV chunk are sorted; as we expect the number
>>     of commits that require GDOV to be zero or a very small number,
>>     there is no need for a GDO Fanout chunk.
>>
>>     - advantages:
>>       * commit-graph is smaller, increasing only for abnormal repos
>>       * overflow limit reduced only by 1 (a single value)
>>     - disadvantages:
>>       * most complex code of all proposed solutions,
>>         even more complicated than for solution 'c',
>>         and different from EDGE chunk handling
>>       * tests would be needed to exercise the overflow code
>>
>> Or we can split overflow handling into two chunks: GDOI (Generation Data
>> Overflow Index) and GDOV, where GDOI would be composed of H bytes of
>> commit OID or 4 bytes of commit-graph position (sorted), and GDOV would
>> be composed only of 8 bytes (64 bits) of corrected commit date data.
>>
>> This c'' variant has the same advantages and disadvantages as c'), with
>> a negligibly larger disk size and possibly slightly better performance
>> because of better data locality.
>>
>
> The primary benefit of c') over c) seems to be the range of valid
> offsets - offsets for c') can range over [0, 0xFFFFFFFF), whereas
> offsets for c) can range over [0, 0x7FFFFFFF].
>
> In other words, we should prefer c') over c) only if the offsets are
> usually in the range [0x7FFFFFFF + 1, 0xFFFFFFFF).
>
> Commits with overflowing corrected committer date offsets are rare, and
> offsets in that particular range are doubly rare. An offset in that
> range corresponds to 68 to 136 years, so that's possible only if the
> corrupted timestamp is in the future (it's been 50.68 years since Unix
> epoch 0 so far).
Right, the c) variant has the same limitation as if corrected commit date
offsets were stored as a signed 32-bit integer (int32_t), so to have an
overflow we would need a date past Y2k38 followed by a date at Unix
epoch 0. Very unlikely.

>
> Thinking back to the linux repository, the largest offset was of the
> order of 2^25 (an offset of 1.06 years) [1], and I would assume that
> holds.
>
> Overall, I don't think the added complexity (compared to the c)
> approach) is made up for by greater versatility.
>
> [1]: https://lore.kernel.org/git/20200703082842.GA28027@Abhishek-Arch/

All right. With variant c) we have the additional advantage that we can
pattern the code on the code for EDGE chunk handling, as you said.

I wanted to warn about the need for sanity checking, like ensuring that
we have the GDOV chunk and that it is large enough -- but it turns out
that we skip this bounds checking for extra edges / the EDGE chunk:

	if (!(edge_value & GRAPH_EXTRA_EDGES_NEEDED)) {
		pptr = insert_parent_or_die(r, g, edge_value, pptr);
		return 1;
	}

	parent_data_ptr = (uint32_t*)(g->chunk_extra_edges +
			  4 * (uint64_t)(edge_value & GRAPH_EDGE_LAST_MASK));
	do {
		edge_value = get_be32(parent_data_ptr);
		pptr = insert_parent_or_die(r, g,
					    edge_value & GRAPH_EDGE_LAST_MASK,
					    pptr);
		parent_data_ptr++;
	} while (!(edge_value & GRAPH_LAST_EDGE));

Best,
-- 
Jakub Narębski