Hello, Abhishek Kumar <abhishekkumar8222@xxxxxxxxx> writes: > On Sun, Aug 23, 2020 at 12:20:57AM +0200, Jakub Narębski wrote: >> "Abhishek Kumar via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes: [...] >>> + * This list of 4-byte values store corrected commit date offsets for the >>> + commits, arranged in the same order as commit data chunk. >> >> I have just realized purely theoretical, but possible, problem with >> storing non-monotinic generation number related values like corrected >> commit date offset in constrained space. There are problems with >> clamping them. >> >> Say that somewhere in the ancestry chain there is a commit A with commit >> date far in the future by mistake, for example 2120-08-22; it is >> important for that date to be not able to be represented using uint32_t. >> Say that a later descendant commit B is malformed, and has committer >> date of 0, that is 1970-01-01. This means that the corrected commit date >> for B must be larger than 2120-08-22 - which for this commit means that >> corrected commit date offset do not fit in 32 bits, and must be clamped >> (replaced) with GENERATION_NUMBER_V2_OFFSET_MAX. >> >> Say that we have commit C that is child of B, and it has correct commit >> date. Because of mistake in commit A, it has corrected commit date of >> more than 2120-08-22 (corrected commit date degenerated into topological >> level plus constant). >> >> Now C can reach B, and B can reach A. However, if we recover corrected >> commit date of B out of its date=0 and offset=GENERATION_NUMBER_V2_OFFSET_MAX >> we get a number that is smaller than correct corrected commit date. We >> will have >> >> gen(A) > date(B) + offset(B) < gen(C) >> >> Which breaks reachability condition guarantee. >> >> If instead we use GENERATION_NUMBER_V2_MAX for commits with clamped >> corrected commit date, that is offset=GENERATION_NUMBER_V2_OFFSET_MAX, >> we would get >> >> gen(A) < GENERATION_NUMBER_V2_MAX > gen(C) >> >> And again reachability condition is broken. >> >> This is a very contrived but possible example. This shouldn't happen, >> but ufortunately it can happen. >> > > Yes, that's very unfortunate. > > Here's a much simpler example: > > A commit P has an reasonable commit date (i.e. after release of Git to > present) D and has a child commit C with committer date 0. Now, the > corrected commiter date of C would D + 1 and the offset would be same too, > as the committer date is zero. This overflows as reasonable dates are of > the order 2 ^ 34. No, we need the value of date D that doesn't fit in 2^32 _unsigned_ value, so it needs to be even more in the future than Y2k38 (2038-01-19 03:14:07), which is related to storing date as a _signed_ 32-bit integer The current-ish Unix epoch time is 1598524281 - let's use it for value of D. Then the offset for commit C would be 1598524282. The current proposal uses 32 bits to store commit date offsets (as unsigned value). The maximum value of offset that we can store is therefore 2^32 - 1, which is 4294967295. corrected commit date offset(C) = 1,598,524,282 GENERATION_NUMBER_V2_MAX = 4,294,967,295 As you can see there is no overflow in the simplified example. >> >> The question is how to deal with this issue. Ignore it as unlikely? >> Switch to storing corrected commit date, which is monotonic, so if there >> is commit with GENERATION_NUMBER_V2_MAX, then subsequent descendant >> commits will also have GENERATION_NUMBER_V2_MAX -- and pay with up to 7% >> larger commit-graph file? >> > > To be honest, I would prefer storing corrected committer dates over > storing offsets. > > While it is 7% of the size of commit-graph file, it is also *only* around > ~3.5 MB for a repository of the size of linux kernel (and IIRC > correctly, the Windows repo has ~2M commits, it amounts to ~8 MB). It is up to 7% of per-commit data, and it doesn't take into account EDGE chunk (for octopus merges), and it doesn't also take into account the size of changed-paths Bloom filters data take in the commit-graph. > Minimizing space and memory requirements are a top priority, but > shouldn't making sure our program is correct and efficient to be a > greater priority? On the other hand the case where we would encounter offsets that do not fit in uint32_t is extremply unlikely in sane repositories. I can think of three solutions: 1. use 64-bit corrected commit dates - advantages: * simplest code, * no need for overflow handling, as we can store all possible values of timestamp_t - disadvantages: * commit-graph size increased by up to 7% 2. use 32-bit corrected commit date offsets, but simply do not store GDAT chunk if there is offset that would not fit in 32-bit wide field - advantages: * commit-graph is smaller * relatively simple overflow handling - disadvantages: * performance penalty (generation number v1 vs v2) for abnormal repositories (with overflow not fitting in uint32_t) * tests would be needed to exercise the overflow code 3. use 32-bit for corrected commit date offset, with oveflow handling, for example using most significant bit to denote that other bits store position into offset overflow with 64-bits for those offsets that do not fit in 31-bits - advantages: * commit-graph is smaller, increasing for abnormal repos - disadvantages: * most complex code of all proposed solutions * smaller overflow limit of 2^31 - 1 * tests would be needed to exercise the overflow code I think because the situation where we encounter overflow in 32-bit corrected commit date offset is rare, we should go with either 1 or 2 solution. > I would love to hear your and Dr. Stolee's opinions on this. I have CC-ed Junio C Hamano to ask for his opinion. >>> + * This list can be later modified to store future generation number related >>> + data. >> >> How can it be later modified? There is no header, no version number. >> How would we add another generation number data? >> > > We could modify the graph version in future. Here's how I think it would > work: > > Graph Version 1, No GDAT -> Topological level > Graph Version 2, GDAT -> Corrected committer dates > Graph Version 3, GDAT -> Generation number v3 > > and so on. > > Of course, we do not have to update generation number definition for > each graph version. So it was about generic mechanism, not something specific to the GDAT chunk. > However, my statement could still be wrong for things that we do not > foresee (similar to how we missed the hard die on different graph version), > so I am removing the statement. Good. [...] >>> +We also merge commit-graph chains when we try to write a commit graph with >>> +two different generation number definitions as they cannot be compared directly. >>> +We overwrite the existing chain and create a commit-graph with the newer or more >>> +efficient defintion. For example, overwriting topological levels commit graph >>> +chain to create a corrected commit dates commit graph chain. >>> + >> >> This is more complicated than that. >> >> I think we should explicitly state that Git ensures that in split >> commit-graph chain, if there are layers without the GDAT chunk (that >> force Git to use topological levels for generation numbers), then they >> are top layers. So if there is commit-graph file created by "Old" Git, >> then when addig new layer it would also be GDAT-less. >> >> Now how to write this... > > Thinking about this, I feel creating a new section called "Handling > Mixed Generation Number Chains" made more sense: > > ## Handling Mixed Generation Number Chains > > With the introduction of generation number v2 and generation data chunk, > the following scenario is possible: > > 1. "New" Git writes a commit-graph with a GDAT chunk. > 2. "Old" Git writes a split commit-graph on top without a GDAT chunk. > > The commits in the lower layer will be interpreted as having very large > generation values (commit date plus offset) compared to the generation > numbers in the top layer (toplogical level). This violates the > expectation that the generation of a parent is strictly smaller than the > generation of a child. In such cases, we revert to using topological > levels for all layers to maintain backwards compatability. > > When writing a new layer in split commit-graph, we write a GDAT chunk > only if the topmost layer has a GDAT chunk. This guarantees that if a > lyer has GDAT chunk, all lower layers must have a GDAT chunk as well. > > Rewriting layers follows similar approach: if the topmost layer below > set of layers being rewriteen (in the split commit-graph chain) exists, > and it does not contain GDAT chunk, then the result of rewrite does not > have GDAT chunks either. Good idea, and nice writeup. Best, -- Jakub Narębski