Re: [PATCH v4 00/10] [GSoC] Implement Corrected Commit Date

Abhishek Kumar <abhishekkumar8222@xxxxxxxxx> · Sun, 22 Nov 2020 11:01:40 +0530

On Thu, Nov 05, 2020 at 12:37:49AM +0100, Jakub Narębski wrote:
> Hi Abhishek,
> 
> "Abhishek Kumar via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:
> 
> > This patch series implements the corrected commit date offsets as generation
> > number v2, along with other pre-requisites.
> 
> Thanks a lot for continued working on this patch series.

Thank you so much for the careful review of the series.

> 
> >
> > Git uses topological levels in the commit-graph file for commit-graph
> > traversal operations like git log --graph. Unfortunately, using topological
> > levels can result in a worse performance than without them when compared
> > with committer date as a heuristics. For example, git merge-base v4.8 v4.9 
> > on the Linux repository walks 635,579 commits using topological levels and
> > walks 167,468 using committer date.
> 
> Very minor nitpick: it would make it easier to read if the commands
> themself would be put inside single quotes or backticks, e.g. `git log
> --graph` and `git merge-base v4.8 v4.9`.

That's unexpected - I wrote the commands within single quotes in the pull
request. Since backticks are rendered as "code-tags" on Github, let me 
try single quotes.

> 
> I wonder if it is worth mentioning (probably not) that this performance
> hit was the reason why since 091f4cf3 `git merge-base` uses committer
> date heuristics unless there is a cutoff and using topological levels
> (generation date v1) is expected to give better performance.
> 

I think that's useful context for someone wondering whether we continue
to take the performance hit with topological levels or have abandoned
topological levels or chosen some another alternative altogether.

> >
> > Thus, the need for generation number v2 was born. New generation number
> > needed to provide good performance, increment updates, and backward
> > compatibility. Due to an unfortunate problem 1
> 
> Minor issue: this looks a bit strange; is there an error in formatting
> this part?

Yes. The plaintext in pull request description reads as follows:

  Thus, the need for generation number v2 was born. New generation number
  needed to provide good performance, increment updates, and backward
  compatibility. Due to an unfortunate problem [1], we also needed a way
  to distinguish between the old and new generation number without
  incrementing graph version.

  [1]: https://public-inbox.org/git/87a7gdspo4.fsf@xxxxxxxxxxxxxxxxxxx/

I have been reviewing other pull request descriptions to match their
style (and hope the cover letter renders correctly) and Dr. Stolee in
his patch series to "add --literal value" option to configuration has
written:

  As reported [1], 'git maintenance unregister' fails when a repository
  is located in a directory with regex glob characters.

  [1] https://lore.kernel.org/git/2c2db228-069a-947d-8446-89f4d3f6181a@xxxxxxxxx/T/#mb96fa4187a0d6aeda097cd95804a8aafc0273022

(Note the lack of colon after [1])

> 
> > [https://public-inbox.org/git/87a7gdspo4.fsf@xxxxxxxxxxxxxxxxxxx/], we also
> > needed a way to distinguish between the old and new generation number
> > without incrementing graph version.
> >
> > Various candidates were examined (https://github.com/derrickstolee/gen-test, 
> > https://github.com/abhishekkumar2718/git/pull/1). The proposed generation
> > number v2, Corrected Commit Date with Mononotically Increasing Offsets 
> > performed much worse than committer date (506,577 vs. 167,468 commits walked
> > for git merge-base v4.8 v4.9) and was dropped.
> >
> > Using Generation Data chunk (GDAT) relieves the requirement of backward
> > compatibility as we would continue to store topological levels in Commit
> > Data (CDAT) chunk.
> 
> Nice writeup about the history of generation number v2, much appreciated.
> 
> >                    Thus, Corrected Commit Date was chosen as generation
> > number v2. The Corrected Commit Date is defined as:
> 
> Minor nitpick: it would be probably better to use "is defined as
> follows." instead of "is defined as:".
> 
> >
> > For a commit C, let its corrected commit date be the maximum of the commit
> > date of C and the corrected commit dates of its parents plus 1. Then 
> > corrected commit date offset is the difference between corrected commit date
> > of C and commit date of C. As a special case, a root commit with timestamp
> > zero has corrected commit date of 1 to be able distinguish it from
> > GENERATION_NUMBER_ZERO (that is, an uncomputed corrected commit date).
> 
> Very minor nitpick: s/with timestamp/with *the* timestamp/, and
> s/to be able distinguish/to be able *to* distinguish/ (without the '*'
> used to mark the additions).
> 
> >
> > We will introduce an additional commit-graph chunk, Generation Data chunk,
> 
> Or "Generation DATa chunk", if we want to emphasize where its name came
> from, or even "Generation DATa (GDAT) chunk". But it is fine as it is
> now, though it would be good idea to write "Generation Data (GDAT)
> chunk" to explicitly state its name / shortcut.
> 
> > and store corrected commit date offsets in GDAT chunk while storing
> > topological levels in CDAT chunk. The old versions of Git would ignore GDAT
> > chunk, using topological levels from CDAT chunk. In contrast, new versions
> > of Git would use corrected commit dates, falling back to topological level
> > if the generation data chunk is absent in the commit-graph file.
> 
> Nice writeup of handling the backward compatibility.
> 
> >
> > While storing corrected commit date offsets saves us 4 bytes per commit (as
> > compared with storing corrected commit dates directly), it's possible for
> > the offset to overflow the space allocated. To handle such cases, we
> > introduce a new chunk, Generation Data Overflow (GDOV) that stores the
> > corrected commit date. For overflowing offsets, we set MSB and store the
> > position into the GDOV chunk, in a mechanism similar to the Extra Edges list
> > chunk.
> 
> Very minor suggestion: perhaps it would be better to use "it's however
> possible".
> 
> Very minor suggestion: "it's possible for the offset to overflow" could
> be simplified to just "the offset can overflow"... though the simplified
> version loses a bit of hint that the overflow should be very rare in
> real repositories.
> 
> But it is just fine as it is now; I am not a native English speaker to
> judge which version is better.
> 

I think it is better to indicate the rareness of overflows.

> >
> > For mixed generation number environment (for example new Git on the command
> > line, old Git used by GUI client), we can encounter a mixed-chain
> > commit-graph (a commit-graph chain where some of split commit-graph files
> > have GDAT chunk and others do not). As backward compatibility is one of the
> > goals, we can define the following behavior:
> >
> > While reading a mixed-chain commit-graph version, we fall back on
> > topological levels as corrected commit dates and topological levels cannot
> > be compared directly.
> >
> > While writing on top of a split commit-graph, we check if the tip of the
> > chain has a GDAT chunk. If it does, we append to the chain, writing GDAT
> > chunk. Thus, we guarantee if the topmost split commit-graph file has a GDAT
> > chunk, rest of the chain does too.
> >
> > If the topmost split commit-graph file does not have a GDAT chunk (meaning
> > it has been appended by the old Git), we write without GDAT chunk. We do
> > write a GDAT chunk when the existing chain does not have GDAT chunk - when
> > we are writing to the commit-graph chain with the 'replace' strategy.
> 
> I think the last paragraph can be simplified (or added to) by explicitly
> stating the goal:
> 
>   When adding new layer to the split commit-graph file, and when merging
>   some or all layers (replacing them in the latter case), the new layer
>   will have GDAT chunk if and only if in the final result there would be
>   no layer without GDAT chunk just below it.
> 

Thanks, that is much clearer to understand.

> ...
> 
> After careful review of those 10 patches it looks like the series is
> close to being ready, requiring only small changes to progress.
> 

Thank you for writing this handy reference for changes.

> > Abhishek Kumar (10):
> >   commit-graph: fix regression when computing Bloom filters
> 
>     All good, beside possible improvement to the commit message.
>     Thanks to Taylor Blau for discovering possible reason for strange
>     no change in performance.
> 
> >   revision: parse parent in indegree_walk_step()
> 
>     Looks good.
> 
> >   commit-graph: consolidate fill_commit_graph_info
> 
>     Needs to fix now duplicated test names (minor change).
>     Proposed possible improvement to the commit message.
> 
> >   commit-graph: return 64-bit generation number
> 
>     Needs fixing due to mismerge: there should be no switch from
>     using GENERATION_NUMBER_ZERO to using GENERATION_NUMBER_INFINITY.
>     Possible minor improvement to the commit message.
> 
> >   commit-graph: add a slab to store topological levels
> 
>     Possible minor improvement to the commit message.
>     
>     There is also not very important issue, but something that would be
>     nice to explain, namely that checks for GENERATION_NUMBER_INFINITY 
>     can never be true, as topo_level_slab_at() returns 0 for commits
>     outside the commit-graph, not GENERATION_NUMBER_INFINITY.  It works
>     but it is not obvious why.
> 
> >   commit-graph: implement corrected commit date
> 
>     The change to commit-graph verification needs fixing, and we need to
>     decide how verifying generation numbers should work.  Perhaps a test
>     for handling topological level of GENERATION_NUMBER_V1_MAX could be
>     added (though this might be left for ater).
> 
>     The changes to `git commit-graph verify` code could be put into
>     separate patch, either before or after this one.
> 
> >   commit-graph: implement generation data chunk
> 
>     Proposed possible improvement to the commit message.
>     The commit message does not explain why given shape of history is
>     needed to test handling corrected commit date offset overflow.
> 
>     Proposed minor corrections to the coding style.
> 
>     Instead of looping again through all commits when handling overflow
>     in corrected commit date offsets, while there should be at most a
>     few commits needing it, why not save those commits on list and loop
>     only through those commits?  Though this _possible_ performance
>     improvement could be left to the followup...

Since the improvement can be applied to both 
`write_graph_chunk_generation_data_overflow()` and 
`write_graph_chunk_extra_edges()`, I am planning to cover this in a
followup.

> 
>     test_commit_with_date() could be instead implemented via adding
>     `--date <date>` option to test_commit() in test-lib-functions.sh.
> 
>     Also, to reduce "noise" in this patch, the rename of
>     run_three_modes() to run_all_modes() and test_three_modes() to
>     test_all_modes() could have been done in a separate preparatory
>     patch. It would be pure refactoring patch, without introducing any
>     new functionality.  But it is not something that is necessary.
> 
> >   commit-graph: use generation v2 only if entire chain does
> 
>     Proposed possible improvement to the commit message.
>     Proposed minor corrections to the coding style (also in tests).
> 
>     There is a question whether merging layers or replacing them should
>     honor GIT_TEST_COMMIT_GRAPH_NO_GDAT.
> 
>     Tests possibly could be made more strict, and check more things
>     explicitly. One test we are missing is testing that merging layers
>     is done correctly, namely that if we are merging layers in split
>     commit-graph file, and the layer below the ones we are merging lacks
>     GDAT chunk, then the result of the merge should also be without GDAT
>     chunk -- but that might be left for later.
> 
> >   commit-reach: use corrected commit dates in paint_down_to_common()
> 
>     This patch consist of two slightly interleaved changes, which
>     possibly could be separated: change to paint_down_to_common() and
>     change to t6404-recursive-merge test.
> 
>     In the commit message for the paint_down_to_common() we should
>     explicitly mention 091f4cf3, which this one partially reverts.
> 
>     Possible accidental change, question about function naming.
> 
> >   doc: add corrected commit date info
> 
>     Needs further improvements to the documentation, like adding
>     "[Optional]" to chunk description, and leftover switching from
>     "generation numbers" to "topological levels" in one place.
> 
> ...
> 
> Best,
> -- 
> Jakub Narębski

Thanks
- Abhishek