Re: [PATCH v3 00/11] [GSoC] Implement Corrected Commit Date

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



"Abhishek Kumar via GitGitGadget" <gitgitgadget@xxxxxxxxx> writes:

> This patch series implements the corrected commit date offsets as generation
> number v2, along with other pre-requisites.

I'm not sure if this level of detail is required in the cover letter for
the series, but generation number v2 is corrected commit date; corrected
commit date offsets is how we store this value in the commit-graph file.

>
> Git uses topological levels in the commit-graph file for commit-graph
> traversal operations like git log --graph. Unfortunately, using topological
> levels can result in a worse performance than without them when compared
> with committer date as a heuristics. For example, git merge-base v4.8 v4.9
> on the Linux repository walks 635,579 commits using topological levels and
> walks 167,468 using committer date.

I would say "committer date heuristics" instead of just "committer
date", to be more exact.

Is this data generated using https://github.com/derrickstolee/gen-test
scripts?

>
> Thus, the need for generation number v2 was born. New generation number
> needed to provide good performance, increment updates, and backward
> compatibility. Due to an unfortunate problem, we also needed a way to
> distinguish between the old and new generation number without incrementing
> graph version.

It would be nice to have reference email (or other place with details)
for "unfortunate problem".

>
> Various candidates were examined (https://github.com/derrickstolee/gen-test,
> https://github.com/abhishekkumar2718/git/pull/1). The proposed generation
> number v2, Corrected Commit Date with Mononotically Increasing Offsets
> performed much worse than committer date (506,577 vs. 167,468 commits walked
> for git merge-base v4.8 v4.9) and was dropped.
>
> Using Generation Data chunk (GDAT) relieves the requirement of backward
> compatibility as we would continue to store topological levels in Commit
> Data (CDAT) chunk. Thus, Corrected Commit Date was chosen as generation
> number v2.

This is a bit of simplification, but good enough for a cover letter.

To be more exact, from various candidates the Corrected Commit Date was
chosen.  Then it turned out that old Git crashes on changed commit-graph
format version value, so if the generation number v2 was to replace v1
it needed to be backward-compatibile: hence the idea of Corrected Commit
Date with Monotonically Increasing Offsets.  But with GDAT chunk to
store generation number v2 (and for the time being leaving generation
number v1, i.e. Topological Levels, in CDAT), we are no longer
constrained by the requirement of backward-compatibility to make old Git
work with commit-graph file created by new Git.  So we could go back to
Corrected Commit Date, and as you wrote above the backward-compatibile
variant performs worse.

> The Corrected Commit Date is defined as:
>
> For a commit C, let its corrected commit date be the maximum of the commit
> date of C and the corrected commit dates of its parents.

Actually it needs to be "corrected commit dates of its parents plus 1"
to fulfill the reachability condition for a generation number for a
commit:

      A can reach B   =>  gen(A) < gen(B)

Of course it can be computed in simpler way, because

  max_P (gen(P) + 1)  ==  max_P (gen(P)) + 1


>                                                           Then corrected
> commit date offset is the difference between corrected commit date of C and
> commit date of C.

All right.

>
> We will introduce an additional commit-graph chunk, Generation Data chunk,
> and store corrected commit date offsets in GDAT chunk while storing
> topological levels in CDAT chunk. The old versions of Git would ignore GDAT
> chunk, using topological levels from CDAT chunk. In contrast, new versions
> of Git would use corrected commit dates, falling back to topological level
> if the generation data chunk is absent in the commit-graph file.

All right.

However I think the cover letter should also describe what should happen
in a mixed version environment (for example new Git on command line,
copy of old Git used by GUI client), and in particular what should
happen in a mixed-chain case - both for reading and for writing the
commit-graph file.

For *writing*: because old Git would create commit-graph layers without
the GDAT chunk, to simplify the behavior and make easy to reason about
commit-graph data (the situation should be not that common, and
transient -- it should get more rare as the time goes), we want the
following behavior from new Git:

- If top layer contains the GDAT chunk, or we are rewriting commit-graph
  file (--split=replace), or we are merging layers and there are no
  layers without GDAT chunk below set of layers that are merged, then

     write commit-graph file or commit-graph layer with GDAT chunk,

  otherwise

     write commit-graph layer without GDAT chunk.

  This means that there are commit-graph layers without GDAT chunk if
  and only if the top layer is also without GDAT chunk.


For *reading* we want to use generation number v2 (corrected commit
date) if possible, and fall back to generation number v1 (topological
levels).

- If the top layer contains the GDAT chunk (or maybe even if the topmost
  layer that involves all commits in question, not necessarily the top
  layer in the full commit-graph chain), then use generation number v2

  - commit_graph_data->generation stores corrected commit date,
    computed as sum of committer date (from CDAT) and offset (from GDAT)

  - A can reach B   =>  gen(A) < gen(B)

  - there is no need for committer date heuristics, and no need for
    limiting use of generation number to where there is a cutoff (to not
    hamper performance).

- If there are layers without GDAT chunks, which thanks to the write
  behavior means simply top layer without GDAT chunk, we need to turn
  off use of generation numbers or fall back to using topological levels

  - commit_graph_data->generation stores topological levels,
    taken from CDAT chunk (30-bits)

  - A can reach B   =>  gen(A) < gen(B)

  - we probably want to keep tie-breaking of sorting by generation
    number via committer date, and limit use of generation number as
    opposed to using committer date heuristics (with slop) to not make
    performance worse.

>
> Thanks to Dr. Stolee, Dr. Narębski, and Taylor for their reviews on the
> first version.
>
> I look forward to everyone's reviews!
>
> Thanks
>
>  * Abhishek
>
>
> ----------------------------------------------------------------------------
>
> Changes in version 3:
>
>  * Reordered patches as discussed in 1
>    [https://lore.kernel.org/git/aee0ae56-3395-6848-d573-27a318d72755@xxxxxxxxx/]

If I remember it correctly this was done to always store in GDAT chunk
corrected commit date offsets, isn't it?

>  * Split "implement corrected commit date" into two patches - one
>    introducing the topo level slab and other implementing corrected commit
>    dates.

All right.

I think it might be good idea to split off the change to tar file tests
(as a preparatory patch), to make reviews and bisecting easier.

>  * Extended split-commit-graph tests to verify at the end of test.

Do we also test for proper merging of split commit-graph layers, not
only adding a new layer and a full rewrite (--split=replace)?

>  * Use topological levels as generation number if any of split commit-graph
>    files do not have generation data chunk.

That is good for performance.

>
> Changes in version 2:
>
>  * Add tests for generation data chunk.

Good.

>  * Add an option GIT_TEST_COMMIT_GRAPH_NO_GDAT to control whether to write
>    generation data chunk.

Good, that is needed for testing mixed-version behavior.

>  * Compare commits with corrected commit dates if present in
>    paint_down_to_common().

All right, but see the caveat.

>  * Update technical documentation.

Always a good thing.

>  * Handle mixed graph version commit chains.

Where by "version" you mean generation number version - the commit-graph
version number unfortunately needs to stay the same...

>  * Improve commit messages for
                                ^^^^^^
Something missing in this point, the sentence ends abruptly.

>  * Revert unnecessary whitespace changes.

Thanks.

>  * Split uint_32 -> timestamp_t change into a new commit.

It is usually better to keep the commits small.  Good.


Good work!

>
> Abhishek Kumar (11):
>   commit-graph: fix regression when computing bloom filter
>   revision: parse parent in indegree_walk_step()
>   commit-graph: consolidate fill_commit_graph_info
>   commit-graph: consolidate compare_commits_by_gen
>   commit-graph: return 64-bit generation number
>   commit-graph: add a slab to store topological levels
>   commit-graph: implement corrected commit date
>   commit-graph: implement generation data chunk
>   commit-graph: use generation v2 only if entire chain does
>   commit-reach: use corrected commit dates in paint_down_to_common()
>   doc: add corrected commit date info
>
>  .../technical/commit-graph-format.txt         |  12 +-
>  Documentation/technical/commit-graph.txt      |  45 ++--
>  commit-graph.c                                | 241 +++++++++++++-----
>  commit-graph.h                                |  16 +-
>  commit-reach.c                                |  49 ++--
>  commit-reach.h                                |   2 +-
>  commit.c                                      |   9 +-
>  commit.h                                      |   4 +-
>  revision.c                                    |  13 +-
>  t/README                                      |   3 +
>  t/helper/test-read-graph.c                    |   2 +
>  t/t4216-log-bloom.sh                          |   4 +-
>  t/t5000-tar-tree.sh                           |   4 +-
>  t/t5318-commit-graph.sh                       |  27 +-
>  t/t5324-split-commit-graph.sh                 |  82 +++++-
>  t/t6024-recursive-merge.sh                    |   4 +-
>  t/t6600-test-reach.sh                         |  62 +++--
>  upload-pack.c                                 |   2 +-
>  18 files changed, 396 insertions(+), 185 deletions(-)
>
>
> base-commit: 7814e8a05a59c0cf5fb186661d1551c75d1299b5
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-676%2Fabhishekkumar2718%2Fcorrected_commit_date-v3
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-676/abhishekkumar2718/corrected_commit_date-v3
> Pull-Request: https://github.com/gitgitgadget/git/pull/676
[...]

Best,
--
Jakub Narębski




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux