On 12/28/2020 6:15 AM, Abhishek Kumar via GitGitGadget wrote:
> This patch series implements the corrected commit date offsets as generation
> number v2, along with other pre-requisites.

Abhishek,

Thank you for this version. I appreciate your hard work on this topic,
especially after GSoC ended and you returned to being a full-time
student.

My hope was that I could completely approve this series and only
provide forward-fixes from here on out, as necessary. There are a few
minor typos that you might want to address, but I was able to
understand your intention despite them.

I did call out a SEGFAULT I hit that I have been unable to replicate.
I saw it both in my copy of torvalds/linux and of chromium/chromium. I
still have the chromium/chromium commit-graph file that is in a bad
state: a GDAT value has the bit set saying it should be in the long
offsets chunk, but that chunk doesn't exist. Further, that chunk
doesn't exist in a from-scratch write. I'm now taking backups of my
existing commit-graph files before any later test, but I cannot
reproduce the failure in my Git repository or in any other repo I try
on purpose.

However, I did some performance testing to double-check your numbers.
I sent a patch [1] that helps with some of the hard numbers.

[1] https://lore.kernel.org/git/pull.828.git.1609302714183.gitgitgadget@xxxxxxxxx/

The big question is whether the overhead from using a slab to store
the generation values is worth it. I still think it is, for these
reasons:

 1. Generation number v2 is measurably better than v1 in most user
    cases.

 2. Generation number v2 is slower than using committer date due to
    the overhead, but _guarantees correctness_.

I like to use "git log --graph -<N>" to compare against topological
levels (v1), for various values of <N>. When <N> is small, we hope to
minimize how much of the graph we need to walk, using the extra
commit-date information as an assist. Repos like git/git and
torvalds/linux follow the philosophy of "base your changes on the
oldest applicable commit" often enough that v1 sometimes struggles.
(A sketch of the benchmark setup appears after the merge-base
discussion below.)

git/git: N=1000

Benchmark #1: baseline
  Time (mean ± σ):     100.3 ms ±   4.2 ms    [User: 89.0 ms, System: 11.3 ms]
  Range (min … max):    94.5 ms … 105.1 ms    28 runs

Benchmark #2: test
  Time (mean ± σ):      35.8 ms ±   3.1 ms    [User: 29.6 ms, System: 6.2 ms]
  Range (min … max):    29.8 ms …  40.6 ms    81 runs

Summary
  'test' ran
    2.80 ± 0.27 times faster than 'baseline'

This is a dramatic improvement! Using my topo-walk stats commit, I see
that v1 walks 58,805 commits as part of the in-degree walk while v2
only walks 4,335 commits!

torvalds/linux: N=1000 (starting at v5.10)

Benchmark #1: baseline
  Time (mean ± σ):      90.8 ms ±   3.7 ms    [User: 75.2 ms, System: 15.6 ms]
  Range (min … max):    85.2 ms …  96.2 ms    31 runs

Benchmark #2: test
  Time (mean ± σ):      49.2 ms ±   3.5 ms    [User: 36.9 ms, System: 12.3 ms]
  Range (min … max):    42.9 ms …  54.0 ms    61 runs

Summary
  'test' ran
    1.85 ± 0.15 times faster than 'baseline'

Similarly, v1 walked 38,161 commits compared to 4,340 by v2.

If I increase N to something like 10,000, then these values usually
get washed out due to the width of the parallel topics.

The place where we were still using commit date as a heuristic was
paint_down_to_common(), which caused a regression the first time we
used v1, at least for certain cases. Specifically, computing the
merge-base in torvalds/linux between v4.8 and v4.9 hit a strangeness
about a pair of recent commits both based on a very old commit, where
the generation numbers forced us to walk farther than necessary.
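(For reference, the timings in this message come from hyperfine runs
shaped roughly like the sketch below. The paths ~/git-baseline,
~/git-test, and ~/linux are placeholders of my own choosing rather
than an exact record of my invocations; "baseline" is a build without
this series and "test" is a build with it, each reading a commit-graph
file written by the matching binary.)

  # Hypothetical build and repo locations; adjust to taste.
  # The "git log --graph -<N>" comparison, here with N=1000:
  hyperfine --warmup 3 \
    -n baseline '~/git-baseline/bin-wrappers/git -C ~/linux log --graph --oneline -1000 >/dev/null' \
    -n test '~/git-test/bin-wrappers/git -C ~/linux log --graph --oneline -1000 >/dev/null'

  # The merge-base case discussed just above and below:
  hyperfine --warmup 3 \
    -n baseline '~/git-baseline/bin-wrappers/git -C ~/linux merge-base v4.8 v4.9' \
    -n test '~/git-test/bin-wrappers/git -C ~/linux merge-base v4.8 v4.9'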
This doesn't happen with v2, but we do see the overhead cost of the
slabs:

Benchmark #1: baseline
  Time (mean ± σ):     112.9 ms ±   2.8 ms    [User: 96.5 ms, System: 16.3 ms]
  Range (min … max):   107.7 ms … 118.0 ms    26 runs

Benchmark #2: test
  Time (mean ± σ):     147.1 ms ±   5.2 ms    [User: 132.7 ms, System: 14.3 ms]
  Range (min … max):   141.4 ms … 162.2 ms    18 runs

Summary
  'baseline' ran
    1.30 ± 0.06 times faster than 'test'

The overhead still exists for a more recent pair of versions (v5.0 and
v5.1):

Benchmark #1: baseline
  Time (mean ± σ):      25.1 ms ±   3.2 ms    [User: 18.6 ms, System: 6.5 ms]
  Range (min … max):    19.0 ms …  32.8 ms    99 runs

Benchmark #2: test
  Time (mean ± σ):      33.3 ms ±   3.3 ms    [User: 26.5 ms, System: 6.9 ms]
  Range (min … max):    27.0 ms …  38.4 ms    105 runs

Summary
  'baseline' ran
    1.33 ± 0.22 times faster than 'test'

I still think this overhead is worth it. In case not everyone agrees,
it _might_ be worth a command-line option to skip writing the GDAT
chunk. Note, though, that keeping such an escape hatch would prevent
us from eventually weaning entirely off of generation number v1 and
letting the commit date take the full 64-bit column (instead of only
34 bits, with the other 30 saved for topo-levels). Again, such a
modification should not be considered required for this series.

> ----------------------------------------------------------------------------
>
> Improvements left for a future series:
>
>  * Save commits with generation data overflow and extra edge commits instead
>    of looping over all commits. cf. 858sbel67n.fsf@xxxxxxxxx
>  * Verify both topological levels and corrected commit dates when present.
>    cf. 85pn4tnk8u.fsf@xxxxxxxxx

These seem like reasonable things to delay for a later series or for
#leftoverbits.

Thanks,
-Stolee