On Tue, Dec 29, 2020 at 11:35:56PM -0500, Derrick Stolee wrote:
> On 12/28/2020 6:15 AM, Abhishek Kumar via GitGitGadget wrote:
> > This patch series implements the corrected commit date offsets as
> > generation number v2, along with other pre-requisites.
>
> Abhishek,
>
> Thank you for this version. I appreciate your hard work on this topic,
> especially after GSoC ended and you returned to being a full-time
> student.
>
> My hope was that I could completely approve this series and only
> provide forward-fixes from here on out, as necessary. I think there
> are a few minor typos that you might want to address, but I was also
> able to understand your intention.
>
> I did make a particular case about a SEGFAULT I hit that I have been
> unable to replicate. I saw it both in my copy of torvalds/linux and of
> chromium/chromium. I have the file for chromium/chromium that is in a
> bad state where a GDAT value includes the bit saying it should be in
> the long offsets chunk, but that chunk doesn't exist. Further, that
> chunk doesn't exist in a from-scratch write.

I hope that validating the mixed generation chain while writing as well
was enough to fix the SEGFAULT.

> I'm now taking backups of my existing commit-graph files before any
> later test, but it doesn't repro for my Git repository or any other
> repo I try on purpose.
>
> However, I did some performance testing to double-check your numbers.
> I sent a patch [1] that helps with some of the hard numbers.
>
> [1] https://lore.kernel.org/git/pull.828.git.1609302714183.gitgitgadget@xxxxxxxxx/
>
> The big question is whether the overhead from using a slab to store
> the generation values is worth it. I still think it is, for these
> reasons:
>
> 1. Generation number v2 is measurably better than v1 in most user
>    cases.
>
> 2. Generation number v2 is slower than using committer date due to the
>    overhead, but _guarantees correctness_.
> I like to use "git log --graph -<N>" to compare against topological
> levels (v1), for various levels of <N>. When <N> is small, we hope to
> minimize the amount we need to walk using the extra commit-date
> information as an assistance. Repos like git/git and torvalds/linux
> use the philosophy of "base your changes on oldest applicable commit"
> enough that v1 struggles sometimes.
>
> git/git: N=1000
>
> Benchmark #1: baseline
>   Time (mean ± σ):     100.3 ms ±   4.2 ms    [User: 89.0 ms, System: 11.3 ms]
>   Range (min … max):    94.5 ms … 105.1 ms    28 runs
>
> Benchmark #2: test
>   Time (mean ± σ):      35.8 ms ±   3.1 ms    [User: 29.6 ms, System: 6.2 ms]
>   Range (min … max):    29.8 ms …  40.6 ms    81 runs
>
> Summary
>   'test' ran
>     2.80 ± 0.27 times faster than 'baseline'
>
> This is a dramatic improvement! Using my topo-walk stats commit, I see
> that v1 walks 58,805 commits as part of the in-degree walk while v2
> only walks 4,335 commits!
>
> torvalds/linux: N=1000 (starting at v5.10)
>
> Benchmark #1: baseline
>   Time (mean ± σ):      90.8 ms ±   3.7 ms    [User: 75.2 ms, System: 15.6 ms]
>   Range (min … max):    85.2 ms …  96.2 ms    31 runs
>
> Benchmark #2: test
>   Time (mean ± σ):      49.2 ms ±   3.5 ms    [User: 36.9 ms, System: 12.3 ms]
>   Range (min … max):    42.9 ms …  54.0 ms    61 runs
>
> Summary
>   'test' ran
>     1.85 ± 0.15 times faster than 'baseline'
>
> Similarly, v1 walked 38,161 commits compared to 4,340 by v2.
>
> If I increase N to something like 10,000, then usually these values
> get washed out due to the width of the parallel topics.

That's not too bad, as large N would be needed rather infrequently.

> The place we were still using commit-date as a heuristic was
> paint_down_to_common, which caused a regression the first time we used
> v1, at least for certain cases.
>
> Specifically, computing the merge-base in torvalds/linux between v4.8
> and v4.9 hit a strangeness about a pair of recent commits both based
> on a very old commit, but the generation numbers forced walking
> farther than necessary.
> This doesn't happen with v2, but we see the overhead cost of the
> slabs:
>
> Benchmark #1: baseline
>   Time (mean ± σ):     112.9 ms ±   2.8 ms    [User: 96.5 ms, System: 16.3 ms]
>   Range (min … max):   107.7 ms … 118.0 ms    26 runs
>
> Benchmark #2: test
>   Time (mean ± σ):     147.1 ms ±   5.2 ms    [User: 132.7 ms, System: 14.3 ms]
>   Range (min … max):   141.4 ms … 162.2 ms    18 runs
>
> Summary
>   'baseline' ran
>     1.30 ± 0.06 times faster than 'test'
>
> The overhead still exists for a more recent pair of versions (v5.0 and
> v5.1):
>
> Benchmark #1: baseline
>   Time (mean ± σ):      25.1 ms ±   3.2 ms    [User: 18.6 ms, System: 6.5 ms]
>   Range (min … max):    19.0 ms …  32.8 ms    99 runs
>
> Benchmark #2: test
>   Time (mean ± σ):      33.3 ms ±   3.3 ms    [User: 26.5 ms, System: 6.9 ms]
>   Range (min … max):    27.0 ms …  38.4 ms    105 runs
>
> Summary
>   'baseline' ran
>     1.33 ± 0.22 times faster than 'test'
>
> I still think this overhead is worth it. In case not everyone agrees,
> it _might_ be worth a command-line option to skip the GDAT chunk. That
> also prevents an ability to eventually wean entirely off generation
> number v1 and allow the commit date to take the full 64-bit column
> (instead of only 34 bits, saving 30 for topo-levels).

Thank you for the detailed benchmarking and discussion. I don't think
there is any disagreement on the utility of corrected commit dates so
far.

We will not run out of 34 bits for the commit date until the year 2514,
so I am not particularly worried about weaning off generation number v1
anytime soon.

> Again, such a modification should not be considered required for this
> series.
>
> > ----------------------------------------------------------------------------
> >
> > Improvements left for a future series:
> >
> >  * Save commits with generation data overflow and extra edge commits
> >    instead of looping over all commits. cf. 858sbel67n.fsf@xxxxxxxxx
> >  * Verify both topological levels and corrected commit dates when
> >    present. cf.
> >    85pn4tnk8u.fsf@xxxxxxxxx
>
> These seem like reasonable things to delay for a later series or for
> #leftoverbits
>
> Thanks,
> -Stolee

Thanks
- Abhishek