On Mon, Jul 13, 2020 at 10:00:03PM +0200, Jakub Narębski wrote:
> Abhishek Kumar <abhishekkumar8222@xxxxxxxxx> writes:
>
> > Hello everyone!
> >
> > Over the last two weeks, I have worked on refining the performance
> > report on generation numbers. Here are our conclusions:
> >
> > - Corrected Commit Date with Monotonic Offset (i.e. generation
> >   number v5) performs better than topological levels but still walks
> >   too many commits when compared with Corrected Commit Date.
>
> Thank you for your work examining different approaches to introducing
> generation number v2.
>
> > Number of commits walked (git merge-base v4.8 v4.9, on the linux
> > repository):
> >
> >   Topological Level                          : 635579
> >   Corrected Commit Date                      : 167468
> >   Corrected Commit Date with Monotonic Offset: 506577
>
> It is a bit strange that requiring monotonic offsets leads to such a
> large difference in performance (in commits walked).
>
> > As such, I am expecting that we will store Corrected Commit Dates in
> > an additional chunk (called the "generation data chunk") and store
> > topological levels in CDAT. Thus, old Git clients can operate as
> > expected, while new Git clients use the better generation number.
> >
> > - Using a new chunk does affect locality of reference but did not
> >   impact performance appreciably.
> > - It does increase the size of the commit-graph file by nearly 5%.
>
> All right, it seems like it is the way to go.
>
> > You can read more in my report [1] and the pull request with
> > instructions to replicate the results [2].
> >
> > [1]: https://lore.kernel.org/git/20200703082842.GA28027@Abhishek-Arch/T/#mda33f6e13873df55901768e8fd6d774282002146
> > [2]: https://github.com/abhishekkumar2718/git/pull/1
> >
> > I also talk a bit more about a patch I worked on that tried to
> > improve the performance of commit-graph writes using buffers; it
> > ultimately did not work and has been dropped. Up next is actually
> > implementing the generation number and taking care of all the little
> > details.
> >
> > https://abhishekkumar2718.github.io/programming/2020/07/05/gsoc-weeks-4-5.html
> >
> > Feedback and suggestions welcome!
>
> Some comments about the blog entry contents:
>
> AK> Dr. Stolee pointed out ... [to] use the number of commits as a
> AK> metric instead of wall clock timing (which can be influenced by
> AK> other factors like CPU usage at the time).
>
> There are a few factors. If we compare similar algorithms, that might
> be a good decision.
>
> First, one can try to reduce the influence of random factors on the
> wall clock timing by using statistics. For example, one can try to
> detect and remove outliers using robust statistical measures, as done
> by tools such as Dumbbench [3], hyperfine [4], or bench [5]. After
> warmup, one approach is to compute a robust estimate of the value,
> e.g. the median, and a robust estimate of dispersion, e.g. MAD (median
> absolute deviation), and use those to detect outliers: rescale the MAD
> and mark as outliers (and remove) the entries that are more than
> "three sigma" of the robust dispersion away from the robust estimate
> of the value. Dumbbench [3] has a good explanation.
>
> [3]: https://metacpan.org/pod/Dumbbench#HOW-IT-WORKS-AND-WHY-IT-DOESN'T
> [4]: https://github.com/sharkdp/hyperfine
> [5]: https://github.com/Gabriel439/bench

That's interesting. When you think about it, medians are a better
measure than averages because they are robust to outliers.
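To make that concrete for myself, here is a rough sketch of the
MAD-based filtering described above. It is a hypothetical helper of my
own, not code from Dumbbench, hyperfine, or bench; the 1.4826 factor is
the usual constant that rescales MAD to estimate the standard deviation
for normally distributed data:

    from statistics import median

    def filter_outliers(timings, cutoff=3.0):
        """Drop timings more than `cutoff` robust sigmas from the median."""
        m = median(timings)
        # MAD = median absolute deviation; 1.4826 * MAD approximates the
        # standard deviation when the data is roughly normal.
        mad = median(abs(t - m) for t in timings)
        sigma = 1.4826 * mad
        if sigma == 0:
            return list(timings)
        return [t for t in timings if abs(t - m) <= cutoff * sigma]

    # e.g. wall-clock times (seconds) of repeated `git merge-base v4.8 v4.9`
    runs = [0.91, 0.89, 0.90, 0.92, 1.74, 0.90]   # 1.74s run hit a busy CPU
    print(filter_outliers(runs))                   # keeps only the ~0.9s runs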
>
> Second, because of peculiarities of current processor architectures
> (caches, data prefetching, branch prediction), performing more
> operations might, in admittedly rare cases, be faster than doing fewer
> operations. One such example can be found in the CppCon 2019 talk by
> Andrei Alexandrescu, "Speed Is Found In The Minds of People" [6][7],
> about 'small sort', where doing more operations results in, on
> average, a faster sort. This of course has a chance of happening only
> if the difference in the number of operations is small enough...
> nevertheless it might be a good idea to at least check that the wall
> clock time agrees with the conclusions drawn from the number of
> commits walked, for at least a few examples.
>
> [6]: https://www.youtube.com/watch?v=FJJTYQYB1JQ
> [7]: https://github.com/CppCon/CppCon2019/blob/master/Presentations/speed_is_found_in_the_minds_of_people/speed_is_found_in_the_minds_of_people__andrei_alexandrescu__cppcon_2019.pdf
>
> AK> With the second report, storing corrected commit date in GDAT as
> AK> well as computing topological levels seems like a no-brainer. I
> AK> have started working on the patch and will push to the mailing
> AK> list after some discussion on the report.
>
> Do you have any numbers on how much providing backward compatibility
> costs at `git commit-graph write`, that is, how much more time it
> takes to compute topological levels during the computation of
> corrected committer dates compared to storing GENERATION_NUMBER_MAX in
> place of the topological level? And does having the topological level
> (as a tie-breaker) help Git performance when using the commit-graph
> for querying? Does having topological levels as a tie-breaker or
> secondary negative-cut reachability index help at all?

We do have timings comparing computing topological levels with storing
GENERATION_NUMBER_MAX in their place [1]:

  Writing GENERATION_NUMBER_MAX to commit-graph: 14.175s
  Writing topological levels to commit-graph   : 14.331s

So writing GENERATION_NUMBER_MAX is around 160ms, or about 1%, faster.

I do think there's a case to be made for GENERATION_NUMBER_MAX, because
the performance degradation for old Git clients would help drive faster
adoption of newer versions (Junio was in favor of this the last time we
discussed alternatives [2]). It is a double-edged sword, though, as we
would be forcing people who cannot upgrade Git into worse performance.

I do not have anything yet on using topological level as a tie-breaker.
I will benchmark and get back to you (a toy sketch of how the
negative-cut works is at the end of this mail).

[1]: https://lore.kernel.org/git/20200630150056.GA4111@Abhishek-Arch/
[2]: https://lore.kernel.org/git/xmqq8sjp1mnz.fsf@xxxxxxxxxxxxxxxxxxxxxx/

> Thank you for your work and for the report.
>
> P.S. Would it be possible to put GSoC entries into a separate 'GSoC'
> category instead of the generic 'Programming' one, or add a 'GSoC'
> tag?

Great idea! Try this out: https://abhishekkumar2718.github.io/gsoc/

> Best,
> --
> Jakub Narębski
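P.S. For anyone following the generation number discussion, here is the
toy sketch mentioned above (Python, with names of my own choosing, not
Git's implementation) of the corrected commit date definition and of
the negative-cut it gives during a reachability walk:

    from collections import deque

    def corrected_date(commit, parents, committer_date, memo):
        # cdate(C) = max(committer date of C, 1 + max of parents' corrected dates)
        if commit in memo:
            return memo[commit]
        date = committer_date[commit]
        for parent in parents[commit]:
            date = max(date, corrected_date(parent, parents, committer_date, memo) + 1)
        memo[commit] = date
        return date

    def can_reach(start, target, parents, cdate):
        # Walk the ancestors of `start`, skipping commits whose corrected
        # date is lower than cdate[target]: they can never reach `target`.
        queue, seen = deque([start]), {start}
        while queue:
            commit = queue.popleft()
            if commit == target:
                return True
            for parent in parents[commit]:
                if parent not in seen and cdate[parent] >= cdate[target]:
                    seen.add(parent)
                    queue.append(parent)
        return False

    # Toy history (edges point to parents):  C -> B -> A,  D -> A
    parents = {'A': [], 'B': ['A'], 'C': ['B'], 'D': ['A']}
    dates = {'A': 100, 'B': 90, 'C': 200, 'D': 150}   # B's clock is skewed
    memo = {}
    cdate = {c: corrected_date(c, parents, dates, memo) for c in parents}
    # cdate == {'A': 100, 'B': 101, 'C': 200, 'D': 150}: monotone along parent edges
    print(can_reach('C', 'A', parents, cdate))   # True
    print(can_reach('D', 'B', parents, cdate))   # False, and A is never visited

In the toy history the walk from D never visits A, because A's corrected
date is below B's; that pruning is exactly what the "number of commits
walked" figures above are measuring.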