On Mon, Jul 13, 2020 at 10:00:03PM +0200, Jakub Narębski wrote:
> Abhishek Kumar <abhishekkumar8222@xxxxxxxxx> writes:
>
> > Hello everyone!
> >
> > Over the last two weeks, I have worked on refining the performance
> > report on generation numbers. Here are our conclusions:
> >
> > - Corrected Commit Date with Monotonic Offset (i.e. generation
> >   number v5) performs better than topological levels but still walks
> >   too many commits when compared with Corrected Commit Date.
>
> Thank you for your work examining different approaches to introducing
> generation number v2.
>
> > Number of commits walked (git merge-base v4.8 v4.9, on the linux
> > repository):
> >
> >   Topological Level                          : 635579
> >   Corrected Commit Date                      : 167468
> >   Corrected Commit Date with Monotonic Offset: 506577
>
> It is a bit strange that requiring monotonic offsets leads to such a
> large difference in performance (in commits walked).
>
> > As such, I am expecting that we will store Corrected Commit Dates in
> > an additional chunk (called the "generation data chunk") and store
> > topological levels in CDAT. Thus, old Git clients can operate as
> > expected, while new Git clients use the better generation number.
> >
> > - Using a new chunk does affect locality of reference but did not
> >   impact performance appreciably.
> > - It does increase the size of the commit-graph file by nearly 5%.
>
> All right, it seems like it is the way to go.
>
> > You can read more in my report [1] and the pull request with
> > instructions to replicate the results [2].
> >
> > [1]: https://lore.kernel.org/git/20200703082842.GA28027@Abhishek-Arch/T/#mda33f6e13873df55901768e8fd6d774282002146
> > [2]: https://github.com/abhishekkumar2718/git/pull/1
> >
> > I also talk a bit more about a patch I worked on that tried to
> > improve the performance of commit-graph writes using buffers; it
> > ultimately did not work and has been dropped. Up next is actually
> > implementing the generation number and taking care of all the little
> > details.
> >
> > https://abhishekkumar2718.github.io/programming/2020/07/05/gsoc-weeks-4-5.html
> >
> > Feedback and suggestions welcome!
>
> Some comments about the blog entry contents:
>
> AK> Dr. Stolee pointed out ... [to] use the number of commits as a
> AK> metric instead of wall clock timing (which can be influenced by
> AK> other factors like CPU usage at the time).
>
> There are a few factors. If we compare similar algorithms, that might
> be a good decision.
>
> First, one can try to reduce the influence of random factors on the
> wall clock timing by using statistics. For example, one can try to
> detect and remove outliers using robust statistical measures, as done
> by tools such as Dumbbench [3], hyperfine [4], or bench [5]. After
> warmup, one approach is to compute a robust estimate of the value,
> e.g. the median, and a robust estimate of dispersion, e.g. MAD (median
> absolute deviation), and use those to detect outliers: rescale the MAD
> and mark as outliers (and remove) the entries that are more than
> "three sigma" of the robust dispersion away from the robust estimate
> of the value. Dumbbench [3] has a good explanation.
>
> [3]: https://metacpan.org/pod/Dumbbench#HOW-IT-WORKS-AND-WHY-IT-DOESN'T
> [4]: https://github.com/sharkdp/hyperfine
> [5]: https://github.com/Gabriel439/bench

That's interesting. When you think about it, medians are a better
measure than averages because they are robust to outliers.
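To make that concrete for myself, here is a rough sketch of the
MAD-based filtering described above. It is a hypothetical helper of my
own, not code from Dumbbench, hyperfine, or bench; the 1.4826 factor is
the usual constant that rescales MAD to estimate the standard deviation
for normally distributed data:

    from statistics import median

    def filter_outliers(timings, cutoff=3.0):
        """Drop timings more than `cutoff` robust sigmas from the median."""
        m = median(timings)
        # MAD = median absolute deviation; 1.4826 * MAD approximates the
        # standard deviation when the data is roughly normal.
        mad = median(abs(t - m) for t in timings)
        sigma = 1.4826 * mad
        if sigma == 0:
            return list(timings)
        return [t for t in timings if abs(t - m) <= cutoff * sigma]

    # e.g. wall-clock times (seconds) of repeated `git merge-base v4.8 v4.9`
    runs = [0.91, 0.89, 0.90, 0.92, 1.74, 0.90]   # 1.74s run hit a busy CPU
    print(filter_outliers(runs))                   # keeps only the ~0.9s runs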
>
> Second, because of peculiarities of current processor architectures
> (caches, data prefetching, branch prediction), performing more
> operations might, in admittedly rare cases, be faster than doing fewer
> operations. One such example can be found in the CppCon 2019 talk by
> Andrei Alexandrescu, "Speed Is Found In The Minds of People" [6][7],
> about 'small sort', where doing more operations results in, on
> average, a faster sort. This of course has a chance of happening only
> if the difference in the number of operations is small enough...
> nevertheless it might be a good idea to at least check that the wall
> clock time agrees with the conclusions drawn from the number of
> commits walked, for at least a few examples.
>
> [6]: https://www.youtube.com/watch?v=FJJTYQYB1JQ
> [7]: https://github.com/CppCon/CppCon2019/blob/master/Presentations/speed_is_found_in_the_minds_of_people/speed_is_found_in_the_minds_of_people__andrei_alexandrescu__cppcon_2019.pdf
>
> AK> With the second report, storing corrected commit date in GDAT as
> AK> well as computing topological levels seems like a no-brainer. I
> AK> have started working on the patch and will push to the mailing
> AK> list after some discussion on the report.
>
> Do you have any numbers on how much providing backward compatibility
> costs at `git commit-graph write`, that is, how much more time it
> takes to compute topological levels during the computation of
> corrected committer dates compared to storing GENERATION_NUMBER_MAX in
> place of the topological level? And does having the topological level
> (as a tie-breaker) help Git performance when using the commit-graph
> for querying? Does having topological levels as a tie-breaker or
> secondary negative-cut reachability index help at all?

We do have timings comparing computing topological levels with storing
GENERATION_NUMBER_MAX in their place [1]:

  Writing GENERATION_NUMBER_MAX to commit-graph: 14.175s
  Writing topological levels to commit-graph   : 14.331s

So writing GENERATION_NUMBER_MAX is around 160ms, or about 1%, faster.

I do think there's a case to be made for GENERATION_NUMBER_MAX, because
the performance degradation for old Git clients would help drive faster
adoption of newer versions (Junio was in favor of this the last time we
discussed alternatives [2]). It is a double-edged sword, though, as we
would be forcing people who cannot upgrade Git into worse performance.

I do not have anything yet on using topological level as a tie-breaker.
I will benchmark and get back to you (a toy sketch of how the
negative-cut works is at the end of this mail).

[1]: https://lore.kernel.org/git/20200630150056.GA4111@Abhishek-Arch/
[2]: https://lore.kernel.org/git/xmqq8sjp1mnz.fsf@xxxxxxxxxxxxxxxxxxxxxx/

> Thank you for your work and for the report.
>
> P.S. Would it be possible to put GSoC entries into a separate 'GSoC'
> category instead of the generic 'Programming' one, or add a 'GSoC'
> tag?

Great idea! Try this out: https://abhishekkumar2718.github.io/gsoc/

> Best,
> --
> Jakub Narębski
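P.S. For anyone following the generation number discussion, here is the
toy sketch mentioned above (Python, with names of my own choosing, not
Git's implementation) of the corrected commit date definition and of
the negative-cut it gives during a reachability walk:

    from collections import deque

    def corrected_date(commit, parents, committer_date, memo):
        # cdate(C) = max(committer date of C, 1 + max of parents' corrected dates)
        if commit in memo:
            return memo[commit]
        date = committer_date[commit]
        for parent in parents[commit]:
            date = max(date, corrected_date(parent, parents, committer_date, memo) + 1)
        memo[commit] = date
        return date

    def can_reach(start, target, parents, cdate):
        # Walk the ancestors of `start`, skipping commits whose corrected
        # date is lower than cdate[target]: they can never reach `target`.
        queue, seen = deque([start]), {start}
        while queue:
            commit = queue.popleft()
            if commit == target:
                return True
            for parent in parents[commit]:
                if parent not in seen and cdate[parent] >= cdate[target]:
                    seen.add(parent)
                    queue.append(parent)
        return False

    # Toy history (edges point to parents):  C -> B -> A,  D -> A
    parents = {'A': [], 'B': ['A'], 'C': ['B'], 'D': ['A']}
    dates = {'A': 100, 'B': 90, 'C': 200, 'D': 150}   # B's clock is skewed
    memo = {}
    cdate = {c: corrected_date(c, parents, dates, memo) for c in parents}
    # cdate == {'A': 100, 'B': 101, 'C': 200, 'D': 150}: monotone along parent edges
    print(can_reach('C', 'A', parents, cdate))   # True
    print(can_reach('D', 'B', parents, cdate))   # False, and A is never visited

In the toy history the walk from D never visits A, because A's corrected
date is below B's; that pruning is exactly what the "number of commits
walked" figures above are measuring.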