Re: [RFC] Possible idea for GSoC 2020

Jakub Narebski <jnareb@xxxxxxxxx> · Thu, 19 Mar 2020 13:50:00 +0100

Abhishek Kumar <abhishekkumar8222@xxxxxxxxx> writes:

> Greetings Jakub
>
>> So perhaps we should expand "Commit graph labeling for speeding up git
>> commands" idea, splitting it into two possible ways of working in this
>> project: the more technical 'Generation number v2', and 'Interval labels
>> for commit graph' which is more of research project?  Which should be
>> put first, then?
>
> I would suggest working on generation number v2 first because:
> - We ship improved performance *sooner*.
> - I find it easier to shift from simpler to more complex systems.

It would be also more of an engineering project, than research one.

> On a personal note, I would be able to do a better job at working on generation
> number v2 than on interval labels. While reading through the papers mentioned
> was fun and full of a-ha moments, I find building things more
> fulfilling. I would be glad if either you or Derrick opt for mentoring it.

All right.  I have added 'Generation number v2' as possible way of
working on the "Commit graph labeling for speeding up git commands"
idea.

https://git.github.io/SoC-2020-Ideas/#commit-graph-labeling-for-speeding-up-git-commands

> You could read my proposal at the link below. It is very rough and I haven't
> proofread it yet. I will send out a more formal proposal once the direction
> of this project is decided.
>
> https://github.com/abhishekkumar2718/GSoC20/blob/master/graph_labelling.md
>
> **Too long, didn't read**

I will comment only on the TLDR version below.

> - Commit graphs are small to medium-sized (compared to problem sizes observed in
> graph-theory literature) sparse graphs. They have unusual properties compared
> to more conventional graphs that can be exploited for better performance.

Right.

> - Most of git's reachability queries are negative and using a negative-cut
> filter improves performance more than a positive-cut filter.

Not exactly.  Most time consuming git graph queries are those that
return a subgraph or otherwise need to walk the commit graph (like
topological sorting, or ahead/behind numbers).  Here in most cases
negative-cut filters, that reduce the number of commits walked, improve
performance more (with rare exceptions, like --ancestry-path query).

That is what I think, but it is not something I have proven...

> - Implementing and maintaining a two-dimensional reachability index is hard
> and does not offer justifiable performance gains.

I think it would be better to not mention graph reachability algorithms
at all, and just talk about generation number v2 as straight performance
improvement over generation number v1.

> - We plan to use corrected commit date as the generation number v2 because it is
> locally computable, immutable and can be incrementally updated.

Unfortunately because of an oopsie with respect to commit-graph file
format versioning, at least for the time being generation number v2
needs to be also backward compatibile with generation number v1.  See
for example:
  https://git.github.io/rev_news/2019/06/28/edition-52/#reviews
  https://speakerdeck.com/jnareb/graph-operations-in-git-and-how-to-make-them-faster?slide=74

>
> - If git ever considers a two-dimensional reachability index, either post order
> DFS, GRAIL or an index based on commit date would be good starting
> places to explore.
>
> I go in more details about GRAIL, FERRARI and PReaCH, explaining
> briefly how they work, their advantages, and disadvantages.

Again, I don't think that is necessary.

Best,
-- 
Jakub Narębski