Re: [RFC] Possible idea for GSoC 2020

Jakub Narebski <jnareb@xxxxxxxxx> · Tue, 17 Mar 2020 19:05:37 +0100

Abhishek Kumar <abhishekkumar8222@xxxxxxxxx> writes:

>> Having such a complicated two-dimensional system would need to
>> justify itself by being measurably faster than that one-dimensional
>> system in these example commands.
>>
>> [...]
>>
>> My _prediction_ is that the two-dimensional system will be more
>> complicated to write and use, and will not have any measurable
>> difference. I'd be happy to be wrong, but I also would not send
>> anyone down this direction only to find out I'm right and that
>> effort was wasted.
>
> Agreed. I have been through the papers of the involved variants and on graphs
> comparable to some of the largest git repositories, the performance improves by
> fifty nanoseconds for a random query.

I would recommend extending results for other types of large graphs to
the commit graphs with care.  The characteristics of those graphs are
quite different from characteristics of commit graph: they usually are
scale-free graphs, with low maximum level, and low connectivity: the
probability of two random nodes being connected in order of 10^-3 or
10^-4; see e.g. https://speakerdeck.com/jnareb/graph-operations-in-git-and-how-to-make-them-faster?slide=99

The last one, called R-ratio, means that testing on random query
actually tests mainly negative-cut filters.  That is why some papers
provide either separate numbers for negative and for positive queries,
or separate numbers for random and for balanced queries.

> Additionally:
> 1. They require significantly more space per commit.

This depends on the type of the algorithm: is it Label-Only (answering
reachability queries without consulting graph), or Label+Graph
(augmented online search algorithms):
https://speakerdeck.com/jnareb/graph-operations-in-git-and-how-to-make-them-faster?slide=78

The CDAT chunk in current version of commit-graph format takes H+16
bytes per commit (where H is the size of object id hash).  From those
H+16 bytes 30 bits (slightly less that 4 bytes) are used for current
reachability label: the topological level aka generation number.
https://speakerdeck.com/jnareb/graph-operations-in-git-and-how-to-make-them-faster?slide=45

The proposed min-post interval label would take 8 bytes per commit, that
is 4 bytes per single number in interval.  That is not much, provided
that we get visible performance improvements for at least some often
used git commands.

> 2. They require significantly more preprocessing time.

This again depends on the type of algorithm: Label-Only or Label+G.

In the case of min-post interval labels, they can be computed together
with generation number, during the same commit-graph walk.  The amount
of calculations required to compute min-post interval is not much.
Therefore I think it would be not unreasonable cost.

>> My recommendation is that a GSoC student update the
>> generation number to "v2" based on the definition you made in [1].
>> That proposal is also more likely to be effective in Git because
>> it makes use of extra heuristic information (commit date) to
>> assist the types of algorithms we care about.
>
>> In that case, the "difficult" part is moving the "generation"
>> member of struct commit into a slab before making it a 64-bit
>> value. (This is likely necessary for your plan, anyway.) Updating
>> the generation number to v2 is relatively straight-forward after
>> that, as someone can follow all places that reference or compute
>> generation numbers and apply a diff
>
> Thanks for the recommendation. Reading about how this fits in more
> with REU on the other thread, I too agree that updating generation
> number to use corrected commit date would be more appropriate for a GSoC
> project.

You can read about monotonically offset corrected commit date for
example on my slides, from page 64 (the problem) to 75 (references):
https://speakerdeck.com/jnareb/graph-operations-in-git-and-how-to-make-them-faster?slide=64

I'll try to update the proposal on SoC 2020 Ideas page soon, perhaps
tomorrow when I will have more free time.

https://git.github.io/SoC-2020-Ideas/

In the meantime I will refer to what is written at the top of it:

> **Students:** Please consider these ideas as starting points for
> generating proposals. We are also more than happy to receive proposals
> for other ideas related to Git. Make sure you have read the “Note
> about refactoring projects versus projects that implement new
> features” in the general application information page though.

Best,
-- 
Jakub Narębski