Re: [RFC] Possible idea for GSoC 2020

Derrick Stolee <stolee@xxxxxxxxx> · Mon, 16 Mar 2020 08:44:54 -0400

On 3/10/2020 10:50 AM, Jakub Narebski wrote:
> Hello,
> 
> Here below is a possible proposal for a more difficult Google Summer of
> Code 2020 project.
> 
> A few questions:
> - is it too late to propose a new project idea for GSoC 2020?
> - is it too difficult of a project for GSoC?
> 
> Best,
> 
>   Jakub Narębski
> 
> --------------------------------------------------
> 
> ### Graph labelling for speeding up git commands
> 
>  - Language: C
>  - Difficulty: hard / difficult
>  - Possible mentors: Jakub Narębski
> 
> Git uses various clever methods for making operations on very large
> repositories faster, from bitmap indices for git-fetch[1], to generation
> numbers (also known as topological levels) in the commit-graph file for
> commit graph traversal operations like `git log --graph`[2].
> 
> One possible improvement that can make Git even faster is using min-post
> intervals labelling.  The basis of this labelling is post-visit order of
> a depth-first search traversal tree of a commit graph, let's call it
> 'post(v)'.
> 
> If for each commit 'v' we would compute and store in the commit-graph
> file two numbers: 'post(v)' and the minimum of 'post(u)' for all commits
> reachable from 'v', let's call the latter 'min_graph(v)', then the
> following condition is true:
> 
>   if 'v' can reach 'u', then min_graph(v) <= post(u) <= post(v)

I haven't thought too hard about it, but I'm assuming that if v is not
in a commit-graph file, then post(v) would be "infinite" and min_graph(v)
would be zero.

We already have the second inequality (f(u) <= f(v)) where the function
'f' is the generation number. The success of this approach over
generation numbers relies entirely on how often the inequality
min_graph(v) <= post(u) fails when gen(u) <= gen(v) holds, since only
those cases let us skip a walk that generation numbers would not.
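
To make that comparison concrete, here is a minimal Python sketch on a
toy graph. Everything in it is an assumption for illustration: the
commit names, the walk order in TIPS, and the helper names are made up,
and it is not how Git would store or compute these labels. It lists the
pairs where gen(u) <= gen(v) holds but min_graph(v) > post(u), that is,
the cases where the new label avoids a walk.

```python
# Toy commit graph: commit -> parents. Commit names are made up.
# Two root commits (R1, R2) are used so that min_graph is not trivial.
PARENTS = {
    "R1": [], "R2": [],
    "A": ["R1"], "B": ["R2"],
    "M": ["A", "B"],   # merge of both histories
    "C": ["B"],
}
TIPS = ["M", "C"]      # assumed walk order; other orders give other labels


def label(parents, tips):
    """One depth-first walk; returns (post, min_graph) dictionaries."""
    post, min_graph, seen = {}, {}, set()

    def visit(v):
        seen.add(v)
        for p in parents[v]:
            if p not in seen:
                visit(p)
        post[v] = len(post) + 1
        # The graph is acyclic, so every parent is already finished here.
        min_graph[v] = min([post[v]] + [min_graph[p] for p in parents[v]])

    for t in tips:
        if t not in seen:
            visit(t)
    return post, min_graph


def generation(parents):
    """Topological level: 1 for roots, else 1 + max over parents."""
    gen = {}

    def level(v):
        if v not in gen:
            gen[v] = 1 + max((level(p) for p in parents[v]), default=0)
        return gen[v]

    for v in parents:
        level(v)
    return gen


def reaches(parents, v, u):
    """Ground truth: plain walk from v looking for u."""
    stack, seen = [v], set()
    while stack:
        c = stack.pop()
        if c == u:
            return True
        if c not in seen:
            seen.add(c)
            stack.extend(parents[c])
    return False


POST, MIN_GRAPH = label(PARENTS, TIPS)
GEN = generation(PARENTS)

# Pairs that generation numbers cannot rule out but min_graph can.
extra_cuts = [(v, u) for v in PARENTS for u in PARENTS
              if v != u and GEN[u] <= GEN[v] and MIN_GRAPH[v] > POST[u]]
print(extra_cuts)
```

In this toy graph the pair (C, M) still passes both filters even though
C cannot reach M, so a walk is still needed there. The two roots are
deliberate: with a single root commit, that root is finished first and
is reachable from every commit, so min_graph(v) would be the same value
for all v and the first inequality could never fail.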

> If for each commit 'v' we would compute and store in the commit-graph
> file two numbers: 'post(v)' and the minimum of 'post(u)' for commits
> that were visited during the part of depth-first search that started
> from 'v' (which is the minimum of post-order number for subtree of a
> spanning tree that starts at 'v').  Let's call the latter 'min_tree(v)'.
> Then the following condition is true:
> 
>   if min_tree(v) <= post(u) <= post(v), then 'v' can reach 'u'

How many places in Git do we ask "can v reach u?" and how many would
return immediately without needing a walk in this new approach? My
guess is that we will have a very narrow window where this query
returns a positive result.
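
As a rough illustration of that window, the following Python sketch
computes post(v) and min_tree(v) on a toy graph and separates the
reachable pairs the positive test answers immediately from those that
still need a walk. The graph, the walk order, and all names are
assumptions made up for this example; a six-commit graph says nothing
about how wide the window is on real histories.

```python
# Toy commit graph: commit -> parents. Commit names are made up.
PARENTS = {
    "R1": [], "R2": [],
    "A": ["R1"], "B": ["R2"],
    "M": ["A", "B"],
    "C": ["B"],
}
TIPS = ["M", "C"]      # assumed walk order


def label(parents, tips):
    """One depth-first walk; returns (post, min_tree) dictionaries."""
    post, min_tree, seen = {}, {}, set()

    def visit(v):
        seen.add(v)
        # Post number that the first commit finished inside v's DFS
        # subtree will receive (v itself if nothing new is below it).
        first = len(post) + 1
        for p in parents[v]:
            if p not in seen:
                visit(p)
        post[v] = len(post) + 1
        min_tree[v] = first

    for t in tips:
        if t not in seen:
            visit(t)
    return post, min_tree


def reaches(parents, v, u):
    """Ground truth: plain walk from v looking for u."""
    stack, seen = [v], set()
    while stack:
        c = stack.pop()
        if c == u:
            return True
        if c not in seen:
            seen.add(c)
            stack.extend(parents[c])
    return False


POST, MIN_TREE = label(PARENTS, TIPS)


def surely_reaches(v, u):
    """Positive test: True means 'v' can reach 'u' without walking."""
    return MIN_TREE[v] <= POST[u] <= POST[v]


pairs = [(v, u) for v in PARENTS for u in PARENTS if v != u]
answered = [(v, u) for v, u in pairs if surely_reaches(v, u)]
missed = [(v, u) for v, u in pairs
          if reaches(PARENTS, v, u) and not surely_reaches(v, u)]
print("answered without a walk:", answered)
print("reachable, but walk still needed:", missed)
```

Here C is visited after B has already been claimed by M's subtree, so
min_tree(C) == post(C) and the test says nothing about C reaching B or
R2. Any commit whose ancestors were first discovered from another tip
ends up in the same situation.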

I believe we discussed this concept briefly when planning "generation
number v2" and the main concern I have with this plan is that the
values are not stable. The values of post(v) and min_tree(v) depend
on the entire graph as a whole, not just on what is reachable from v
(ideally a label would depend only on the parents of v).

Before starting to implement this, I would consider how such labels
could be computed across incremental commit-graph boundaries. That is,
if I'm only adding a layer of commits to the commit-graph without
modifying the existing layers of the commit-graph chain, can I still
compute values with these properties? How expensive is it? Do I need
to walk the entire reachable set of commits?
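
A small Python sketch of the stability concern, under stated
assumptions: the graph, the commit names, and the choice to start the
walk from the newest tip are all made up for illustration. It relabels
a graph after one commit is added and compares the result with
generation numbers (topological levels) for the same commits.

```python
def post_order(parents, tips):
    """Post-visit numbers from one depth-first walk starting at tips."""
    post, seen = {}, set()

    def visit(v):
        seen.add(v)
        for p in parents[v]:
            if p not in seen:
                visit(p)
        post[v] = len(post) + 1

    for t in tips:
        if t not in seen:
            visit(t)
    return post


def generation(parents):
    """Topological level: 1 for roots, else 1 + max over parents."""
    gen = {}

    def level(v):
        if v not in gen:
            gen[v] = 1 + max((level(p) for p in parents[v]), default=0)
        return gen[v]

    for v in parents:
        level(v)
    return gen


# Existing layer: two branches on top of one root (made-up names).
OLD = {"R": [], "A": ["R"], "B": ["R"]}
# New layer: a single commit N on top of B.
NEW = {**OLD, "N": ["B"]}

old_post = post_order(OLD, ["A", "B"])
# Assumption: a fresh walk starts from the newest tip.
new_post = post_order(NEW, ["N", "A"])

changed = sorted(c for c in OLD if old_post[c] != new_post[c])
print("post() changed for:", changed)

old_gen, new_gen = generation(OLD), generation(NEW)
print("generation unchanged:", all(old_gen[c] == new_gen[c] for c in OLD))
```

The generation of a commit is a function of its ancestors only, so it
cannot change when a layer is added on top. The post-visit number also
depends on which tip the walk happened to start from, so whether a walk
order exists that keeps the old layers' labels intact and still gives
useful values for the new layer is the question to settle first.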

> The task would be to implement computing such labelling (or a more
> involved variant of it[3][4]), storing it in commit-graph file, and
> using it for speeding up git commands (starting from a single chosen
> command) such as:
> 
>  - git merge-base --is-ancestor A B
>  - git branch --contains A
>  - git tag --contains A
>  - git branch --merged A
>  - git tag --merged A
>  - git merge-base --all A B
>  - git log --topo-order

Having such a complicated two-dimensional system would need to
justify itself by being measurably faster than the existing
one-dimensional system (generation numbers) in these example commands.

The point of generation number v2 [1] was to allow moving to "exact"
algorithms for things like merge-base where we still use commit time
as a heuristic, and could be wrong because of special data shapes.
We don't use generation numbers in these examples because relying on
them alone can lead to a large increase in the number of commits
walked. The example we saw in the Linux kernel repository was a bug
fix created on top of a very old commit, so there was a commit of
low generation with very high commit-date that caused extra walking.
(See [2] for a detailed description of the data shape.)

My _prediction_ is that the two-dimensional system will be more
complicated to write and use, and will not have any measurable
difference. I'd be happy to be wrong, but I also would not send
anyone down this direction only to find out I'm right and that
effort was wasted.

My recommendation is that a GSoC student update the
generation number to "v2" based on the definition you made in [1].
That proposal is also more likely to be effective in Git because
it makes use of extra heuristic information (commit date) to
assist the types of algorithms we care about.
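
For illustration only, here is a Python sketch of the data shape
described above. It assumes that "v2" means a corrected commit date,
max(commit date, 1 + largest corrected date among the parents), which
is how Git's commit-graph documentation later describes it; the
definition in [1] may differ in detail. Commit names and dates are
made up.

```python
from functools import lru_cache

# name: (commit date, parents). Dates are arbitrary integers.
COMMITS = {
    "old":  (100, []),
    "m1":   (200, ["old"]),
    "m2":   (300, ["m1"]),
    "m3":   (400, ["m2"]),
    "fix":  (450, ["old"]),   # recent bug fix on top of a very old commit
    "skew": (50,  ["m3"]),    # clock skew: dated before its own parent
}


@lru_cache(maxsize=None)
def topo_level(name):
    """Generation number v1: 1 for roots, else 1 + max over parents."""
    _, parents = COMMITS[name]
    return 1 + max((topo_level(p) for p in parents), default=0)


@lru_cache(maxsize=None)
def corrected_date(name):
    """Assumed v2: the commit date, bumped above every parent's value."""
    date, parents = COMMITS[name]
    return max([date] + [corrected_date(p) + 1 for p in parents])


LEVEL = {c: topo_level(c) for c in COMMITS}
CDATE = {c: corrected_date(c) for c in COMMITS}

for c in COMMITS:
    print(f"{c:5} date={COMMITS[c][0]:4} level={LEVEL[c]} corrected={CDATE[c]}")
```

In this sketch "fix" has a low level (2) but the highest corrected date
(450), so a queue ordered by level would reach it only after walking
most of the m1..m3 chain, while a queue ordered by corrected date sees
it first. The "skew" commit shows the other half: its raw date is
earlier than its parent's, but the corrected value still increases
along every parent link, which is the property the exact algorithms
need.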

In that case, the "difficult" part is moving the "generation"
member of struct commit into a slab before making it a 64-bit
value. (This is likely necessary for your plan, anyway.) Updating
the generation number to v2 is relatively straightforward after
that, as someone can follow all places that reference or compute
generation numbers and apply a diff.

Thanks,
-Stolee

[1] https://lore.kernel.org/git/86o8ziatb2.fsf_-_@xxxxxxxxx/
    [RFC/PATCH] commit-graph: generation v5 (backward compatible date ceiling)

[2] https://lore.kernel.org/git/efa3720fb40638e5d61c6130b55e3348d8e4339e.1535633886.git.gitgitgadget@xxxxxxxxx/
    [PATCH 1/1] commit: don't use generation numbers if not needed