Re: [RFC] Possible idea for GSoC 2020

Christian Couder <christian.couder@xxxxxxxxx> · Tue, 17 Mar 2020 08:24:22 +0100

On Tue, Mar 17, 2020 at 4:13 AM Jakub Narebski <jnareb@xxxxxxxxx> wrote:

[...]

> >> ### Graph labelling for speeding up git commands

[...]

> > We already have the second inequality (f(u) <= f(v)) where the function
> > 'f' is the generation of v. The success of this approach over generation
> > numbers relies entirely on how often the inequality min_graph(v) <= post(u)
> > fails when gen(u) <= gen(v) holds.
>
> True.  It may turn out that additional negative-cut filters do not bring
> enough performance improvements over topological levels or corrected
> commit date (or monotonically increasing corrected commit date) to be
> worth it.
>
> I think they can help in wide commit graphs (many concurrently developed
> branches with many commits and few merges), and when there is an orphan
> branch (like 'todo' in git.git, or 'gh-pages' for storing
> per-project GitHub Pages) that is somehow entangled in the query.
>
> >> Suppose that for each commit 'v' we compute and store in the
> >> commit-graph file two numbers: 'post(v)' and the minimum of 'post(u)'
> >> for commits that were visited during the part of depth-first search
> >> that started from 'v' (which is the minimum of post-order number for
> >> the subtree of a spanning tree that starts at 'v').  Let's call the
> >> latter 'min_tree(v)'.  Then the following condition is true:
> >>
> >>   if min_tree(v) <= post(u) <= post(v), then 'v' can reach 'u'
> >
> > How many places in Git do we ask "can v reach u?" and how many would
> > return immediately without needing a walk in this new approach? My
> > guess is that we will have a very narrow window where this query
> > returns a positive result.
>
> As I wrote below, such a positive-cut filter would be directly helpful in
> performing the following commands:
>
>  - `git merge-base --is-ancestor`
>  - `git branch --contains`
>  - `git tag --contains`
>  - `git branch --merged`
>  - `git tag --merged`
>
> It would also be useful for tag autofollow in git-fetch; it is the N-to-M
> equivalent of the 1-to-N / N-to-1 `--contains` queries.
>
> I am quite sure that a positive-cut filter would make the
> `--ancestry-path` walk faster.
>
> I think, but I am not sure, that a positive-cut filter can make parts of
> the topological sort and merge base algorithms at least a tiny bit faster.

Is there an easy way to check that it would provide significant
performance improvements at least in some cases? Can we ask the
student to do that at the beginning of the GSoC?
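
As a rough idea of what such a check could look like, here is a minimal
Python sketch (not Git code) that labels a toy commit graph with post(v)
and min_tree(v) as described in the quoted text above, and counts how
many "can v reach u?" queries the positive-cut condition answers without
a walk. The toy graph and the helper names label(), reaches() and
measure() are made up for illustration; a real experiment would have to
run against actual repositories and the actual commit walk.

```python
from itertools import product

# Hypothetical toy history: commit -> list of parents. 'T' is an orphan branch.
PARENTS = {
    "A": [],
    "B": ["A"],
    "C": ["A"],
    "D": ["B", "C"],
    "E": ["C"],
    "T": [],
}
HEADS = ["D", "E", "T"]


def label(parents, heads):
    """Assign post(v) and min_tree(v) with one depth-first walk from the heads."""
    post, min_tree = {}, {}
    counter = 0

    def visit(v):
        nonlocal counter
        lowest = None
        for p in parents[v]:
            if p not in post:  # p becomes a child of v in the spanning tree
                visit(p)
                lowest = min_tree[p] if lowest is None else min(lowest, min_tree[p])
        counter += 1
        post[v] = counter
        min_tree[v] = post[v] if lowest is None else min(lowest, post[v])

    for h in heads:
        if h not in post:
            visit(h)
    return post, min_tree


def reaches(parents, v, u):
    """Ground truth: plain walk from v over parent links, looking for u."""
    stack, seen = [v], set()
    while stack:
        c = stack.pop()
        if c == u:
            return True
        if c not in seen:
            seen.add(c)
            stack.extend(parents[c])
    return False


def measure(parents, heads):
    """Return (pairs answered by the filter, pairs that are truly reachable)."""
    post, min_tree = label(parents, heads)
    hits = reachable = 0
    for v, u in product(parents, repeat=2):
        truth = reaches(parents, v, u)
        claimed = min_tree[v] <= post[u] <= post[v]
        assert truth or not claimed  # the filter must never give a false positive
        reachable += truth
        hits += claimed
    return hits, reachable
```

On this toy graph measure(PARENTS, HEADS) returns (10, 13): the filter
answers 10 of the 13 reachable pairs and misses, for example, that 'E'
reaches 'C', because 'C' was first visited from 'D'. The interesting
number for the project would be the same ratio on real histories and
real query mixes.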

> > I believe we discussed this concept briefly when planning "generation
> > number v2" and the main concern I have with this plan is that the
> > values are not stable. The value of post(v) and min_tree(v) depend
> > on the entire graph as a whole, not just what is reachable from v
> > (and preferably only the parents of v).
> >
> > Before starting to implement this, I would consider how such labels
> > could be computed across incremental commit-graph boundaries. That is,
> > if I'm only adding a layer of commits to the commit-graph without
> > modifying the existing layers of the commit-graph chain, can I still
> > compute values with these properties? How expensive is it? Do I need
> > to walk the entire reachable set of commits?
>
> I think it would be possible to compute post(v) and min_tree(v) using
> incremental updates, and to make it compatible with the incremental
> commit-graph format (with the commit-graph chain).  But I have not
> proven it.

Would it be difficult to prove? What would be required? And again, can
we ask the student to do that at the beginning of the GSoC?

[...]

> > The point of generation number v2 [1] was to allow moving to "exact"
> > algorithms for things like merge-base where we still use commit time
> > as a heuristic, and could be wrong because of special data shapes.
> > We don't use generation number in these examples because using only
> > generation number can lead to a large increase in number of commits
> > walked. The example we saw in the Linux kernel repository was a bug
> > fix created on top of a very old commit, so there was a commit of
> > low generation with very high commit-date that caused extra walking.
> > (See [2] for a detailed description of the data shape.)
> >
> > My _prediction_ is that the two-dimensional system will be more
> > complicated to write and use, and will not have any measurable
> > difference. I'd be happy to be wrong, but I also would not send
> > anyone down this direction only to find out I'm right and that
> > effort was wasted.
>
> That might be a problem.
>
> This is a bit of a "moonshot" / research project, more so than others.
> Though it would still be valuable, in my opinion, even if the code
> wouldn't ultimately get merged and added into Git.

I agree that it feels like a "moonshot" / research project.

> > My recommendation is that a GSoC student update the
> > generation number to "v2" based on the definition you made in [1].
> > That proposal is also more likely to be effective in Git because
> > it makes use of extra heuristic information (commit date) to
> > assist the types of algorithms we care about.
> >
> > In that case, the "difficult" part is moving the "generation"
> > member of struct commit into a slab before making it a 64-bit
> > value. (This is likely necessary for your plan, anyway.) Updating
> > the generation number to v2 is relatively straightforward after
> > that, as someone can follow all places that reference or compute
> > generation numbers and apply a diff
>
> Good idea!  Though I am not sure if it is not too late to add it to
> https://git.github.io/SoC-2020-Ideas/ as the self-imposed deadline of
> March 16 (when students can start submitting proposals to GSoC) has
> just passed.  Christian, what do you think?

Would that be a different project idea or part of your "Graph labelling
for speeding up git commands" project idea?

I am very reluctant to add new project ideas at this point. I don't
think students would have time to properly research it and get their
proposals reviewed.

It could be part of your research project though, to check if that
approach is better or good enough compared to what you suggest in the
current version of your project.
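
To make the comparison concrete, here is a small Python sketch of what I
understand the "corrected commit date" mentioned above to be: the commit
date, bumped where needed so that it is strictly greater than the
corrected date of every parent. This is only my reading of the idea, not
the definition from [1] verbatim, and the function name and toy data are
made up for illustration.

```python
def corrected_commit_dates(parents, commit_date):
    """Assumed rule: max(own commit date, 1 + largest corrected date of a parent)."""
    corrected = {}

    def visit(v):
        if v not in corrected:
            # Root commits have no parents, so only their own date counts.
            floor = max((visit(p) + 1 for p in parents[v]), default=0)
            corrected[v] = max(commit_date[v], floor)
        return corrected[v]

    for v in parents:
        visit(v)
    return corrected


# Toy history with clock skew: 'B' and 'D' claim to be older than their parents.
PARENTS = {"A": [], "B": ["A"], "C": ["B"], "D": ["B"]}
DATES = {"A": 100, "B": 90, "C": 200, "D": 95}
```

With this data the sketch gives 100, 101, 200 and 102 for 'A', 'B', 'C'
and 'D'. A child then always has a strictly larger value than its
parents, so the value can serve as a negative-cut filter in the same way
as topological levels while staying close to the real commit date.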

> Would you agree, Stolee, to be a _possible_ mentor or co-mentor for
> "Generation number v2" project?

At this point I think it might be best if you are both willing to
co-mentor a "moonshot" / research project to find out what the best way
forward is by benchmarking the different approaches that you both
suggest for different commands/use cases.