On 3/10/2020 10:50 AM, Jakub Narebski wrote: > Hello, > > Here below is a possible proposal for a more difficult Google Summer of > Code 2020 project. > > A few questions: > - is it too late to propose a new project idea for GSoC 2020? > - is it too difficult of a project for GSoC? > > Best, > > Jakub Narębski > > -------------------------------------------------- > > ### Graph labelling for speeding up git commands > > - Language: C > - Difficulty: hard / difficult > - Possible mentors: Jakub Narębski > > Git uses various clever methods for making operations on very large > repositories faster, from bitmap indices for git-fetch[1], to generation > numbers (also known as topological levels) in the commit-graph file for > commit graph traversal operations like `git log --graph`[2]. > > One possible improvement that can make Git even faster is using min-post > intervals labelling. The basis of this labelling is post-visit order of > a depth-first search traversal tree of a commit graph, let's call it > 'post(v)'. > > If for each commit 'v' we would compute and store in the commit-graph > file two numbers: 'post(v)' and the minimum of 'post(u)' for all commits > reachable from 'v', let's call the latter 'min_graph(v)', then the > following condition is true: > > if 'v' can reach 'u', then min_graph(v) <= post(u) <= post(v) I haven't thought too hard about it, but I'm assuming that if v is not in a commit-graph file, then post(v) would be "infinite" and min_graph(v) would be zero. We already have the second inequality (f(u) <= f(v)) where the function 'f' is the generation of v. The success of this approach over generation numbers relies entirely on how often the inequality min_graph(v) <= post(u) fails when gen(u) <= gen(v) holds. > If for each commit 'v' we would compute and store in the commit-graph > file two numbers: 'post(v)' and the minimum of 'post(u)' for commits > that were visited during the part of depth-first search that started > from 'v' (which is the minimum of post-order number for subtree of a > spanning tree that starts at 'v'). Let's call the later 'min_tree(v)'. > Then the following condition is true: > > if min_tree(v) <= post(u) <= post(v), then 'v' can reach 'u' How many places in Git do we ask "can v reach u?" and how many would return immediately without needing a walk in this new approach? My guess is that we will have a very narrow window where this query returns a positive result. I believe we discussed this concept briefly when planning "generation number v2" and the main concern I have with this plan is that the values are not stable. The value of post(v) and min_tree(v) depend on the entire graph as a whole, not just what is reachable from v (and preferably only the parents of v). Before starting to implement this, I would consider how such labels could be computed across incremental commit-graph boundaries. That is, if I'm only adding a layer of commits to the commit-graph without modifying the existing layers of the commit-graph chain, can I still compute values with these properties? How expensive is it? Do I need to walk the entire reachable set of commits? > The task would be to implement computing such labelling (or a more > involved variant of it[3][4]), storing it in commit-graph file, and > using it for speeding up git commands (starting from a single chosen > command) such as: > > - git merge-base --is-ancestor A B > - git branch --contains A > - git tag --contains A > - git branch --merged A > - git tag --merged A > - git merge-base --all A B > - git log --topo-sort Having such a complicated two-dimensional system would need to justify itself by being measurably faster than that one-dimensional system in these example commands. The point of generation number v2 [1] was to allow moving to "exact" algorithms for things like merge-base where we still use commit time as a heuristic, and could be wrong because of special data shapes. We don't use generation number in these examples because using only generation number can lead to a large increase in number of commits walked. The example we saw in the Linux kernel repository was a bug fix created on top of a very old commit, so there was a commit of low generation with very high commit-date that caused extra walking. (See [2] for a detailed description of the data shape.) My _prediction_ is that the two-dimensional system will be more complicated to write and use, and will not have any measurable difference. I'd be happy to be wrong, but I also would not send anyone down this direction only to find out I'm right and that effort was wasted. My recommendation is that a GSoC student update the generation number to "v2" based on the definition you made in [1]. That proposal is also more likely to be effective in Git because it makes use of extra heuristic information (commit date) to assist the types of algorithms we care about. In that case, the "difficult" part is moving the "generation" member of struct commit into a slab before making it a 64-bit value. (This is likely necessary for your plan, anyway.) Updating the generation number to v2 is relatively straight-forward after that, as someone can follow all places that reference or compute generation numbers and apply a diff Thanks, -Stolee [1] https://lore.kernel.org/git/86o8ziatb2.fsf_-_@xxxxxxxxx/ [RFC/PATCH] commit-graph: generation v5 (backward compatible date ceiling) [2] https://lore.kernel.org/git/efa3720fb40638e5d61c6130b55e3348d8e4339e.1535633886.git.gitgitgadget@xxxxxxxxx/ [PATCH 1/1] commit: don't use generation numbers if not needed