Cc: Stolee, Heba, Jonathan T., Emily Shaffer.

Junio C Hamano <gitster@xxxxxxxxx> writes:
> Jakub Narebski <jnareb@xxxxxxxxx> writes:
>
>> A few questions:
>> - is it too late to propose a new project idea for GSoC 2020?
>> - is it too difficult a project for GSoC?
>> ...
>> ### Graph labelling for speeding up git commands
>>
>> - Language: C
>> - Difficulty: hard / difficult
>> - Possible mentors: Jakub Narębski
>
> I am not running the GSoC or participating in it in any way other
> than just being a reviewer-maintainer of the project, but I would
> appreciate a well-thought-out write-up very much.

I have prepared slides for "Graph operations in Git version control
system" (PDF). They mainly describe what has already been done to
improve the performance of these operations, but they also include a
few thoughts about the future (like additional graph reachability
labellings). Unfortunately the slides are in Polish, not in English.
If there is interest, I could translate them and put the result
somewhere accessible. Or I could try to turn this information into a
blog post; the topic would really gain from images (like Derrick
Stolee's series of articles on the commit-graph).

>> Git uses various clever methods for making operations on very large
>> repositories faster, from bitmap indices for git-fetch[1], to generation
>> numbers (also known as topological levels) in the commit-graph file for
>> commit graph traversal operations like `git log --graph`[2].
>>
>> One possible improvement that can make Git even faster is using min-post
>> intervals labelling. The basis of this labelling is the post-visit order
>> of a depth-first search traversal tree of the commit graph; let's call
>> it 'post(v)'.
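
To make 'post(v)' concrete, here is a rough sketch of how it could be
computed. It is in Python rather than C for brevity, and it is only an
illustration: the `parents` mapping, the `tips` list and the function
name `post_order` are made up for this example and are not Git's data
structures or API.

```python
def post_order(parents, tips):
    """Number commits in the order a depth-first search finishes them.

    `parents` maps a commit to the list of its parents (hypothetical
    in-memory stand-in for the commit graph); `tips` are the commits
    the traversal starts from.
    """
    post = {}
    seen = set()
    for tip in tips:
        if tip in seen:
            continue
        seen.add(tip)
        # Explicit stack of (commit, iterator over its parents), so that
        # long histories do not hit the recursion limit.
        stack = [(tip, iter(parents.get(tip, ())))]
        while stack:
            commit, remaining = stack[-1]
            for parent in remaining:
                if parent not in seen:
                    seen.add(parent)
                    stack.append((parent, iter(parents.get(parent, ()))))
                    break
            else:
                # All parents handled: the commit is finished.
                stack.pop()
                post[commit] = len(post) + 1
    return post


# Small diamond-shaped history: D merges B and C, both based on A.
example_parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
example_post = post_order(example_parents, ["D"])
print(example_post)  # {'A': 1, 'B': 2, 'C': 3, 'D': 4}
```

Note that the numbering depends on the order in which parents are
visited, so a real implementation would have to fix that order.
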
>>
>> If for each commit 'v' we compute and store in the commit-graph file
>> two numbers, 'post(v)' and the minimum of 'post(u)' over all commits
>> 'u' reachable from 'v' (let's call the latter 'min_graph(v)'), then
>> the following condition holds:
>>
>>     if 'v' can reach 'u', then min_graph(v) <= post(u) <= post(v)
>>
>> Alternatively, for each commit 'v' we could compute and store
>> 'post(v)' and the minimum of 'post(u)' over the commits that were
>> visited during the part of the depth-first search that started from
>> 'v' (that is, the minimum post-order number in the subtree of the
>> spanning tree rooted at 'v'); let's call the latter 'min_tree(v)'.
>> Then the following condition holds:
>>
>>     if min_tree(v) <= post(u) <= post(v), then 'v' can reach 'u'
>>
>> The task would be to implement computing such a labelling (or a more
>> involved variant of it[3][4]), storing it in the commit-graph file,
>> and using it to speed up git commands (starting from a single chosen
>> command) such as:
>>
>> - git merge-base --is-ancestor A B
>> - git branch --contains A
>> - git tag --contains A
>> - git branch --merged A
>> - git tag --merged A
>> - git merge-base --all A B
>> - git log --topo-order
>>
>> References:
>>
>> 1. <http://githubengineering.com/counting-objects/>
>> 2. <https://devblogs.microsoft.com/devops/supercharging-the-git-commit-graph-iii-generations/>
>> 3. <https://arxiv.org/abs/1404.4465>
>> 4. <https://github.com/steps/Ferrari>
>>
>> See also discussion in:
>>
>> <https://public-inbox.org/git/86tvl0zhos.fsf@xxxxxxxxx/t/>

P.S. A somewhat expanded writeup is now available at
https://git.github.io/SoC-2020-Ideas/

Best,
-- 
Jakub Narębski
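
P.P.S. In case it helps to see the two conditions from the proposal
side by side, below is a rough sketch in Python (not C, to keep it
short). Everything in it is an assumption made for illustration: the
`parents` mapping stands in for the commit graph, and the names
`min_post_labels` and `can_reach` are hypothetical, not existing Git
functions. The 'min_graph' interval is used to answer "cannot reach"
early, the 'min_tree' interval to answer "can reach" early, and only
the undecided cases fall back to walking the parents.

```python
def min_post_labels(parents, tips):
    """Compute post(v), min_tree(v) and min_graph(v) in one DFS."""
    post, min_tree, min_graph = {}, {}, {}
    for tip in tips:
        if tip in min_tree:          # min_tree doubles as the "seen" set
            continue
        # The first commit finished below `tip` gets the next post number.
        min_tree[tip] = len(post) + 1
        stack = [(tip, iter(parents.get(tip, ())))]
        while stack:
            commit, remaining = stack[-1]
            for parent in remaining:
                if parent not in min_tree:
                    min_tree[parent] = len(post) + 1
                    stack.append((parent, iter(parents.get(parent, ()))))
                    break
            else:
                stack.pop()
                post[commit] = len(post) + 1
                # In a DAG every parent is already finished at this point.
                min_graph[commit] = min(
                    [post[commit]]
                    + [min_graph[p] for p in parents.get(commit, ())]
                )
    return post, min_tree, min_graph


def can_reach(v, u, parents, post, min_tree, min_graph):
    """Answer "can v reach u?" using both intervals as shortcuts."""
    if not (min_graph[v] <= post[u] <= post[v]):
        return False                 # negative cut: certainly unreachable
    if min_tree[v] <= post[u]:
        return True                  # positive cut: u is in v's DFS subtree
    # Undecided: fall back to walking the parents.
    return any(
        can_reach(p, u, parents, post, min_tree, min_graph)
        for p in parents.get(v, ())
    )


# Diamond-shaped history: D merges B and C, both based on A.
parents = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
post, min_tree, min_graph = min_post_labels(parents, ["D"])

print(can_reach("D", "A", parents, post, min_tree, min_graph))  # True
print(can_reach("B", "C", parents, post, min_tree, min_graph))  # False
print(can_reach("C", "A", parents, post, min_tree, min_graph))  # True
```

In this toy history the query B -> C is rejected by the 'min_graph'
interval alone, D -> A is accepted by the 'min_tree' interval alone,
and C -> A is the kind of case that still needs a (short) walk.
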