I hope that I am addressing the most recent version of this series.

Derrick Stolee <stolee@xxxxxxxxx> writes:

> As promised [1], this patch contains a way to serialize the commit graph.
> The current implementation defines a new file format to store the graph
> structure (parent relationships) and basic commit metadata (commit date,
> root tree OID) in order to prevent parsing raw commits while performing
> basic graph walks. For example, we do not need to parse the full commit
> when performing these walks:
>
> * 'git log --topo-order -1000' walks all reachable commits to avoid
>   incorrect topological orders, but only needs the commit message for
>   the top 1000 commits.
>
> * 'git merge-base <A> <B>' may walk many commits to find the correct
>   boundary between the commits reachable from A and those reachable
>   from B. No commit messages are needed.
>
> * 'git branch -vv' checks ahead/behind status for all local branches
>   compared to their upstream remote branches. This is essentially as
>   hard as computing merge bases for each.
>
> The current patch speeds up these calculations by injecting a check in
> parse_commit_gently() to check if there is a graph file and using that
> to provide the required metadata to the struct commit.

That's nice.

What are the assumptions about the serialized commit graph format? Does it
need to be:
 - extensible without rewriting (e.g. append-only)?
 - like the above, but possibly needing a rewrite for optimal performance?
 - such that extending it requires rewriting the whole file?

Excuse me if this was already asked and answered.

> The file format has room to store generation numbers, which will be
> provided as a patch after this framework is merged. Generation numbers
> are referenced by the design document but not implemented in order to
> make the current patch focus on the graph construction process. Once
> that is stable, it will be easier to add generation numbers and make
> graph walks aware of generation numbers one-by-one.
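To make the generation-number idea concrete, here is a minimal sketch (in
Python, not git's C code) of how such numbers could prune a reachability
walk. The definition used here, 1 for root commits and 1 + max over the
parents otherwise, is an assumption on my part; the design document may
define it differently. The names compute_generations() and can_reach() are
hypothetical and not part of the patch.

```python
from typing import Dict, List

Graph = Dict[str, List[str]]  # commit id -> list of parent ids


def compute_generations(parents: Graph) -> Dict[str, int]:
    """Assumed definition: roots get 1, others 1 + max over their parents."""
    gen: Dict[str, int] = {}
    for start in parents:
        stack = [start]
        while stack:
            c = stack[-1]
            if c in gen:
                stack.pop()
                continue
            missing = [p for p in parents[c] if p not in gen]
            if missing:
                # Parents must be numbered before the commit itself.
                stack.extend(missing)
            else:
                gen[c] = 1 + max((gen[p] for p in parents[c]), default=0)
                stack.pop()
    return gen


def can_reach(parents: Graph, gen: Dict[str, int], src: str, dst: str) -> bool:
    """Return True if dst is src itself or an ancestor of src."""
    seen = {src}
    stack = [src]
    while stack:
        c = stack.pop()
        if c == dst:
            return True
        for p in parents[c]:
            # A proper ancestor always has a strictly smaller generation
            # number, so a commit below gen[dst] cannot lead to dst.
            if p not in seen and gen[p] >= gen[dst]:
                seen.add(p)
                stack.append(p)
    return False


# Toy history: A is the root, D merges B and C, E sits on top of C.
history: Graph = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"], "E": ["C"]}
generations = compute_generations(history)
```

With this toy history, a query such as can_reach(history, generations, "E",
"B") stops as soon as the walk drops below the generation of B, without
visiting A.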

As the serialized commit graph format is versioned, I wonder if it would
be possible to speed up graph walks even more by adding to it the FELINE
index (a pair of numbers) from "Reachability Queries in Very Large Graphs:
A Fast Refined Online Search Approach" (2014), available at

  http://openproceedings.org/EDBT/2014/paper_166.pdf

The implementation would probably need adjustments to make it
unambiguous and unambiguously extensible; unless there is a place for
indices that are local-only and need to be recalculated from scratch
when the graph changes (to cover the whole graph).

> Here are some performance results for a copy of the Linux repository
> where 'master' has 704,766 reachable commits and is behind 'origin/master'
> by 19,610 commits.
>
> | Command                          | Before | After  | Rel % |
> |----------------------------------|--------|--------|-------|
> | log --oneline --topo-order -1000 |  5.9s  |  0.7s  | -88%  |
> | branch -vv                       |  0.42s |  0.27s | -35%  |
> | rev-list --all                   |  6.4s  |  1.0s  | -84%  |
> | rev-list --all --objects         | 32.6s  | 27.6s  | -15%  |

That's the "Rel %" of "Before", that is delta/before, isn't it?

> To test this yourself, run the following on your repo:
>
>     git config core.commitGraph true
>     git show-ref -s | git commit-graph write --set-latest --stdin-commits
>
> The second command writes a commit graph file containing every commit
> reachable from your refs. Now, all git commands that walk commits will
> check your graph first before consulting the ODB. You can run your own
> performance comparisons by toggling the 'core.commitgraph' setting.

Good. It is nicely similar to how bitmap indices (of reachability) are
handled.

I just wonder what happens in the (rare) presence of grafts (old
mechanism), or "git replace"-d commits...

Regards,
-- 
Jakub Narębski