Re: [PATCH v4 00/13] Serialized Git Commit Graph

Jakub Narebski <jnareb@xxxxxxxxx> · Fri, 30 Mar 2018 13:10:05 +0200

I hope that I am addressing the most recent version of this series.

Derrick Stolee <stolee@xxxxxxxxx> writes:

> As promised [1], this patch contains a way to serialize the commit graph.
> The current implementation defines a new file format to store the graph
> structure (parent relationships) and basic commit metadata (commit date,
> root tree OID) in order to prevent parsing raw commits while performing
> basic graph walks. For example, we do not need to parse the full commit
> when performing these walks:
>
> * 'git log --topo-order -1000' walks all reachable commits to avoid
>   incorrect topological orders, but only needs the commit message for
>   the top 1000 commits.
>
> * 'git merge-base <A> <B>' may walk many commits to find the correct
>   boundary between the commits reachable from A and those reachable
>   from B. No commit messages are needed.
>
> * 'git branch -vv' checks ahead/behind status for all local branches
>   compared to their upstream remote branches. This is essentially as
>   hard as computing merge bases for each.
>
> The current patch speeds up these calculations by injecting a check in
> parse_commit_gently() to check if there is a graph file and using that
> to provide the required metadata to the struct commit.

That's nice.

What are the assumptions about the serialized commit graph format?  Is it:
 - extensible without rewriting (e.g. append-only)?
 - extensible without rewriting, but possibly needing a rewrite for
   optimal performance?
 - extensible only by rewriting the whole file?

Excuse me if this was already asked and answered.

>
> The file format has room to store generation numbers, which will be
> provided as a patch after this framework is merged. Generation numbers
> are referenced by the design document but not implemented in order to
> make the current patch focus on the graph construction process. Once
> that is stable, it will be easier to add generation numbers and make
> graph walks aware of generation numbers one-by-one.

As the serialized commit graph format is versioned, I wonder if it would
be possible to speed up graph walks even more by adding a FELINE index
(a pair of numbers per commit) to it, from "Reachability Queries in Very
Large Graphs: A Fast Refined Online Search Approach" (2014), available at
http://openproceedings.org/EDBT/2014/paper_166.pdf

The implementation would probably need adjustments to make the index
unambiguous and extensible in a well-defined way, unless there is room
for indices that are local-only and need to be recalculated from scratch
whenever the graph changes (so that they cover the whole graph).
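
To make the idea concrete, here is a minimal sketch in Python (not git
code) of how such a pair of numbers can cut off reachability queries.
It assumes only the basic property: each commit gets its rank in two
different topological orders, and if A can reach B then both ranks of A
are no larger than those of B.  The names topo_order(), build_index()
and can_reach(), the toy graph, and the tie-breaking used for the second
order are all my own simplifications and may not match the heuristic in
the paper.

```python
import heapq


def topo_order(parents, key):
    """Rank commits with Kahn's algorithm over commit -> parent edges.

    Children always get a smaller rank than their parents; ties among
    ready commits are broken by the smallest key(commit).
    """
    indegree = {c: 0 for c in parents}
    for ps in parents.values():
        for p in ps:
            indegree[p] += 1
    heap = [(key(c), c) for c, d in indegree.items() if d == 0]
    heapq.heapify(heap)
    rank = {}
    while heap:
        _, c = heapq.heappop(heap)
        rank[c] = len(rank)
        for p in parents[c]:
            indegree[p] -= 1
            if indegree[p] == 0:
                heapq.heappush(heap, (key(p), p))
    return rank


def build_index(parents):
    """Assign each commit a pair (x, y) of ranks in two topological orders."""
    x = topo_order(parents, key=lambda c: c)
    # Assumption: the second order prefers the ready commit with the
    # largest x, so that the two orders disagree as much as possible.
    y = topo_order(parents, key=lambda c: -x[c])
    return {c: (x[c], y[c]) for c in parents}


def can_reach(parents, index, src, dst):
    """Return True if dst is reachable from src by following parents."""
    dx, dy = index[dst]
    sx, sy = index[src]
    if sx > dx or sy > dy:
        return False  # negative cut: no walk needed
    stack, seen = [src], {src}
    while stack:
        c = stack.pop()
        if c == dst:
            return True
        for p in parents[c]:
            px, py = index[p]
            # Skip parents that cannot possibly lead to dst.
            if p not in seen and px <= dx and py <= dy:
                seen.add(p)
                stack.append(p)
    return False


# Toy history: M merges A and B, which both descend from root R.
parents = {"M": ["A", "B"], "A": ["R"], "B": ["R"], "R": []}
index = build_index(parents)
```

On this toy history, can_reach(parents, index, "A", "B") is answered by
the index alone, without walking any commits, because the two orders
disagree about which of A and B comes first.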

>
> Here are some performance results for a copy of the Linux repository
> where 'master' has 704,766 reachable commits and is behind 'origin/master'
> by 19,610 commits.
>
> | Command                          | Before | After  | Rel % |
> |----------------------------------|--------|--------|-------|
> | log --oneline --topo-order -1000 |  5.9s  |  0.7s  | -88%  |
> | branch -vv                       |  0.42s |  0.27s | -35%  |
> | rev-list --all                   |  6.4s  |  1.0s  | -84%  |
> | rev-list --all --objects         | 32.6s  | 27.6s  | -15%  |

The "Rel %" column is relative to "Before", that is
(After - Before) / Before, isn't it?
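
In other words, I assume something like the following small Python
calculation, using the timings from the table above (rel_percent() is
just a name I made up for illustration):

```python
def rel_percent(before, after):
    """Relative change in percent, measured against the 'before' time."""
    return (after - before) / before * 100.0


# (before, after) timings in seconds, copied from the table above.
timings = {
    "log --oneline --topo-order -1000": (5.9, 0.7),
    "branch -vv": (0.42, 0.27),
    "rev-list --all": (6.4, 1.0),
    "rev-list --all --objects": (32.6, 27.6),
}

for command, (before, after) in timings.items():
    print(f"{command:34s} {rel_percent(before, after):+.1f}%")
```

If that reading is right, the results should agree with the table to
within about one percentage point; small differences would be expected
because the timings shown are themselves rounded.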

> To test this yourself, run the following on your repo:
>
>   git config core.commitGraph true
>   git show-ref -s | git commit-graph write --set-latest --stdin-commits
>
> The second command writes a commit graph file containing every commit
> reachable from your refs. Now, all git commands that walk commits will
> check your graph first before consulting the ODB. You can run your own
> performance comparisons by toggling the 'core.commitgraph' setting.

Good.  This is nicely similar to how reachability bitmap indices are
handled.

I just wonder what happens in the (rare) presence of grafts (old
mechanism), or "git replace"-d commits...

Regards,
-- 
Jakub Narębski