Re: [PATCH v4 00/13] Serialized Git Commit Graph

Derrick Stolee <stolee@xxxxxxxxx> · Mon, 2 Apr 2018 11:02:09 -0400

On 4/2/2018 10:46 AM, Jakub Narebski wrote:
Derrick Stolee <stolee@xxxxxxxxx> writes:
[...]
At one point, I was investigating these reachability indexes (I read
"SCARAB: Scaling Reachability Computation on Large Graphs" by Jihn,
Ruan, Dey, and Xu [2]) but find the question that these indexes target
to be lacking for most of the Git uses. That is, they ask the boolean
question "Can X reach Y?". More often, Git needs to answer "What is
the set of commits reachable from X but not from Y" or "Topologically
sort commits reachable from X" or "How many commits are in each part
of the symmetric difference between reachable from X or reachable from
Y?"
In the "Reachability Queries in Very Large Graphs..." by Veloso, Cerf,
Meira and Zaki FELINE-index work, authors mention SCARAB as something
that can be used in addition to FELINE-index, as a complementary data
(FELINE-SCARAB in the work, section 4.4).

I see the FELINE-index as a stronger form of generation numbers (called
also level of the vertex / node), in that it allows to negative-cut even
more, pruning paths that are known to be unreachable (or marking nodes
known to be unreachable in the "calculate difference" scenario).

Also, FELINE-index uses two integer numbers (coordinates in 2d space);
one of those indices can be the topological numbering (topological
sorting order) of nodes in the commit graph.  That would help to answer
even more Git questions.

This two-dimensional generation number is helpful for non-reachability 
queries, but is something that needs the "full" commit graph in order to 
define the value for a single commit (hence the O(N lg N) performance 
mentioned below). Generation numbers are effective while being easy to 
compute and immutable.

I wonder if FELINE was compared directly to a one-dimensional index (I 
apologize that I have not read the paper in detail, so I don't 
understand the indexes they compare against). It also appears the graphs 
they use for their tests are not commit graphs, which have a different 
shape than many of the digraphs studies by that work.

This is all to say: I would love to see an interesting study in this 
direction, specifically comparing this series' definition of generation 
numbers to the 2-dimensional system in FELINE, and on a large sample of 
commit graphs available in open-source data sets (Linux kernel, Git, 
etc.) and possibly on interesting closed-source data sets.

The case for "Can X reach Y?" is mostly for commands like 'git branch
--contains', when 'git fetch' checks for forced-updates of branches,
or when the server decides enough negotiation has occurred during a
'git fetch'. While these may be worth investigating, they also benefit
greatly from the accelerated graph walk introduced in the current
format.

I would be happy to review any effort to extend the commit-graph
format to include such indexes, as long as the performance benefits
outweigh the complexity to create them.

[2] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.719.8396&rep=rep1&type=pdf
The complexity of calculating FELINE index is O(|V| log(|V|) + |E|), the
storage complexity is 2*|V|.

This would be very easy to add as an optional chunk, since it can use 
one row per commit.

Thanks,
-Stolee