On 4/2/2018 10:46 AM, Jakub Narebski wrote:
Derrick Stolee <stolee@xxxxxxxxx> writes:
[...]
At one point, I was investigating these reachability indexes (I read
"SCARAB: Scaling Reachability Computation on Large Graphs" by Jihn,
Ruan, Dey, and Xu [2]) but find the question that these indexes target
to be lacking for most of the Git uses. That is, they ask the boolean
question "Can X reach Y?". More often, Git needs to answer "What is
the set of commits reachable from X but not from Y" or "Topologically
sort commits reachable from X" or "How many commits are in each part
of the symmetric difference between reachable from X or reachable from
Y?"
In the "Reachability Queries in Very Large Graphs..." by Veloso, Cerf,
Meira and Zaki FELINE-index work, authors mention SCARAB as something
that can be used in addition to FELINE-index, as a complementary data
(FELINE-SCARAB in the work, section 4.4).
I see the FELINE-index as a stronger form of generation numbers (called
also level of the vertex / node), in that it allows to negative-cut even
more, pruning paths that are known to be unreachable (or marking nodes
known to be unreachable in the "calculate difference" scenario).
Also, FELINE-index uses two integer numbers (coordinates in 2d space);
one of those indices can be the topological numbering (topological
sorting order) of nodes in the commit graph. That would help to answer
even more Git questions.
This two-dimensional generation number is helpful for non-reachability
queries, but is something that needs the "full" commit graph in order to
define the value for a single commit (hence the O(N lg N) performance
mentioned below). Generation numbers are effective while being easy to
compute and immutable.
I wonder if FELINE was compared directly to a one-dimensional index (I
apologize that I have not read the paper in detail, so I don't
understand the indexes they compare against). It also appears the graphs
they use for their tests are not commit graphs, which have a different
shape than many of the digraphs studies by that work.
This is all to say: I would love to see an interesting study in this
direction, specifically comparing this series' definition of generation
numbers to the 2-dimensional system in FELINE, and on a large sample of
commit graphs available in open-source data sets (Linux kernel, Git,
etc.) and possibly on interesting closed-source data sets.
The case for "Can X reach Y?" is mostly for commands like 'git branch
--contains', when 'git fetch' checks for forced-updates of branches,
or when the server decides enough negotiation has occurred during a
'git fetch'. While these may be worth investigating, they also benefit
greatly from the accelerated graph walk introduced in the current
format.
I would be happy to review any effort to extend the commit-graph
format to include such indexes, as long as the performance benefits
outweigh the complexity to create them.
[2] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.719.8396&rep=rep1&type=pdf
The complexity of calculating FELINE index is O(|V| log(|V|) + |E|), the
storage complexity is 2*|V|.
This would be very easy to add as an optional chunk, since it can use
one row per commit.
Thanks,
-Stolee