Re: [PATCH v4 00/13] Serialized Git Commit Graph

Jakub Narebski <jnareb@xxxxxxxxx> · Sun, 08 Apr 2018 00:37:12 +0200

Derrick Stolee <stolee@xxxxxxxxx> writes:

> On 4/2/2018 10:46 AM, Jakub Narebski wrote:
>> Derrick Stolee <stolee@xxxxxxxxx> writes:
> [...]
>> I see the FELINE-index as a stronger form of generation numbers (called
>> also level of the vertex / node), in that it allows to negative-cut even
>> more, pruning paths that are known to be unreachable (or marking nodes
>> known to be unreachable in the "calculate difference" scenario).
>>
>> Also, FELINE-index uses two integer numbers (coordinates in 2d space);
>> one of those indices can be the topological numbering (topological
>> sorting order) of nodes in the commit graph.  That would help to answer
>> even more Git questions.
>
> This two-dimensional generation number is helpful for non-reachability
> queries, but is something that needs the "full" commit graph in order
> to define the value for a single commit (hence the O(N lg N)
> performance mentioned below). Generation numbers are effective while
> being easy to compute and immutable.

O(N log N) is the cost of sort algorithm, namely peforming topological
sorting of commits.  Which Git can currently do.

We can use the main idea behind FELINE index, namely weak dominance
drawing, while perhaps changing the detail.  The main idea is that if
you draw edges of DAG always to be in selected quadrant -- the default
is drawing them up and to the right, then paths would also always be in
the same quadrant; if target node is not in given quadrant with respect
to source node, then target node cannot be reached from source node.

For fast answering of reachability queries it is important to have
minimal number of false positives (falsely implied paths), that is
situation where FELINE index does imply that there could be a path, but
in fact target is not reachable from the source.  The authors propose a
heuristics: use positions in [some] topological order for X coordinates
of FELINE index, and generate new optimal locally topological sorting to
use for Y coordinates.

Generation numbers (which can be used together with FELINE index, and
what paper does use -- calling them Level Filter) have the advantage of
being able to be calculated incrementally. I think this is what you
wanted to say.

I think it would be possible to calculate FELINE index incrementally,
too.  If Git's topological order is stable, at least for X coordinates
of FELINE index it would be easy.

I have started experimenting with this approach in Jupyter Notebook,
available on Google Colaboratory (you can examine it, run it and edit it
from web browser -- the default is to use remote runtime on Google
cloud).  Currently I am at the stage of reproducing the theoretical
parts of the FELINE paper.

  https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg
  https://drive.google.com/file/d/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg/view?usp=sharing

> I wonder if FELINE was compared directly to a one-dimensional index (I
> apologize that I have not read the paper in detail, so I don't
> understand the indexes they compare against).

They have compared FELINE using real graphs (like ArXiv, Pubmed,
CiteseerX, Uniprot150m) and synthetic DAG against:
 - GRAIL (Graph Reachability Indexing via RAndomized Interval Labeling)
 - FERRARI (Flexible and Efficient Reachability Range Assignment for
   gRapth Indexing), with interval set compression to approximate ones
 - Nuutila's INTERVAL, which extracts complete transitive closure of
   the graph and stores it using PWAH bit-vector compression
 - TF-Label (Topological Folding LABELling)

>                                                It also appears the
> graphs they use for their tests are not commit graphs, which have a
> different shape than many of the digraphs studies by that work.

I plan to check how FELINE index works for commit graphs in
above-mentioned Jupyter Notebook.

> This is all to say: I would love to see an interesting study in this
> direction, specifically comparing this series' definition of
> generation numbers to the 2-dimensional system in FELINE, and on a
> large sample of commit graphs available in open-source data sets
> (Linux kernel, Git, etc.) and possibly on interesting closed-source
> data sets.

I wonder if there are more recent works, with even faster solutions.

>>> The case for "Can X reach Y?" is mostly for commands like 'git branch
>>> --contains', when 'git fetch' checks for forced-updates of branches,
>>> or when the server decides enough negotiation has occurred during a
>>> 'git fetch'. While these may be worth investigating, they also benefit
>>> greatly from the accelerated graph walk introduced in the current
>>> format.
>>>
>>> I would be happy to review any effort to extend the commit-graph
>>> format to include such indexes, as long as the performance benefits
>>> outweigh the complexity to create them.
>>>
>>> [2] http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.719.8396&rep=rep1&type=pdf

I wonder if next low-hanging branch would be to store topological
ordering of commits.  It could be done, I think, with two chunks (or two
parts of one chunk): first to store position in topological order for
each commit (entries sorted by hash), second to store list of commits in
topological order (entries sorted by topological sort).

>>
>> The complexity of calculating FELINE index is O(|V| log(|V|) + |E|), the
>> storage complexity is 2*|V|.
>>
>
> This would be very easy to add as an optional chunk, since it can use
> one row per commit.

Right.

-- 
Jakub Narębski