[RFC] Other chunks for commit-graph, part 2 - reachability indexes

Hello,

In a previous email I wrote:

JN> As this post is getting long, I'll post other ideas, about commit
JN> labeling for faster reachability queries in a separate email.

This is that email.


Here is the second part of the series, dedicated to discussing what
other data, such as various reachability indexes / labelings, could be
added to the commit-graph file (as new chunks) to make Git even faster.

By reachability I mean answering the question whether commit A can reach
commit B, or in other words whether commit B is an ancestor of commit A.
This type of query is used in many Git operations.

The commit-graph file has one such reachability index built in already,
in the form of generation numbers; in the literature / articles about
graph algorithms this index is called the level or topological level of
a node / vertex.

> 2. The generation number of the commit. Commits with no parents have
>    generation number 1; commits with parents have generation number one
>    more than the maximum generation number of its parents.
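
As a toy illustration (not Git's implementation), the recursive
definition above can be computed iteratively in Python; the
{commit: list-of-parents} mapping here is a hypothetical stand-in for
however the graph is stored:

```python
# Sketch: compute generation numbers for a DAG given as a
# hypothetical {commit: list-of-parents} dict.  Commits with no
# parents get generation number 1; every other commit gets one more
# than the maximum generation number of its parents.
def generation_numbers(parents):
    gen = {}
    for start in parents:
        stack = [start]
        while stack:
            u = stack[-1]
            if u in gen:
                stack.pop()
                continue
            missing = [p for p in parents[u] if p not in gen]
            if missing:
                # Resolve unprocessed parents first (iterative DFS).
                stack.extend(missing)
            else:
                gen[u] = 1 + max((gen[p] for p in parents[u]), default=0)
                stack.pop()
    return gen
```

Note that because each commit is computed once, from its parents only,
the update cost for new commits is proportional to the number of new
commits, which is the incremental-update property discussed below.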

I have played a bit with various reachability indexes, starting with the
ones described in "Reachability Queries in Very Large Graphs: A Fast
Refined Online Search Approach" (FELINE index) paper by Renê R. Veloso,
Loïc Cerf, Wagner Meira Jr and Mohammed J. Zaki (2014), available in the
PDF form at https://openproceedings.org/EDBT/2014/paper_166.pdf

I have started working on a Jupyter Notebook on Google Colaboratory to
find out how much speedup we can get using those indexes on large-ish
Git commit graphs (which turn out to be a quite specific type of graph,
more about which later), namely the git.git and Linux kernel ones.

The Jupyter Notebook (which runs in the Google cloud, but can also be
run locally) uses the Python kernel, the NetworkX library for graph
manipulation, and matplotlib (via NetworkX) for display.

Available at:
https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg
https://drive.google.com/file/d/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg/view?usp=sharing

At its current size it has started to become a bit unwieldy, especially
when recomputing it from the start, so I am thinking about splitting it
up and moving it to GitHub; then I can e.g. put code and data in
separate files, so that they do not have to be recalculated (cloning the
full Linux repository takes quite a bit of time).  I have started the
process: the first step, which is a copy of the Colaboratory notebook,
is now available as the following repo:
https://github.com/jnareb/git-commit-graph-ext
https://github.com/jnareb/git-commit-graph-ext/blob/master/Graphs_FELINE_index.ipynb


Let's start with the reminder from part 1:

> Requirements and recommendations about possible new chunks
> ==========================================================
>
> Because the main goal of commit-graph feature is better performance in
> large repositories, any proposed new chunks (or, at higher level, every
> proposed piece of new data) needs to have the following properties.
>
>
> 1. First, it needs to have bounded and sensible size.  This means that
> the size of new proposed chunk should be constant, proportional to the
> number of commits, or at worst proportional to the number of edges.
>
> From the existing chunks, OIDF (OID Fanout) has constant size, both OIDL
> (OID Lookup) and CGET/CDAT (Commit Data) has size proportional to the
> number of commits, while not always present EDGE (Large Edge List) has
> size proportional to the number of "excess" edges in "huge vertices"
> (octopus merges).
>
>
> 2. Second, we want fast access, in most cases fast random access.  This
> means using fixed-size records.  All currently existing chunks have
> records (columns) of fixed and aligned size.
>
> This restriction means that indexes where we have a variable amount of
> data per vertex (per commit) are discouraged.
>
>
> 3. Third, it needs to be reasonably fast to create, and fast to update.
> This means time to create the chunk should be proportional to the
> number of commits, or to the sum of the number of commits and edges
> (which for the commit graph and other sparse graphs is proportional to
> the number of nodes / commits anyway).  In my opinion time proportional
> to n*log(n), where 'n' is the number of commits, is also acceptable.
> Times proportional to n^2 or n^3 are not acceptable.
>
> It is also strongly preferred that time to update the chunk is
> proportional to the number of new commits, so that incremental
> (append-only) update is possible.  The generation numbers index has this
> property.

Though, as Ævar said in response (and I agree with this):

ÆB> If hypothetically there was some data that significantly sped up Git
ÆB> and the algorithm to generate it was ridiculously expensive, that
ÆB> would be fine when combined with the ability to fetch it
ÆB> pre-generated from the server. There could always be an option where
ÆB> you trust the server and optionally don't verify the data, or where
ÆB> it's much cheaper to verify than compute.

https://public-inbox.org/git/86h8nm2pv8.fsf@xxxxxxxxx/T/#mdbc6a4ef052ae95136faf6d243d0c29ddfac58a8


Types of algorithms to speed up reachability queries
====================================================

The [research] issue for answering reachability queries is to find a
tradeoff between the online processing cost and the offline processing
cost.  All different approaches aim to find a trade-off among three main
costs: (a) query time which is the time to answer a reachability query
online [which for Git would be time to perform operation such as `git
branch --contains`], (b) construction time which is the time to
construct an index offline [which for Git would be time for client or
server to create commit-graph file data], and (c) the index size which
is the space needed to maintain the index [which for Git would be the
size of commit-graph file].

We can classify existing works / existing algorithms into two
categories.  The two main approaches are Label-Only and Label+G.  The
Label-Only approach is to answer reachability queries using only the
computed labels (this includes transitive closure compression and hop
labeling / transitive closure factorization methods).  The Label+G
approach is to make use of the computed labels as much as possible, but
it may need to traverse the graph G (it is also called refined online
search).  The main idea behind Label+G is to reduce the time to
construct an effective index, since the computing cost of creating an
effective Label-Only index can be very high.

Note: this categorization is taken from the IP paper.

Beside those two groups there also exist so-called boosting frameworks.
These algorithms try to reduce answering a reachability query on a very
large graph to answering reachability queries on a smaller graph (the
reachability backbone in the case of the SCARAB framework, a
reduced/simplified graph in the case of the DAG Reduction method).

Let's table boosting frameworks for later (for example to examine
whether SCARAB could be used to choose which commits should have
reachability bitmaps).  If we follow "Requirements and recommendations
about possible new chunks", then we can limit our examination to Label+G
approaches.  Besides, commit graphs can get large enough that Label-Only
approaches start to become infeasible: index creation time and index
size get very large.


Commit graph characteristics
============================

But first let's try to address the elephant in the room: would those
algorithms work well for the graph of revisions?

These methods are usually evaluated on a variety of real graph datasets,
both small and large, as well as large synthetic ones.  The real graphs
datasets include XML documents, genome databases, citation networks,
semantic knowledge databases and ontology terms, taxonomies, social
networks and web graphs.

It turns out that the graph of revisions has a size (number of
commits/nodes: 50k for git.git, 750k for the Linux kernel, 1.7M for MS
Windows) that falls somewhere in the middle between the small graphs
with 6k to 40k nodes and the large graphs, most of which have 1.5M nodes
or more.

Many of those example graphs (that are used for evaluation of
reachability algorithms) are approximately scale-free, with relative
commonness of vertices with a degree that greatly exceeds the average --
which is not the case for graph of revisions.  Some approaches include
special handling for those vertices to speed up queries; this probably
is not needed for DAG of commits, and I will skip these parts for now.

Commit graphs are sparse graphs, with the average number of parents (the
ratio of edges to nodes in the graph) below 2.0, and even closer to 1.

One major difference is that in those graphs the fraction of
reachability queries that are answered negatively, over all possible
reachability queries, is over 90%.  In other words, the ratio of
reachability queries that are answered positively over all possible
reachability queries is very small; this ratio is called the
reachability ratio (or simply R-ratio), usually denoted 'r'.  For those
real graph datasets, the R-ratio is between 2.1e-1 and 1.3e-7.  But for
the graph of revisions (based on a sample of one) it looks like this
R-ratio is around 4.5e-1 (with, if I am not mistaken, the maximum
R-ratio possible in DAGs being 0.5 or 5.0e-1, that is 50%).  This may
affect the evaluation of methods for commit graphs; on the other hand,
as positive queries are more interesting, some works do evaluations
separately on positive and negative queries, or on an equal mixture of
positive and negative queries.

Another issue is that many of those "evaluation" graphs usually have a
large number of sources (nodes with only outgoing edges, "roots" in
graph terminology, "tips" or "heads" in Git terminology -- childless
nodes) and/or a large number of sinks (nodes with no outgoing edges,
"leaves" in graph terminology, "roots" in Git terminology -- parentless
nodes).  This seems to matter for the FELINE index, which, as my
experiments in the mentioned Jupyter notebook show, is much less
effective on commit graphs than indicated in the original work.  Note
that while live repositories would have a much larger number of branches
than the public repositories (that I have used for tests), where there
are only a few long-lived public branches, even for the Windows repo it
is a small percentage of the total number of nodes / commits (the
Windows repo, according to the Git Merge 2018 presentation by Derrick,
has 90k branches and 1.7M commits; it is not as if 1/4 of the nodes were
source nodes).


Negative and positive-cut filters, false negatives and false positives
======================================================================

In Label+Graph approaches, indexing (node/edge labels) is enough to
answer some queries in constant time.

If the indexing can answer positive queries in constant time we call it
a _positive-cut filter_.  This means that querying the index can state
that node B is reachable from node A (a positive result); reachability
is guaranteed.  However, if we don't get such an answer from the index,
we don't know whether B is reachable from A or not.  The case where B is
reachable from A, but the index doesn't say so, we can call a _false
negative_: we get a negative answer from the index, while the actual
result of the reachability query is positive.

If the indexing can answer negative queries in constant time we call it
a _negative-cut filter_.  This means that querying the index can state
that node B is definitely not reachable from node A (a negative result).
Otherwise we don't know whether B is or is not reachable from A.  The
case where there is no path from A to B, but the index doesn't say so,
we can call a _false positive_ or _falsely-implied path_.

Generation numbers (or topological levels) are an example of a
negative-cut filter.  If gen(A) < gen(B) then A cannot reach B.

Note: The terminology used here is taken from the FELINE paper (which
does not necessarily mean that it originated there; it appears in the
earlier GRAIL paper too).

When one wants just to find whether there exists a path from A to B,
both positive-cut and negative-cut filters can be used.

  Data: (u, v) ∈ V^2, two vertices of the DAG
  Result: r(u, v), whether v is reachable from u
  Note: simplified DFS algorithm, no visited filter
  begin
      if positive-cut(u, v) then
          return true;
      if u == v then
          return true;
      if not negative-cut(u, v) then
          forall the w ∈ parents(u) do
              if Reachable(w, v) then
                  return true;

      return false;

This is the case for '--contains' and '--merged' queries in Git.
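
The pseudo-code above can be translated into an iterative Python
sketch; the positive_cut / negative_cut predicates and the parents
mapping here are hypothetical stand-ins for whatever index and graph
storage is used:

```python
# Iterative DFS reachability sketch with pluggable cut filters.
# positive_cut(u, v) == True  means u certainly reaches v;
# negative_cut(u, v) == True  means u certainly cannot reach v.
def reachable(u, v, parents, positive_cut, negative_cut):
    stack, visited = [u], set()
    while stack:
        w = stack.pop()
        if w == v or positive_cut(w, v):
            return True
        if w in visited or negative_cut(w, v):
            continue            # prune this branch of the search
        visited.add(w)
        stack.extend(parents[w])
    return False
```

Unlike the simplified pseudo-code, this version keeps a visited set, so
each commit is examined at most once even on diamond-shaped histories.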

When one wants to find a single path from A to B if it exists, then both
negative-cut and most positive-cut filters can be used.

When one wants to find out all paths from A to B (e.g. for `git log A ^B
--strict-ancestor`) then in most cases only negative-cut filters are
really useful.


Online search algorithms
========================

When the index or indexes used cannot answer reachability query by
itself, one needs to perform a search from the source node to the target
node, applying positive-cut and negative-cut filters to reduce search
space.

One possible solution, often used for reachability testing in various
works, is to perform a depth-first search (DFS).  This can be done
simply using recursion (like in the above pseudo-code), but for large
graphs one needs to use an iterative approach employing a stack.  DFS is
also needed to compute some of the reachability indexes.

However, DFS is not always the best strategy.  Breadth-first search
(BFS) guarantees finding the shortest path, at the cost of using more
memory.  On average, BFS is expected to be faster in the cases where the
target node is close to a source node that has a large reachability
set.  This is especially true for graphs of revisions, which are deep
but narrow (their width is much smaller than their depth; the maximum
number of commits with the same generation number is much smaller than
the maximum generation number).  In BFS we store nodes to be walked in a
queue.

Besides the standard graph search strategies of DFS and BFS, sometimes
bidirectional breadth-first search (BBFS) is also considered, which has
a few advantages over BFS.  In bidirectional search two BFS are started
simultaneously, utilizing separate queues: one starting from the source
node and going forward, and one starting from the target node and going
backward (walking the reverse of the DAG).  The first advantage of
bidirectional search is that it can answer positive queries faster when
the average degree of the graph is high.  For negative queries, the
worst case scenario is that bidirectional search takes twice as much
time as a BFS would take; however, there is a substantial number of
cases where bidirectional search is much faster than BFS.

Some indexes are geared towards making BBFS faster; this needs to be
taken into account when evaluating them, as we would not want to have to
deal with reversed commit graph.

Git actually uses a variant of BFS in paint_down_to_common() in
commit.c, where it uses a priority queue, sorted by generation number
and then by date.  We can say that it is a closest-first search variant
of BFS.
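
A minimal sketch of such a closest-first walk (not Git's actual code)
can be written with Python's heapq as the priority queue, keyed on
generation number; the parents and gen mappings are hypothetical
inputs:

```python
import heapq

# Walk commits highest-generation-first, a closest-first BFS variant.
# 'parents' maps commit -> list of parents; 'gen' maps commit ->
# generation number.
def closest_first_walk(tips, parents, gen):
    heap = [(-gen[t], t) for t in tips]   # negate: heapq is a min-heap
    heapq.heapify(heap)
    seen = set(tips)
    order = []
    while heap:
        _, u = heapq.heappop(heap)        # commit with largest gen
        order.append(u)
        for p in parents[u]:
            if p not in seen:
                seen.add(p)
                heapq.heappush(heap, (-gen[p], p))
    return order
```

Because gen(child) > gen(parent) always holds, this walk never visits a
commit before all the relevant commits "above" it have been handled.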


Reachability indexes
====================

The Label+Graph approaches include the following algorithms to speed
up reachability queries:

- Tree+SPPI (2005)
  (Surrogate & Surplus Predecessor Index)
- GRIPP (2007)
  (GRaph Indexing based on Pre- and Postorder numbering)
+ GRAIL (2010 & 2012)
  (Graph Reachability indexing via rAndomized Interval Labeling)
* PReaCH (2014)
  (Pruned Reachability Contraction Hierarchies)
+ FERRARI (2013)
  (Flexible and Efficient Reachability Range Assignment for gRaph Indexing)
+ FELINE (2014)
  (Fast rEfined onLINE search)
+ IP (2014)
  (Independent Permutation labeling approach)
+ BFL (2016)
  (Bloom Filter Labeling)

Algorithms prefixed with '-' are the older ones usually compared
against in different works, which I won't cover here, while algorithms
prefixed with '+' are those that I will cover somewhat.  Algorithms
prefixed with '*' are less known works that I think might be
interesting.


Levels / generation numbers
---------------------------

One of the simplest negative-cut filters, used by almost all works as a
supplementary filter... and also by Git's commit-graph feature.  The
generation number, or [topological] level, is defined as the length of
the longest path from a sink node ("root" in Git terminology).  One can
also define "reverse" or "backward" levels starting from the source
nodes, and some works use that.

More useful for actually computing the generation number of a commit,
or the level of a node, is the recursive definition given at the
beginning of this email.  Note that usually nodes without outgoing edges
("root" nodes) are assigned level zero, but giving them a value of one,
like the commit-graph feature does, is also a valid convention.

The recursive definition shows that it is possible to calculate
generation numbers incrementally.  In particular, the cost of updating
generation numbers is proportional to the number of new commits.  This
is not the case for "reverse" level numbers -- we won't consider them
further, then.

Generation numbers / topological levels are denoted using gen(B) symbol
in commit-graph documentation and commit messages, and l_u or L(u) in
works about graph reachability algorithms.  Let's use gen(A) here.

The following holds true:

  If A can reach B, and A != B, then gen(A) > gen(B)

This means that we get the following negative cut condition (this
condition is equivalent to the one above):

  If gen(A) <= gen(B) and A != B, then A cannot reach B

Sometimes it is easier to make use of the weaker negative-cut condition:

  If gen(A) < gen(B), then A cannot reach B


Topological ordering
--------------------

A topological sort or topological ordering of a directed acyclic graph
is a linear ordering of its vertices such that for every directed edge
(u,v) from vertex u to vertex v, u comes before v in the ordering.  Any
DAG has at least one topological ordering, and algorithms are known for
constructing a topological ordering of any DAG in linear time.

A topological ordering should be fairly easy to update; in particular,
a reverse topological ordering can be updated in time proportional to
the number of added commits (though there are no guarantees about the
properties of such a topological order).

The following holds true (numbering commits so that ancestors get
smaller values, consistent with generation numbers):

  If A can reach B, and A != B, then topo(A) > topo(B)
  If A can reach B, then topo(A) >= topo(B)

Because this is an ordering, we also have that if topo(A) <= topo(B)
and topo(A) >= topo(B), then A = B; that is, if topo(A) = topo(B) then
A = B.

This means that we have the following negative-cut condition:

  If topo(A) < topo(B), then A cannot reach B

The condition is reversed for the reverse topological ordering.

Note: The FELINE index can be thought of as composed of two topological
orderings, where the second ordering is constructed using a heuristic
algorithm in such a way that it tries to minimize the number of false
positives (falsely-implied paths) for the combination of the two
topological-ordering negative-cut filters.


Post-order in spanning tree
---------------------------

A directed rooted tree is a directed graph in which, for a vertex u
called the [tree] root and any other vertex v, there is exactly one
directed path from u to v.  A spanning tree of a directed acyclic graph
G is a subgraph that is a tree or a set of trees (directed rooted trees
for a DAG) which includes all of the vertices of graph G.  Source nodes
of G are roots of spanning trees of G.

The reachability problem on trees can be solved effectively by interval
labeling, which takes linear time and space for constructing the index,
and provides constant-time querying.  Given a tree/forest T, interval
labeling assigns each node a label I(u) = [s_u, e_u] which represents
the interval starting at s_u and ending at e_u.  A labeling suitable for
answering reachability queries in trees must satisfy the condition that
I(u) subsumes I(v), or in other words that interval I(v) is a subset of
interval I(u), if and only if u reaches v.

One such labeling is _min-post labeling_ method for trees.  In this
labeling e_u = post(u) is the post-order value of the node u, defined as
the rank of the node u in a post order traversal of the tree.  The
starting value of the interval s_u is defined as the lowest post-order
rank for any node x in the subtree rooted at u (i.e., including u).
Note that because of how this interval was defined the check
I(v) ⊆ I(u) can be replaced by simpler post(v) ∈ I(u).

Pre-order traversal of the tree gives the _pre-max labeling_ method for
trees, exactly equivalent to the min-post labeling described above.
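
A small Python sketch of min-post labeling for a rooted tree (with a
hypothetical {node: [children]} input); post-order ranks are assigned
during a single DFS, and s_u is simply the counter value on entry to u:

```python
# Min-post interval labeling of a tree: labels[u] = (s_u, post(u)).
# u reaches v in the tree iff post(v) lies in [s_u, post(u)].
def min_post_labels(children, root):
    labels = {}
    counter = 0
    def visit(u):
        nonlocal counter
        start = counter               # lowest post-order rank in subtree
        for c in children.get(u, []):
            visit(c)
        labels[u] = (start, counter)  # e_u = post(u), assigned last
        counter += 1
    visit(root)
    return labels

def tree_reaches(labels, u, v):
    s, e = labels[u]
    return s <= labels[v][1] <= e     # post(v) in I(u)
```

Note the check uses only post(v), as described above: because s_u is
the minimum post-order rank in u's subtree, interval containment
reduces to membership of a single number.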

This method, used on a spanning tree, gives a positive-cut filter: if it
gives a positive result for the u ---> v query, then node v is reachable
from u via the spanning tree.  If the answer from the spanning tree
labeling is negative, node v can still be reachable from u, but via a
non-tree edge.  This is the idea behind the positive-cut filter in the
GRAIL method, and this is also the positive-cut filter used in the
FELINE work/paper.  GRAIL calls this positive-cut interval an interior
interval.


The post-order or pre-order labeling of a spanning tree is also a
labeling of nodes in the original DAG.  Hence for a negative-cut filter
we can assign to each node u the interval starting at the minimum of the
labels of nodes reachable in the DAG from u, and ending at the maximum
of the labels of nodes reachable in the DAG from u.  Then if node v has
a label outside this interval I'(u), then u cannot reach v.  GRAIL calls
this negative-cut interval an exterior interval.

Now if the traversal of the tree was done using DFS (depth-first
search), then the end of the min-post interval I(u) for the spanning
tree is also the end of the interval I'(u) for the DAG.  Moreover, the
minimum of post(v) over nodes reachable from u, that is the starting
point of the DAG interval, can be calculated iteratively during the
post-order traversal of the tree.  (It is possible that DFS is not
needed; the PReaCH work proves it for DFS, but I have not examined the
proof in the GRAIL work in detail.)  The GRAIL approach uses those two
intervals together: the interval label for node u is given as
I(u) = [s_u, m_u, e_u], where EI(u) = [s_u, e_u] and II(u) = [m_u, e_u]
are the exterior (negative-cut) and interior (positive-cut) intervals
respectively.

The motivating idea in GRAIL is to use interval labeling multiple
times, for multiple random spanning trees, to improve the searching
phase (answering reachability queries).  The GRAIL work proposes several
heuristics for how to choose the best set of spanning trees.  The
recommended number of those trees for large graphs is k = 5.

Using multiple spanning trees is not the only solution.  An alternative
is trying to get more out of a single DFS numbering.  The PReaCH work
(arXiv:1404.4465) adds one more possible, and possibly large,
positive-cut interval (called the "full" interval there), and possibly
one more negative-cut interval (called the "empty" interval).  Note that
the original work uses pre-order numbering and pre-max intervals, but
the result can be easily translated to post-order numbering and min-post
intervals (I have done this in the mentioned Jupyter notebook).

The additional positive-cut interval for node u, if it exists, is the
positive-cut interval of a node w = p_tree(u) that is reachable from u,
is not in the positive-cut interval of u, and has the maximum size of
said positive-cut interval.  It can be computed iteratively, while
computing the post-order numbering of u.  This seems promising for the
graph of revisions.

One can also find an additional negative-cut interval, by finding f_gap
such that [f_gap, min] is an "empty" interval; the negative cut is then
split from [f_min, post] into two intervals: [f_min, f_gap] and
[min, post].  This also can be computed iteratively.


The FERRARI approach is more generic.  It starts with the fact that the
reachable set R(v) of a vertex v in the DAG is only partly represented
by the spanning tree interval I_T(v), as the tree interval only accounts
for reachability relationships that are preserved in the tree T.  The
reachable set R(v) can be represented as a set of intervals II(v).  The
resulting index (a collection of reachable interval sets) can be
regarded as a materialization of the transitive closure of the graph,
which gets too large and is thus infeasible for large graphs.

Instead of storing the full set of intervals II(v) representing the
reachability set R(v), the FERRARI index assigns a mixture of exact and
approximate reachability intervals, given a user-specified size
constraint on the index.  In FERRARI-L (the local version) the set is
replaced by at most k intervals, some of which come from II(v) and some
of which are merges of intervals from II(v).  If node w is in one of the
exact intervals of node v, then v can reach w; if node w is outside all
of the intervals of node v, then v cannot reach w.  Intervals are chosen
in such a way as to maximize the sum of the sizes of the exact
intervals, maximizing positive-cut effectiveness.

In the FERRARI-G variant the user-specified size constraint is made
global; some nodes may use more intervals, and some less, keeping the
average constant.  This allows using the index space more effectively,
at the cost of a variable-length index and slower access.


Contraction hierarchies
-----------------------

In general, a Reachability Contraction Hierarchy (RCH) can be defined
based on any numbering order.  We successively contract nodes with
increasing order(v).  Contracting node v means removing it from the
graph and possibly inserting new edges such that the reachability
relation in the new graph is the same as in the previous graph.  The
query algorithm does not use the contracted versions of the graph, but
only the ordering and the shortcuts.  This algorithm is based on the
observation that in an RCH it suffices to look at "up-down" paths, with
the contraction order first increasing, then decreasing; this works best
with bidirectional breadth-first search (BBFS).

The PReaCH approach exploits the fact that any DAG contains source
nodes (with in-degree zero) and sink nodes (with out-degree zero).
Contracting such a node v never requires the insertion of shortcuts.

The index in this case is just one number per node: the contraction
order.  I'm not sure how useful that would be for a BFS search without
accessing the reverse graph.


Graph dominance drawing
-----------------------

The fundamental idea behind FELINE is to associate every vertex u in V
with a unique ordered pair of natural numbers i(u), so that if vertex u
can reach v, then i(u) ≤ i(v) element-wise, that is, the first element
of i(u) is less than or equal to the first element of i(v), and likewise
for the second elements.  If this pair of integers (x, y) associated
with a vertex is understood as coordinates in the Cartesian plane, then
the implication means that paths always point right and up, and it is
enough to check the upper-right quadrant.

The method is inspired by weak dominance drawing, which guarantees the
implication for edges.  FELINE work shows proof that this leads to the
implication for paths.

Such a pair of numbers is a negative-cut filter.  The algorithm for
creating such an index in the FELINE work uses two topological
orderings, where the second is chosen to minimize the number of false
positives for the index.

It turns out, however, from the experiments in the mentioned Jupyter
notebook, that this method doesn't work that well on commit graphs.
When there are only a few sources and only a few sinks, node positions
gather around the 'x = y' diagonal line, and thus the FELINE index check
almost reduces to the topological ordering filter mentioned earlier.
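
The FELINE negative cut itself is a two-line check; a sketch, where
coord is a hypothetical {node: (x, y)} mapping of the two topological
ranks, following the convention above (u reaching v implies both of
u's coordinates are <= v's):

```python
# FELINE-style negative cut: if either coordinate of u exceeds the
# corresponding coordinate of v, then u certainly cannot reach v.
def feline_cut(coord, u, v):
    (xu, yu), (xv, yv) = coord[u], coord[v]
    return xu > xv or yu > yv   # True: u definitely cannot reach v
```

When the cut returns False, a graph search is still needed, since the
pair of orderings can falsely imply a path.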


Approximate set inclusion
-------------------------

Both IP and BFL treat reachability testing from a vertex u to a vertex
v as set-containment testing.  Intuitively, if a vertex u can reach a
vertex v, then u must be able to reach all the vertices that v can
reach.  Let Out(x) denote the set of all vertices that a vertex x can
reach in a graph.  If u can reach v, then Out(v) ⊆ Out(u).  This is used
by checking Out(v) ⊈ Out(u) instead (that Out(v) is not a subset of or
equal to Out(u)), which is to check whether there is at least one
element in Out(v) that is not in Out(u).

Instead of getting an exact answer for this test, both get an answer
with a given probability guarantee, in a negative-cut fashion.  The IP
label uses k-min-wise independent permutations, while BFL uses Bloom
filters.
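
A toy Bloom-filter labeling in the spirit of BFL (not the actual BFL
algorithm): label each node with the bitwise OR of hashes of its
reachable set; if Out(v) sets a bit that Out(u) does not, then Out(v)
cannot be a subset of Out(u), so u cannot reach v.  The out_sets input
is a hypothetical precomputed reachable-set mapping; a real index would
of course avoid materializing it:

```python
import zlib

# Toy Bloom-filter labels over precomputed reachable sets.
def bloom_labels(out_sets, bits=64):
    labels = {}
    for u, reach in out_sets.items():
        b = 0
        for x in reach:
            b |= 1 << (zlib.crc32(x.encode()) % bits)
        labels[u] = b
    return labels

def bloom_cut(labels, u, v):
    # A bit set in v's label but not in u's proves Out(v) is not a
    # subset of Out(u), hence u cannot reach v.  This cut has no
    # false negatives; hash collisions only cause extra "maybe"s.
    return labels[v] & ~labels[u] != 0
```

Note the asymmetry: a True answer is a guaranteed negative cut, while a
False answer tells us nothing and the search must proceed.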

Those approaches work well on graphs that do not have large
reachability sets (short depth), and where the ratio of positive queries
to all queries is very small.  This means that they most probably won't
work well for commit graphs, where neither holds true.


Summary
=======

The min-post labeling (the base of the GRAIL approach) looks promising;
it needs to be checked whether the PReaCH approach or the FERRARI
approach would be better for commit graphs.

I hope that this somewhat long post was informative,
-- 
Jakub Narębski



