Re: [PATCH 0/6] Compute and consume generation numbers

Derrick Stolee <stolee@xxxxxxxxx> · Mon, 23 Apr 2018 09:54:56 -0400

On 4/21/2018 4:44 PM, Jakub Narebski wrote:
Jakub Narebski <jnareb@xxxxxxxxx> writes:
Derrick Stolee <stolee@xxxxxxxxx> writes:
On 4/11/2018 3:32 PM, Jakub Narebski wrote:
What would you suggest as a good test that could imply performance? The
Google Colab notebook linked to above includes a function to count
number of commits (nodes / vertices in the commit graph) walked,
currently in the worst case scenario.
The two main questions to consider are:

1. Can X reach Y?
That is easy to do.  The function generic_is_reachable() does
that... though using direct translation of the pseudocode for
"Algorithm 3: Reachable" from FELINE paper, which is recursive and
doesn't check if vertex was already visited was not good idea for large
graphs such as Linux kernel commit graph, oops.  That is why
generic_is_reachable_large() was created.
[...]

And the thing to measure is a commit count. If possible, it would be
good to count commits walked (commits whose parent list is enumerated)
and commits inspected (commits that were listed as a parent of some
walked commit). Walked commits require a commit parse -- albeit from
the commit-graph instead of the ODB now -- while inspected commits
only check the in-memory cache.
[...]
For git.git and Linux, I like to use the release tags as tests. They
provide a realistic view of the linear history, and maintenance
releases have their own history from the major releases.
Hmmm... testing for v4.9-rc5..v4.9 in Linux kernel commit graphs, the
FELINE index does not bring any improvements over using just level
(generation number) filter.  But that may be caused by narrowing od
commit DAG around releases.

I try do do the same between commits in wide part, with many commits
with the same level (same generation number) both for source and for
target commit.  Though this may be unfair to level filter, though...

Note however that FELINE index is not unabiguous, like generation
numbers are (modulo decision whether to start at 0 or at 1); it depends
on the topological ordering chosen for the X elements.
One can now test reachability on git.git repository; there is a form
where one can plug source and destination revisions at
https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg#scrollTo=svNUnSA9O_NK&line=2&uniqifier=1

I have tried the case that is quite unfair to the generation numbers
filter, namely the check between one of recent tags, and one commit that
shares generation number among largest number of other commits.

Here level = generation number-1 (as it starts at 0 for root commit, not
1).

The results are:
  * src = 468165c1d = v2.17.0
  * dst = 66d2e04ec = v2.0.5-5-g66d2e04ec

  * 468165c1d has level 18418 which it shares with 6 commits
  * 66d2e04ec has level 14776 which it shares with 93 commits
  * gen(468165c1d) - gen(66d2e04ec) = 3642

  algorithm  | access  | walk   | maxdepth | visited | level-f  | FELINE-f  |
  -----------+---------+--------+----------+---------+----------+-----------+
  naive      | 48865   | 39599  | 244      | 9200    |          |           |
  level      |  3086   |  2492  | 113      |  528    | 285      |           |
  FELINE     |   283   |   216  |  68      |    0    |          |  25       |
  lev+FELINE |   282   |   215  |  68      |    0    |   5      |  24       |
  -----------+---------+--------+----------+---------+----------+-----------+
  lev+FEL+mpi|    79   |    59  |  21      |    0    |   0      |   0       |

Here we have:
* 'naive' implementation means simple DFS walk, without any filters (cut-offs)
* 'level' means using levels / generation numbers based negative-cut filter
* 'FELINE' means using FELINE index based negative-cut filter
* 'lev+FELINE' means combining generation numbers filter with FELINE filter
* 'mpi' means min-post [smanning-tree] intervals for positive-cut filter;
   note that the code does not walk the path after cut, but it is easy to do

The stats have the following meaning:
* 'access' means accessing the node
* 'walk' is actual walking the node
* 'maxdepth' is maximum depth of the stack used for DFS
* 'level-f' and 'FELINE-f' is number of times levels filter or FELINE filter
   were used for negative-cut; note that those are not disjoint; node can
   be rejected by both level filter and FELINE filter

For v2.17.0 and v2.17.0-rc2 the numbers are much less in FELINE favor:
the results are the same, with 5 commits accessed and 6 walked compared
to 61574 accessed in naive algorithm.

The git.git commit graph has 53128 nodes and 66124 edges, 4 tips / heads
(different child-less commits) and 9 roots, and has average clustering
coefficient 0.000409217.

Thanks for these results. Now, write a patch. I'm sticking to generation 
numbers for my patch because of the simplified computation, but you can 
contribute a FELINE implementation.

P.S. Would it be better to move the discussion about possible extensions
to the commit-graph in the form of new chunks (topological order, FELINE
index, min-post intervals, bloom filter for changed files, etc.) be
moved into separate thread?

Yes. I think we've exhausted this thought experiment and future 
discussion should revolve around actual implementations in Git with 
end-to-end performance times. The computation time for computing the 
FELINE index should be included in that discussion.

Thanks,
-Stolee