Re: [PATCH 0/6] Compute and consume generation numbers

Jakub Narebski <jnareb@xxxxxxxxx> · Sat, 21 Apr 2018 22:44:57 +0200

Jakub Narebski <jnareb@xxxxxxxxx> writes:
> Derrick Stolee <stolee@xxxxxxxxx> writes:
>> On 4/11/2018 3:32 PM, Jakub Narebski wrote:
>
>>> What would you suggest as a good test that could imply performance? The
>>> Google Colab notebook linked to above includes a function to count
>>> number of commits (nodes / vertices in the commit graph) walked,
>>> currently in the worst case scenario.
>>
>> The two main questions to consider are:
>>
>> 1. Can X reach Y?
>
> That is easy to do.  The function generic_is_reachable() does
> that... though using direct translation of the pseudocode for
> "Algorithm 3: Reachable" from FELINE paper, which is recursive and
> doesn't check if vertex was already visited was not good idea for large
> graphs such as Linux kernel commit graph, oops.  That is why
> generic_is_reachable_large() was created.
[...]

>> And the thing to measure is a commit count. If possible, it would be
>> good to count commits walked (commits whose parent list is enumerated)
>> and commits inspected (commits that were listed as a parent of some
>> walked commit). Walked commits require a commit parse -- albeit from
>> the commit-graph instead of the ODB now -- while inspected commits
>> only check the in-memory cache.
[...]
>>
>> For git.git and Linux, I like to use the release tags as tests. They
>> provide a realistic view of the linear history, and maintenance
>> releases have their own history from the major releases.
>
> Hmmm... testing for v4.9-rc5..v4.9 in Linux kernel commit graphs, the
> FELINE index does not bring any improvements over using just level
> (generation number) filter.  But that may be caused by narrowing od
> commit DAG around releases.
>
> I try do do the same between commits in wide part, with many commits
> with the same level (same generation number) both for source and for
> target commit.  Though this may be unfair to level filter, though...
>
>
> Note however that FELINE index is not unabiguous, like generation
> numbers are (modulo decision whether to start at 0 or at 1); it depends
> on the topological ordering chosen for the X elements.

One can now test reachability on git.git repository; there is a form
where one can plug source and destination revisions at
https://colab.research.google.com/drive/1V-U7_slu5Z3s5iEEMFKhLXtaxSu5xyzg#scrollTo=svNUnSA9O_NK&line=2&uniqifier=1

I have tried the case that is quite unfair to the generation numbers
filter, namely the check between one of recent tags, and one commit that
shares generation number among largest number of other commits.

Here level = generation number-1 (as it starts at 0 for root commit, not
1).

The results are:
 * src = 468165c1d = v2.17.0
 * dst = 66d2e04ec = v2.0.5-5-g66d2e04ec

 * 468165c1d has level 18418 which it shares with 6 commits
 * 66d2e04ec has level 14776 which it shares with 93 commits
 * gen(468165c1d) - gen(66d2e04ec) = 3642

 algorithm  | access  | walk   | maxdepth | visited | level-f  | FELINE-f  |
 -----------+---------+--------+----------+---------+----------+-----------+
 naive      | 48865   | 39599  | 244      | 9200    |          |           |
 level      |  3086   |  2492  | 113      |  528    | 285      |           |
 FELINE     |   283   |   216  |  68      |    0    |          |  25       |
 lev+FELINE |   282   |   215  |  68      |    0    |   5      |  24       |
 -----------+---------+--------+----------+---------+----------+-----------+
 lev+FEL+mpi|    79   |    59  |  21      |    0    |   0      |   0       |

Here we have:
* 'naive' implementation means simple DFS walk, without any filters (cut-offs)
* 'level' means using levels / generation numbers based negative-cut filter
* 'FELINE' means using FELINE index based negative-cut filter
* 'lev+FELINE' means combining generation numbers filter with FELINE filter
* 'mpi' means min-post [smanning-tree] intervals for positive-cut filter;
  note that the code does not walk the path after cut, but it is easy to do

The stats have the following meaning:
* 'access' means accessing the node
* 'walk' is actual walking the node
* 'maxdepth' is maximum depth of the stack used for DFS
* 'level-f' and 'FELINE-f' is number of times levels filter or FELINE filter
  were used for negative-cut; note that those are not disjoint; node can
  be rejected by both level filter and FELINE filter

For v2.17.0 and v2.17.0-rc2 the numbers are much less in FELINE favor:
the results are the same, with 5 commits accessed and 6 walked compared
to 61574 accessed in naive algorithm.

The git.git commit graph has 53128 nodes and 66124 edges, 4 tips / heads
(different child-less commits) and 9 roots, and has average clustering
coefficient 0.000409217.

P.S. Would it be better to move the discussion about possible extensions
to the commit-graph in the form of new chunks (topological order, FELINE
index, min-post intervals, bloom filter for changed files, etc.) be
moved into separate thread?
-- 
Jakub Narębski