Re: [RFC] Other chunks for commit-graph, part 1 - Bloom filters, topo order, etc.

Jakub Narebski <jnareb@xxxxxxxxx> · Mon, 14 May 2018 22:58:37 +0200

Derrick Stolee <stolee@xxxxxxxxx> writes:
> On 5/12/2018 10:00 AM, Jakub Narebski wrote:
>> Derrick Stolee <stolee@xxxxxxxxx> writes:
>>> On 5/4/2018 3:40 PM, Jakub Narebski wrote:
>>>>
>>>> With early parts of commit-graph feature (ds/commit-graph and
>>>> ds/lazy-load-trees) close to being merged into "master", see
>>>> https://public-inbox.org/git/xmqq4ljtz87g.fsf@xxxxxxxxxxxxxxxxxxxxxxxxx/
>>>> I think it would be good idea to think what other data could be added
>>>> there to make Git even faster.
>>> Before thinking about adding more data to the commit-graph, I think
>>> instead we need to finish taking advantage of the data that is already
>>> there. This means landing the generation number patch [1] (I think
>>> this is close, so I'll send a v6 this week if there is no new
>>> feedback.) and the auto-compute patch [2] (this could use more
>>> feedback, but I'll send a v1 based on the RFC feedback if no one
>>> chimes in).
>>>
>>> [1]
>>> https://public-inbox.org/git/20180501124652.155781-1-dstolee@xxxxxxxxxxxxx/
>>>      [PATCH v5 00/11] Compute and consume generation numbers
>>>
>>> [2]
>>> https://public-inbox.org/git/20180417181028.198397-1-dstolee@xxxxxxxxxxxxx/
>>>      [RFC PATCH 00/12] Integrate commit-graph into 'fsck' and 'gc'
>>
>> Right, so the RFC might be a bit premature; I wanted the discussion to
>> be out there to think about when adding new uses of existing features.
>>
>>
>> DIGRESSION: it is commendable that you are sending patches in small,
>> easy digestible chunks / patch series.  It is much easier to review 10+
>> series than 80+ behemoth (though I understand it is not always possible
>> to subdivide patch series into smaller self-contained sub-series).
>>
>> On the other hand, it would be nice to have some roadmap about series
>> and features to be sent in the future, if possible.  Something like what
>> was done when 'git rebase --interactive' was getting builtinified: moved
>> (in parts) to C.  It would be great to have such roadmap with milestones
>> achieved and milestones to be done in the cover letter for series.
>
> I suppose that is what I intended in the "Future Work" section of
> Documentation/technical/commit-graph.txt. It gives a set of things
> that need to be done in order to make this a default feature, not just
> an expert-level feature. When I wrote the section, I was focused
> entirely on "commit-graph version 1.0" and I didn't know how long that
> would take. The series have been getting interest and review (in great
> part to your interest, Jakub) so they have been moving to 'next'
> faster than I anticipated.
>
> I'll plan on writing a more detailed list of future directions, but
> for now I'll list the things I know about and how I see them in
> priority order:
>
> Commit-graph v1.0:
> * ds/generation-numbers
> * 'verify' and fsck/gc integration
> * correct behavior with shallow clones, replace-objects, and grafts

So the goal of current v1.0 phase is to introduce generation numbers.
use them for better performance ("low hanging fruit"), ensure that it is
automatic and safe -- thus useable for an ordinary user.

>
> Commit-graph v1.1:
> * Place commit-graph storage in the_repository
> * 'git tag --merged' use generation numbers
> * 'git log --graph' use generation numbers
>
> Commit-graph v1.X:
> * Protocol v2 verb for sending commit-graph
> * Bloom filters for changed paths

Thanks, that is what I was missing.  The "Future Work" section, while
very nice to have (because it does not require to follow git
development; it is here in technical documentation), lacked
prioritization and rough milestones map.

[...]
>>> The tougher challenge is `git log --graph`. The revision walk
>>> machinery currently uses two precompute phases before iterating
>>> results to the pager: limit_list() and sort_in_topological_order();
>>> these correspond to two phases of Kahn's algorithm for topo-sort
>>> (compute in-degrees, then walk by peeling commits with in-degree
>>> zero). This requires O(N) time, where N is the number of reachable
>>> commits. Instead, we could make this be O(W) time to output one page
>>> of results, where W is (roughly) the number of reachable commits with
>>> generation number above the last reported result.
>>
>> A reminder: Kahn's algorithm (described for example in [1] and [3])
>> looks like this:
>>
>>    L ← Empty list that will contain the sorted elements
>>    S ← Collection of all nodes with no incoming edge
>>    while S is non-empty do
>>        remove a node 'n' from S
>>        add 'n' to tail of L
>>        for each parent 'm' of 'n' do
>>            decrease the in-degree of 'm'
>>            if 'm' has in-degree of 0 then
>>                insert 'm' into S
>>
>> [1]: https://en.wikipedia.org/wiki/Topological_sorting#Kahn's_algorithm
>> [2]: https://www.geeksforgeeks.org/topological-sorting-indegree-based-solution/
[...]

> I'm not following your pseudocode very well,

Not surprising, I've lost myself in the middle of writing it...

>                                              so instead I'll provide a
> more concrete description of what I mentioned before:

Before we start with helper data structures, we should start with the
inputs.  For topological sort it is set of starting commits that define
the part of commit graph we are interested in; they may be some that are
redundant, but those that do not are assumed to have in-degree of 0.

> Here are some data structures.
>
  IDX : ???

> IDV : A dictionary storing the currently-computed in-degree value of a
> commit 'c' as 'IDV[c]'. Assume all commits start with in-degree 0.

Nitpick: I guess that it is really "with assumed in-degree of 0".

This would be stored using commit-slab, isn't it?

> IDQ : A queue, for storing the "in-degree walk". It is a priority
> queue ordered by generation number.

All right, here we use the fact that if we walk up to and including
generation number of commit, then its in-degree will be calculated
correctly, as any commit with generation number smaller than generation
number of given commit cannot reach given commit and change its
in-degree.

> TOQ : A queue, for storing the "topo-order walk". It is a priority
> queue ordered by "visit order" (see algorithm)

Note: max priority queue ordered by "visit order" is practically a
stack, though I guess we prefer priority queue here to use the same code
for '--date-order'.

>
> Here are some methods.
>
> AdvanceInDegreeWalk(IDQ, IDV, gen):

If I understand it correctly this subroutine walks those commits that do
not have correct in-degree calculated, and walks till all commits with
generation number greater or equal to 'gen' cutoff parameter are walked
and have correct in-degree.

Assumes that all commits in graph but not in IDX have their in-degree
calculated.

>     while !IDX.Empty && IDQ.Max >= gen:
>         c = IDX.Dequeue
>
>         for each parent p of c:

Here the ordering of parents walked does not matter (much), as
generation number is used as cutoff only.

>             IDV[p]++;
>             IDQ.Enqueue(p, generation(p))

All right, looks correct; with gen = 0 it probably reduces to the
current code that does in-degree calculation upfront.

>
> InitializeTopoOrder(R):
>     Initialize IDQ, IDV, TOQ
>
>     min_generation = GENERATION_NUMBER_INFINITY
>     visit_order = 0
>
>     for each commit c in R:
>         min_generation = min(min_generation, generation(c))
>         IDQ.Enqueue(c, generation(c))
>
>     AdvanceInDegreeWalk(IDQ, IDV, min_generation)

This means that we walk BFS-like until all starting points, that is all
commits in R, have their in-degree calculated.  I think we can avoid
that, but it may change the exact behavior of '--topo-order'.

Currently if one commit in R is from orphan branch (like 'todo' in
git.git), or has otherwise very low generation number, it would mean
that AdvanceInDegreeWalk would walk almost all commits.

>     for each commit c in R:
>         if IDV[c] == 0:
>             TOQ.Enqueue(c, visit_order++)

I think we may want to either start from commits in the order given on
command line (given by R), and not reverse order, or maybe even start
from commits with maximum generation number.

>
>     return (IDQ, IDV, TOQ, visit_order)

All right.

Here is my [allegedly] improved version, which assumes that we always
want to start from commit with maximum generation number (there may be
more than one such commit).

Let's add one more data structure:

  RRQ : A queue, for storing remaining commits from R (starting point).
  It is a priority queue ordered by generation number.

InitializeTopoOrder(R):
    Initialize IDQ, IDV, TOQ, RRQ

    max_generation = 0
    visit_order = 0

    for each commit c in R:
        max_generation = max(max_generation, generation(c))
        unless IsRedundantFilter(R / c, c): # optional
            RRQ.Enqueue(c, generation(c))

    while RRQ.Max == max_generation:
        c = RRQ.Dequeue()
        IDV[c] = 0  # commits with max gen have in-degree of 0
        IDQ.Enqueue(c, generation(c))

    # this would be almost no-op
    AdvanceInDegreeWalk(IDQ, IDV, max_generation)

    for each commit c in reversed R:
        if generation(c) == max_generation: # then IDV[c] == 0
            TOQ.Enqueue(c, visit_order++)

    return (IDQ, IDV, TOQ, RRQ, visit_order)

>
> GetNextInTopoOrder(IDQ, IDV, TOQ, ref visit_order):
>     if TOQ.Empty:
>         return null
>
>     c = TOQ.Dequeue()
>
>     for each parent p of c:
>         IDV[p]--
>         AdvanceInDegreeWalk(IDQ, IDV, generation(p))
>         if IDV[p] == 0:
>             TOQ.Enqueue(p, visit_order++)
>
>     return c

With the modification to InitializeTopoOrder it would be

GetNextInTopoOrder(IDQ, IDV, TOQ, RRQ, ref visit_order):
    if TOQ.Empty:
        return null

    c = TOQ.Dequeue()

    for each parent p of c, sorted by generation number:
        # some of maybe in-degree zero are now surely in-degree zero
        while RRQ.Max > generation(p):
           r = RRQ.Dequeue()
           IDQ.Enqueue(r, generation(r))

        AdvanceInDegreeWalk(IDQ, IDV, generation(p))
        IDV[p]--
        if IDV[p] == 0:
            TOQ.Enqueue(p, visit_order++)

    return c

> (I mention "ref visit_order" to be sure we are
> passing-by-reference. In a full implementation, the walk details would
> be grouped into a struct.)
>
> This algorithm is relatively simple, but the hard part is teasing the
> revision walk machinery to initialize the data by calling
> InitializeTopoOrder(R) and to call GetNextInTopoOrder() whenever it
> needs a new element of the walk. That is, it is hard to be sure we are
> not causing side-effects as we make that transformation.

Right.  I'll take a look at existing code to find out how involved that
would be.

It is a pity that we cannot simply turn in-degree calculation into
producer that calculates in-degree in generation number order, and spits
it to FIFO pipe / channel / supply / blocking queue, while topo-order
calculation reads from in-degree pipe, and spits output to pager...

I wonder if we could use Scientist tool to make sure that new and old
code give the same results.

[...]
>> I think that, beside writing patches for Git, exploring how various
>> pieces of data and various indexes affect walking commit graphs is also
>> important.  My explorations shown that, for example, that FELINE index
>> is not good fit for relatively "narrow" graphs of revision history.
>> Exploring this in Python / Jupyter is easier than trying to write a
>> exploratory patch for Git, in my opinion.  Just IMVHO.

Actually I should have said that FELINE index for commit graphs (that I
have checked) does not show improvement over simply using some
topological ordering as negative-cut filter.

>
> You are right. Ruling out possibilities is the best outcome these
> prototypes can have. Your work has saved someone a lot of time in
> investigating that direction.

Regards,
--
Jakub Narębski