Re: [PATCH v3 7/7] revision.c: refactor basic topo-order logic

Jeff King <peff@xxxxxxxx> · Thu, 11 Oct 2018 11:35:11 -0400

On Fri, Sep 21, 2018 at 10:39:36AM -0700, Derrick Stolee via GitGitGadget wrote:

> From: Derrick Stolee <dstolee@xxxxxxxxxxxxx>
> 
> When running a command like 'git rev-list --topo-order HEAD',
> Git performed the following steps:
> [...]
> In the new algorithm, these three steps correspond to three
> different commit walks. We run these walks simultaneously,

A minor nit, but this commit message doesn't mention the most basic
thing up front: that its main purpose is to introduce a new algorithm
for topo-order. ;)

It's obvious in the context of reviewing the series, but somebody
reading "git log" later may want a bit more. Perhaps:

  revision.c: implement generation-based topo-order algorithm

as a subject, and/or an introductory paragraph like:

  The current --topo-order algorithm requires walking all commits we
  are going to output up front, topo-sorting them, all before
  outputting the first value. This patch introduces a new algorithm
  which uses stored generation numbers to incrementally walk in
  topo-order, outputting commits as we go.

Other than that, I find this to be a wonderfully explanatory commit
message. :)

> The walks are as follows:
> 
> 1. EXPLORE: using the explore_queue priority queue (ordered by
>    maximizing the generation number), parse each reachable
>    commit until all commits in the queue have generation
>    number strictly lower than needed. During this walk, update
>    the UNINTERESTING flags as necessary.

OK, this makes sense. If we know that everybody else in our queue is at
generation X, then it is safe to output a commit at generation greater
than X.

I think this by itself would allow us to implement "show no parents
before all of its children are shown", right? But --topo-order promises
a bit more: "avoid showing commits no multiple lines of history
intermixed".

I guess also INFINITY generation numbers need more. For a real
generation number, we know that "gen(A) == gen(B)" implies that there is
no ancestry relationship between the two. But not so for INFINITY.

> 2. INDEGREE: using the indegree_queue priority queue (ordered
>    by maximizing the generation number), add one to the in-
>    degree of each parent for each commit that is walked. Since
>    we walk in order of decreasing generation number, we know
>    that discovering an in-degree value of 0 means the value for
>    that commit was not initialized, so should be initialized to
>    two. (Recall that in-degree value "1" is what we use to say a
>    commit is ready for output.) As we iterate the parents of a
>    commit during this walk, ensure the EXPLORE walk has walked
>    beyond their generation numbers.

I wondered how this would work for INFINITY. We can't know the order of
a bunch of INFINITY nodes at all, so we never know when their in-degree
values are "done". But if I understand the EXPLORE walk, we'd basically
walk all of INFINITY down to something with a real generation number. Is
that right?

But after that, I'm not totally clear on why we need this INDEGREE walk.

> 3. TOPO: using the topo_queue priority queue (ordered based on
>    the sort_order given, which could be commit-date, author-
>    date, or typical topo-order which treats the queue as a LIFO
>    stack), remove a commit from the queue and decrement the
>    in-degree of each parent. If a parent has an in-degree of
>    one, then we add it to the topo_queue. Before we decrement
>    the in-degree, however, ensure the INDEGREE walk has walked
>    beyond that generation number.

OK, this makes sense to make --author-date-order, etc, work. Potentially
those numbers might have no relationship at all with the graph
structure, but we promise "no parent before its children are shown", so
this is really just a tie-breaker after the topo-sort anyway. As long as
steps 1 and 2 are correct and produce a complete set of commits for one
"layer", this should be OK.

I guess I'm not 100% convinced that we don't have a case where we
haven't yet parsed or considered some commit that we know cannot have an
ancestry relationship with commits we are outputting. But it may have an
author-date-order relationship.

(I'm not at all convinced that this _is_ a problem, and I suspect it
isn't; I'm only suggesting I haven't fully grokked the proof).

> ---
>  object.h   |   4 +-
>  revision.c | 196 +++++++++++++++++++++++++++++++++++++++++++++++++++--
>  revision.h |   2 +
>  3 files changed, 194 insertions(+), 8 deletions(-)

I'll pause here on evaluating the actual code. It looks sane from a
cursory read, but there's no point in digging further until I'm sure I
fully understand the algorithm. I think that needs a little more brain
power from me, and hopefully discussion around my comments above will
help trigger that.

-Peff