Re: [PATCH v4 4/7] revision.c: begin refactoring --topo-order logic

Derrick Stolee <stolee@xxxxxxxxx> · Sun, 21 Oct 2018 21:51:10 -0400

On 10/21/2018 9:12 PM, Junio C Hamano wrote:
Jakub Narebski <jnareb@xxxxxxxxx> writes:

So if revs->limited is set (but not because revs->topo_order is set),
which means A..B queries, we will be still using the old algorithm.
All right, though I wonder if it could be improved in the future
(perhaps with the help of other graph labelling / indices than
generation numbers, maybe a positive-cut index).

Do you have an idea why there is no improvement with the new code in
this case?
I didn't get the impression that it would not be possible to improve
the "--topo A..B" case by using generation numbers from this series.
Isn't it just because the necessary code has not been written yet?
In addition to what is needed for "--topo P1 P2 P3..." (all
positive), limited walk needs to notice the bottom boundary and stop
traversal.  Having generation numbers would make it slightly easier
than without, as you know that a positive commit you have will not
be marked UNINTERESTING due to a negative commit whose ancestors
have not been explored, as long as that negative commit has a higher
generation number.  But you still need to adjust the traversal logic
to properly terminate upon hitting UNINTERESTING one, and also
propagate the bit down the history, which is not needed at all if
you only want to support the "positive only" case.

Actually, the code has been written, but the problem is the same as the 
performance issue when I made paint_down_to_common() use generation 
numbers: the algorithm for deciding what is in the set "reachable from A 
but not reachable from B" uses commit-date order as a heuristic to avoid 
walking the entire graph. Yes, the revision parameters specify "limited" 
in order to call "limit_list()", but it uses the same algorithm to 
determine the reachable set difference.

You can test this yourself! Run the following two commands in the Git 
repository using v2.19.1:

    time git log --topo-order -10 master >/dev/null

    time git log --topo-order -10 maint..master >/dev/null

I get 0.39s for the first call and 0.01s for the second. (Note: I 
specified "-10" to ensure we are only writing 10 commits and the output 
size does not factor into the time.) This is because the first walks the 
entire history, while the second uses the heuristic walk to identify a 
much smaller subgraph that the topo-order algorithm uses.

Just as before, by using this algorithm for the B..A case, we are adding 
an extra restriction on the algorithm: always be correct. This results 
in us walking a larger set (everything reachable from B or A with 
generation number at least the smallest generation of a commit reachable 
from only one).

I believe this can be handled by using a smarter generation number (one 
that relies on commit date as a heuristic, but still have enough 
information to guarantee topological relationships), and I've already 
started testing a few of these directions. It is possible now that we 
have concrete graph algorithms to use on real repositories. I hope to 
share a report on my findings in a couple weeks. I'll include how using 
this algorithm compares to the old algorithm in the B..A case.

Thanks,

-Stolee