Re: [PATCH v4 4/7] revision.c: begin refactoring --topo-order logic

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Junio C Hamano <gitster@xxxxxxxxx> writes:

> Derrick Stolee <stolee@xxxxxxxxx> writes:
>
>>     time git log --topo-order -10 master >/dev/null
>>
>>     time git log --topo-order -10 maint..master >/dev/null
>>
>> I get 0.39s for the first call and 0.01s for the second. (Note: I
>> specified "-10" to ensure we are only writing 10 commits and the
>> output size does not factor into the time.) This is because the first
>> walks the entire history, while the second uses the heuristic walk to
>> identify a much smaller subgraph that the topo-order algorithm uses.
>
> The algorithm can be fooled by skewed timestamps (i.e. that SLOP
> thing tries to work around), but is helped by being able to leave
> early, and it will give us the correct answer as long as there is no
> timestamp inversion.
>
> But monotonically increasing "timestamp" without inversion is what
> we invented "generation numbers" for, no?  When there is no
> timestamp inversion, would a walk based on commit timestamps walk
> smaller set than a walk based on commit generation numbers?

The problem, as far as I understand it, is in B..A case, to be more
exact with the walk from the exclusion set.  There can be cases when
there are two paths from the commit in the exclusion set, and sorting by
generation number will always walk the longer one, while date-order
based heuristic walk traverses smaller subgraph.

This problem was described in "[PATCH 1/1] commit: don't use generation
numbers if not needed" [1]; relevant fragment cited below:

DS> For instance, computing the merge-base between consecutive versions of
DS> the Linux kernel has no effect for versions after v4.9, but 'git
DS> merge-base v4.8 v4.9' presents a performance regression [...]
DS> 
DS> The topology of this case can be described in a simplified way
DS> here:
DS> 
DS>   v4.9
DS>    |  \
DS>    |   \
DS>   v4.8  \
DS>    | \   \
DS>    |  \   |
DS>   ...  A  B
DS>    |  /  /
DS>    | /  /
DS>    |/__/
DS>    C
DS> 
DS> Here, the "..." means "a very long line of commits". By generation
DS> number, A and B have generation one more than C. However, A and B
DS> have commit date higher than most of the commits reachable from
DS> v4.8. When the walk reaches v4.8, we realize that it has PARENT1
DS> and PARENT2 flags, so everything it can reach is marked as STALE,
DS> including A. B has only the PARENT1 flag, so is not STALE.
DS> 
DS> When paint_down_to_common() is run using
DS> compare_commits_by_commit_date, A and B are removed from the queue
DS> early and C is inserted into the queue. At this point, C and the
DS> rest of the queue entries are marked as STALE. The loop then
DS> terminates.
DS> 
DS> When paint_down_to_common() is run using
DS> compare_commits_by_gen_then_commit_date, B is removed from the
DS> queue only after the many commits reachable from v4.8 are explored.
DS> This causes the loop to run longer. The reason for this regression
DS> is simple: the queue order is intended to not explore a commit
DS> until everything that _could_ reach that commit is explored. From
DS> the information gathered by the original ordering, we have no
DS> guarantee that there is not a commit D reachable from v4.8 that
DS> can also reach B. We gained absolute correctness in exchange for
DS> a performance regression.

[1]: https://public-inbox.org/git/efa3720fb40638e5d61c6130b55e3348d8e4339e.1535633886.git.gitgitgadget@xxxxxxxxx/T/#u

>> Just as before, by using this algorithm for the B..A case, we are
>> adding an extra restriction on the algorithm: always be correct. This
>> results in us walking a larger set (everything reachable from B or A
>> with generation number at least the smallest generation of a commit
>> reachable from only one).
>>
>> I believe this can be handled by using a smarter generation number
>> (one that relies on commit date as a heuristic, but still have enough
>> information to guarantee topological relationships), and I've already
>> started testing a few of these directions.
>
> Good to hear.

I'm not sure if you can enhance generation numbers, or rather using /
sorting of generation numbers in such way.

With generation numbers, if you have two possible paths to the merge
commit, and one is much longer than the other, then the commits on this
longer path will have larger generation numbers.


        0     1     2     3     4     5     6     7       = gen(c) - gen(B)
    --- B<----*<----*<----*<----*<----*<----*<----M ---
        ^                                        /
         \--------------------------x<--------x</         = gen(c) - gen(B)
                                    1         2

    ===================== commit date ======================>

Using priority-queue which selects max generation number would then
always walk the longer path (or walk longer path first).

This does not matter for the walk from the inclusive set (A in B..A), as
we want to walk all commits anyway; we walk both paths, it does not
matter which we walk first (it may change the unimportant parts of
topological sort, though).

For the walk from the exclusive set (B in B..A), any path is good; we
don't need and do not want to walk all paths.  Sorting by generation
numbers will always choose the longer path, while sorting by commit date
may walk the "shortcut" first, thus marking commits reachable from the
inclusive subset (from A) STALE earlier, thus earlier notice of
all-stale, and end of walk.

I think that in the case of walking from the exclusion subset, using
positive-cut reachbility index would be more useful than trying to come
up with "smarter generation number" (though I am not saying that it
would be impossible).  For example using reachability bitmap indices, or
post-order in spanning tree (the latter descibed in [2]) to mark more
commits stale than just parents, may help more.

Actually, this is something independent from "smarter generation
numbers", and can only help...

[2]: "[RFC] Other chunks for commit-graph, part 2 - reachability indexes"
     https://public-inbox.org/git/86muxcuyod.fsf@xxxxxxxxx/

-- 
Jakub Narębski




[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux