Re: [bug] "git bisect old v3.0" takes 21 mins on Linux repo

Junio C Hamano <gitster@xxxxxxxxx> · Fri, 17 Jan 2025 08:50:57 -0800

Jeff King <peff@xxxxxxxx> writes:

> Both of those are trading a bit of accuracy in finding the exact
> midpoint in the early steps. It's perhaps another possible option for
> git-bisect itself: if we see a very large number of commits, we could
> try to approximate rather than finding the exact answer.

Another thing the user may (but "bisect" itself cannot) try is to
use a path-limited bisection (that is, if you know your breakage is
inside one subsytem, you only check commits that touch the area).

> In most
> histories I'd expect that taking the midpoint of a linearized topo-order
> would get you a pretty reasonable outcome. E.g.:
>
>   total=$(git rev-list --count v3.0..v6.13-rc7)
>   git rev-list --topo-order v3.0..v6.13-rc7 |
>   tail -n +$((total / 2)) | head -n 1
>
> runs in about 2s on my machine. The commit it finds, ed194d136769,
> is pretty close to the middle:
>
>   $ git rev-list --count v3.0..ed194d136769
>   526863
>   $ git rev-list --count ed194d136769..v6.13-rc7
>   543312

Interesting thought.

When I did the "single strand of pearls" optimization, I recall I
punted and said "we need to count the weight for all merges the
honest way".

One thing we may want to try is *not* to do the count_distance() for
all merges.  For example, if we have 1000 commits in the range,
first you pick a merge M among them and count how many commits in
the range it can reach.  Let's say it reachs 400 commits.

We are trying to find a commit that can reach as close to 500
commits, and we know any ancestor of M would reach fewer than 400
commits, so we know the score they will get would be worse than M
without running count_distance() on them.  We should be able to
exploit this to optimize, shouldn't we?  In order to count the
number for M, count_distance() must have traversed all the ancestor
commits of M before coming up with its answer, so by the time we see
M's score (1000/2 - 400) and realize that it is the best one we have
seen so far that we aim to improve, we know that the score for all
the commits we have seen during that traversal cannot be better than
M's, no?

If M can reach 700 commits (instead of 400), the argument goes the
other way---anything that can reach M can reach even more, so they
cannot be any closer to the middle than M.  Knowing what can reach
M, however, needs something like reachability bitmap, though.