Re: first bisection step takes quite a while

Christian Couder <christian.couder@xxxxxxxxx> · Fri, 21 Feb 2025 10:15:09 +0100

On Fri, Feb 21, 2025 at 2:41 AM Junio C Hamano <gitster@xxxxxxxxx> wrote:

> So, here is something that _could_ be the beginning of a patch, but
> just to illustrate what I tried.
>
>  * In do_find_bisection(), we try each commit on the incoming commit
>    list (which is sorted the way rev-list emits, probably reversed)
>    and count how many commits in the set each merge commit can reach
>    (which is called "weight") in the "honest and stupid" way.  I try
>    to collect these merges in a linear array, and try from the
>    middle to older and newer.  As the loop to compute weight for
>    merges have an early-exit clause that says "oh, this is good
>    enough", this may improve our odds to find a good enough merge
>    early.

Yeah, it seems to me that in practice this is a bit like bisecting on
the first parents first. It would be nice if we had added an option to
bisect on the first parents first, so that we could compare your
improvement and that option.

>  * The "this is good enough" logic currently allows us to be within
>    0.1% of the real halfway point.  Until the candidate set becomes
>    small enough, we could loosen the criteria to allow larger, say
>    3%, slack.  This code is written but not enabled (with "0 &&").

If we want to do this, I think we could loosen the criteria even if
the candidate set is small. Weights are integers so when the number of
candidates is around 33 or less, a 3% criteria will mean an exact
match. Then the last 5 steps or so (as 2^5 = 32) would still be
performed in the same way (with an exact match).

>  * After computing the weight for a merge in "honest and stupid"
>    way, we know what other commits in the set it can reach.  If the
>    weight we computed is way smaller than the half the number of
>    commits in the set, that means these commits we can reach from
>    the merge we are looking at would score even lower.  We could
>    mark them as not-viable before clearing the list to check next
>    merge with "honest and stupid" way.  Again, this code is written
>    but not enabled.
>
> So, in short, I have three ideas, and with the first one (that
> is the most straightforward and least error prone) alone, it seems
> that we gain significant speedup.
>
> The current code took ~20 minutes for me and its result is
>
> $ git bisect start --no-checkout 09fbf3d5020 96d8eab5d0a1
> Bisecting: 581164 revisions left to test after this (roughly 19 steps)
> [2c71ab4bb465c79a4687cc2fd5012e470aebdb1f] Merge branch 'for-upstre...
>
> There are 1144459 commits in the range, and the point chosen by
> bisection can reach 563294 of them.  563294*2 == 1126588, so we are
> 1144459 - 1126588 = 17871 commits away from the theoretical halfway.
>
> With the "let's try from the midway merges" approach without
> changing anything else, I get a different commit (because the
> original algorithm is taking "good enough" early exit), and it took
> about 30 seconds.
>
> $ git bisect start --no-checkout 09fbf3d5020 96d8eab5d0a1
> Bisecting: 572238 revisions left to test after this (roughly 19 steps)
> [eafdca4d7010a0e019aaaace3dd71b432a69b54c] Merge tag 'staging-4.18-...
>
> The size of the original range is the same, of course, 1144459
> commits, and the point chosen by bisection reaches 572220 of them.
> Since 572220*2 = 1144440, we are 1144459 - 1144440 = 19 commits from
> the theoretical halfway.
>
> My current thinking is that the heuristics #1 (which is enabled in
> my experiment and in the following patch) is good enough, #2
> (loosening the "good enough" threshold) is probably not very
> effective, and #3 (discard ones that are closer to the good end than
> a merge that is known to be not viable) might be interesting to
> pursue further but probably tricky to get right.

I agree that #1 is probably good enough.

About #2, I think it could be worth implementing as an option if it is
effective in some cases, but the criteria should be loosened even if
the candidate set is small. The amount of code to implement it is very
small and it's possible that, for some users, having to sometimes
perform one more step of testing is not a big deal, compared to
bisecting speed.

About #3, I think that implementing an option to bisect on the first
parents first is likely more useful than implementing it.

Thanks.