Re: [PATCH] rebase: be cleverer with rebased upstream branches

Santi Béjar <santi@xxxxxxxxxxx> · Mon, 14 Mar 2011 00:09:01 +0100

[Also quoted text from On Sun, Mar 13, 2011 at 4:14 AM, Martin von
Zweigbergk <martin.von.zweigbergk@xxxxxxxxx>]

On Sat, Mar 12, 2011 at 10:15 PM, Martin von Zweigbergk
<martin.von.zweigbergk@xxxxxxxxx> wrote:
> On Thu, 17 Feb 2011, Santi B?jar wrote:
>
>> On Wed, Feb 16, 2011 at 5:45 PM, Martin von Zweigbergk
>> <martin.von.zweigbergk@xxxxxxxxx> wrote:
>> > On Wed, 16 Feb 2011, Santi B?jar wrote:
>> >
>> >> On Wed, Feb 16, 2011 at 3:03 AM, Martin von Zweigbergk
>> >> <martin.von.zweigbergk@xxxxxxxxx> wrote:
>> >> >
>> >> >        .-u@{0}
>> >> >       /
>> >> >      .---u@{1}
>> >> >     /
>> >> > x---y-----u@{2}
>> >> >     \
>> >> >      .---u@{3}---b
>> >> >       \
>> >> >        .-u@{4}
>> >> >
>> >> >
>> >> > I have an idea inspired by bisection, Thomas's exponential stride, and
>> >> > what someone (you?) mentioned the other day about virtual merge
>> >> > commits. I haven't tried it out, but let me know what you think. I'll
>> >> > try to explain it using an example only:
>> >> >
>> >> > Exponential stride phase:
>> >> > 1. candidates={ u@{0} }
>> >> >   merge-base b $candidates -> y, _not_ in $candidates
>> >> > 2. candidates={ u@{1} u@{2} }
>> >> >   merge-base b $candidates -> y, _not_ in $candidates
>> >> > 3. candidates={ u@{3} u@{4} u@{5} u@{6} }
>> >> >   merge-base b $candidates -> u@{3}, in $candidates
>> >>
>> >> Doesn't it indicate that u@{3} is the commit we are looking for? I
>> >> haven't found a counterexample...
>> >
>> > Yes, of course. Stupid me ;-). Forget about the other half. (I think
>> > that's what I did manually to match the sha1 back to the ref name, but
>> > that is of course complete non-sense to do in the script.)
>> >
>> >> If this is true the following patch can implement it for git-pull.sh and
>> >> git-rebase.sh (sorry if it is space damaged):
>> >
>> > Thanks! Will have a closer look at it later today. If I understand
>> > correctly, you simply call merge-base with the _entire_ reflog. I
>>
>> Yes, that is the idea (plus the old remote hash in case of git-pull)
>>
>> > would have thought that would be slow, but it's great if that is fast
>> > enough.
>>
>> Yes, I think it is fast enough in the normal case. Even feeding the
>> entire git.git's master, ~25000 revisions, it takes around 2-4 seconds
>> only:
>>
>> $ git rev-list origin/master | wc -l
>> 24380
>> $ time git merge-base $(git rev-list origin/master)
>> 9971d6d52c5afeb8ba60ae6ddcffb34af23eeadd
>>
>> real  0m4.014s
>> user  0m1.520s
>> sys   0m2.284s
>>
>> (2.5GHz CPU)
>
>
> I finally got around to doing some tests on this myself. I used
> git.git as of mid Feb, which at that time had 10010 commits in master,
> following only the first parent. I took the first 563 commits from the
> todo branch and transplanted onto master~10000 (there were some
> conflicts after about 563 commits and I figured that would be enough
> anyway). I then rebased the resulting branch (let's call it 'u')
> against master~9990, then against master~9980 and so on to get a
> reflog with 1001 entries for u. I then created another branch 'b'
> based on u@{10}, u@{100} and @{1000}, for different runs of the
> tests. I created one additional commit on b in each case. I then
> rebased b with master, using the following algorithms to find the base
> to rebase from:
>
>  manual: simply calling 'git rebase --onto u b~1'
>
>  linear: same algorithm as in 'git pull', which linearly walks the
>  reflog until a commit that b contains is found
>
>  merge-base: the base will be calculated as 'git merge-base b $(git
>  ref-list -g u)'
>
>  exponential: like merge-base, but start with only u@{0}, then
>  {u@{1},u@{2}} and so on until a commit that b contains is found

exp(n,m): like merge-base, but start with n candidates {u@{0},
..., u@{n-1}}, then n*m candidates and so on until a commit that b
contains is found.

>
> These are the results:

These are best timing out of three runs, mean, only the first one? Hot-cache
for all tests?

>
>                 u@{10}     u@{100}    u@{1000}
> manual         0m0.535s    0m1.164s    0m1.415s
> linear         0m1.245s   0m37.367s   5m10.068s
> merge-base    0m14.490s   0m15.409s   0m15.508s
> exp(1,2)       0m1.056s    0m6.175s   0m27.221s
> exp(10,10)     0m1.950s   0m20.031s   0m18.215s
> exp(7,7)       0m1.310s    0m6.851s   0m16.757s
>
> (1.8 GHz Athlon 64).
>
> This clearly shows that the linear algorithm from git pull is not good
> enough when rebasing older branches (i.e. branches whose upstream has
> many reflog entries created after the branch itself was created).
>
> The time it takes the "merge-base" algorithm is quite independent on
> how old the branch is, but with this quite long and branchy reflog
> (but not too dissimilar from git.git's pu?), it takes quite a while to
> calculate it. I think this is also too slow to be acceptable as a
> default.
>
> I would personnally be happy if the "exponential" algorithm was used
> by git rebase default. I suppose not everyone would agree that the
> convenience outweighs the performance cost, though.

I don't agree with this. For the normal case there is no performance cost
(manual 0.5s, exp(7,7) 1.3s). There is performance cost (manual 1.4s, exp(7,7)
16.7s) when you need it, when your upstream has been rebased a long ago (in
reflog entries).

> OTOH, a slower
> algorithm has been used in git pull for a long time and it seems like
> not many people have really been bothered by that. Also see the
> following paragraphs.
>
> I also ran the same tests with an upstream branch that was never
> force-updated. For these test cases, I created a reflog such that
> u@{$i} = master~$((10 * $i)). Since the upstream branch was know never
> to have been force-updated in this case, the "manual" test case was
> simply 'git rebase u'. These are the results:
>
>                 u@{10}     u@{100}    u@{1000}
> manual         0m0.885s    0m6.126s   0m52.248s
> linear         0m1.349s   0m39.688s   5m28.753s
> merge-base     0m1.160s    0m1.699s    0m1.901s
> exp(1,2)       0m0.769s    0m4.342s    0m7.360s
> exp(10,10)     0m0.700s    0m2.535s    0m3.110s
> exp(7,7)       0m0.653s    0m2.332s    0m3.506s
>
> Not surprisingly, the linear algorithm is slow in these cases as well.
>
> What's more interesting here is that the last two algorithms are
> actually faster than the plain 'git rebase u'. This is caused by
> --ignore-if-in-upstream flag to format-patch. Since the other three
> algorithms try to figure out what the base was and pass the range from
> the guessed base to the branch (e.g. u@{100}..b) to format-patch, the
> --ignore-if-in-upstream to that command effectively becomes a no-op.
>
> Although this makes rebase faster in the case of a non-force-updated
> upstream, it may also be a problem in some cases. This was something
> that I had not thought about until I started timing the calls. One
> reason I can think of when the --ignore-if-in-upstream is useful is
> when the upstream branch has been rebased, but this is exactly the
> case when guessing the old base is useful and solves the problem in a
> better way anyway. However, if a commit on the upstream branch was
> cherry-picked from some commit on the current branch above its base
> (i.e. in u@{x}..b), then that would not be detected by
> --ignore-if-in-upstream and could result in unnecessary merge
> conflicts. I don't know how common this case is.
>
> The above also applies to 'git pull', of course, but the ways of
> getting identical patches in upstream are probably different (more
> likely by 'git am' than 'git cherry-pick' perhaps).
>
> I think this is a useful feature. I'm just not sure how to balance the
> performance vs convenience. Worst case, this could probably become a
> command line option and configuration. I guess 'git pull' should use
> the same algorithm. If we decide to use configation, maybe git-pull's
> default would need to be different to be backward compatible.
>
> Any thoughts?

I think it is worth, as it looks like it only affects those who need
the feature.
The exp(7,7) or similar seems a good candidate.

HTH,
Santi
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html