On Thu, 17 Feb 2011, Santi B?jar wrote: > On Wed, Feb 16, 2011 at 5:45 PM, Martin von Zweigbergk > <martin.von.zweigbergk@xxxxxxxxx> wrote: > > On Wed, 16 Feb 2011, Santi B?jar wrote: > > > >> On Wed, Feb 16, 2011 at 3:03 AM, Martin von Zweigbergk > >> <martin.von.zweigbergk@xxxxxxxxx> wrote: > >> > > >> > .-u@{0} > >> > / > >> > .---u@{1} > >> > / > >> > x---y-----u@{2} > >> > \ > >> > .---u@{3}---b > >> > \ > >> > .-u@{4} > >> > > >> > > >> > I have an idea inspired by bisection, Thomas's exponential stride, and > >> > what someone (you?) mentioned the other day about virtual merge > >> > commits. I haven't tried it out, but let me know what you think. I'll > >> > try to explain it using an example only: > >> > > >> > Exponential stride phase: > >> > 1. candidates={ u@{0} } > >> > merge-base b $candidates -> y, _not_ in $candidates > >> > 2. candidates={ u@{1} u@{2} } > >> > merge-base b $candidates -> y, _not_ in $candidates > >> > 3. candidates={ u@{3} u@{4} u@{5} u@{6} } > >> > merge-base b $candidates -> u@{3}, in $candidates > >> > >> Doesn't it indicate that u@{3} is the commit we are looking for? I > >> haven't found a counterexample... > > > > Yes, of course. Stupid me ;-). Forget about the other half. (I think > > that's what I did manually to match the sha1 back to the ref name, but > > that is of course complete non-sense to do in the script.) > > > >> If this is true the following patch can implement it for git-pull.sh and > >> git-rebase.sh (sorry if it is space damaged): > > > > Thanks! Will have a closer look at it later today. If I understand > > correctly, you simply call merge-base with the _entire_ reflog. I > > Yes, that is the idea (plus the old remote hash in case of git-pull) > > > would have thought that would be slow, but it's great if that is fast > > enough. > > Yes, I think it is fast enough in the normal case. Even feeding the > entire git.git's master, ~25000 revisions, it takes around 2-4 seconds > only: > > $ git rev-list origin/master | wc -l > 24380 > $ time git merge-base $(git rev-list origin/master) > 9971d6d52c5afeb8ba60ae6ddcffb34af23eeadd > > real 0m4.014s > user 0m1.520s > sys 0m2.284s > > (2.5GHz CPU) I finally got around to doing some tests on this myself. I used git.git as of mid Feb, which at that time had 10010 commits in master, following only the first parent. I took the first 563 commits from the todo branch and transplanted onto master~10000 (there were some conflicts after about 563 commits and I figured that would be enough anyway). I then rebased the resulting branch (let's call it 'u') against master~9990, then against master~9980 and so on to get a reflog with 1001 entries for u. I then created another branch 'b' based on u@{10}, u@{100} and @{1000}, for different runs of the tests. I created one additional commit on b in each case. I then rebased b with master, using the following algorithms to find the base to rebase from: manual: simply calling 'git rebase --onto u b~1' linear: same algorithm as in 'git pull', which linearly walks the reflog until a commit that b contains is found merge-base: the base will be calculated as 'git merge-base b $(git ref-list -g u)' exponential: like merge-base, but start with only u@{0}, then {u@{1},u@{2}} and so on until a commit that b contains is found These are the results: u@{10} u@{100} u@{1000} manual 0m0.535s 0m1.164s 0m1.415s linear 0m1.245s 0m37.367s 5m10.068s merge-base 0m14.490s 0m15.409s 0m15.508s exponential 0m1.056s 0m6.175s 0m27.221s (1.8 GHz Athlon 64). This clearly shows that the linear algorithm from git pull is not good enough when rebasing older branches (i.e. branches whose upstream has many reflog entries created after the branch itself was created). The time it takes the "merge-base" algorithm is quite independent on how old the branch is, but with this quite long and branchy reflog (but not too dissimilar from git.git's pu?), it takes quite a while to calculate it. I think this is also too slow to be acceptable as a default. I would personnally be happy if the "exponential" algorithm was used by git rebase default. I suppose not everyone would agree that the convenience outweighs the performance cost, though. OTOH, a slower algorithm has been used in git pull for a long time and it seems like not many people have really been bothered by that. Also see the following paragraphs. I also ran the same tests with an upstream branch that was never force-updated. For these test cases, I created a reflog such that u@{$i} = master~$((10 * $i)). Since the upstream branch was know never to have been force-updated in this case, the "manual" test case was simply 'git rebase u'. These are the results: u@{10} u@{100} u@{1000} manual 0m0.885s 0m6.126s 0m52.248s linear 0m1.349s 0m39.688s 5m28.753s merge-base 0m1.160s 0m1.699s 0m1.901s exponential 0m0.769s 0m4.342s 0m7.360s Not surprisingly, the linear algorithm is slow in these cases as well. What's more interesting here is that the last two algorithms are actually faster than the plain 'git rebase u'. This is caused by --ignore-if-in-upstream flag to format-patch. Since the other three algorithms try to figure out what the base was and pass the range from the guessed base to the branch (e.g. u@{100}..b) to format-patch, the --ignore-if-in-upstream to that command effectively becomes a no-op. Although this makes rebase faster in the case of a non-force-updated upstream, it may also be a problem in some cases. This was something that I had not thought about until I started timing the calls. One reason I can think of when the --ignore-if-in-upstream is useful is when the upstream branch has been rebased, but this is exactly the case when guessing the old base is useful and solves the problem in a better way anyway. However, if a commit on the upstream branch was cherry-picked from some commit on the current branch above its base (i.e. in u@{x}..b), then that would not be detected by --ignore-if-in-upstream and could result in unnecessary merge conflicts. I don't know how common this case is. The above also applies to 'git pull', of course, but the ways of getting identical patches in upstream are probably different (more likely by 'git am' than 'git cherry-pick' perhaps). I think this is a useful feature. I'm just not sure how to balance the performance vs convenience. Worst case, this could probably become a command line option and configuration. I guess 'git pull' should use the same algorithm. If we decide to use configation, maybe git-pull's default would need to be different to be backward compatible. Any thoughts? > But, as Junio showed, it has problems when the reflog lenght is too > large. Maybe git-merge-base can learn the --stdin flag, or we could > process the reflog in batches of 1000 (?) entries, ... but the nice > property of using the entire reflog is that the output is what you are > looking for, if you take the first 1000 entries you have to check if > the output is one of these entries. Since I think the exponential algorithm seems the best choice, we could probably just limit it to a certain number of entries, but maybe it's better to implement a --stdin flag to merge-base. It could be useful for others too. /Martin -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html