Re: [RFC PATCH 3/3] git-rebase.sh: make git-rebase--interactive the default

Johannes Schindelin <Johannes.Schindelin@xxxxxx> · Thu, 21 Jun 2018 12:57:28 +0200 (DST)

Hi Elijah,

On Wed, 20 Jun 2018, Elijah Newren wrote:

> On Sun, Jun 17, 2018 at 2:44 PM, Johannes Schindelin
> <Johannes.Schindelin@xxxxxx> wrote:
> 
> > I was really referring to speed. But I have to admit that I do not have
> > any current numbers.
> >
> > Another issue just hit me, though: rebase --am does not need to look at as
> > many Git objects as rebase --merge or rebase -i. Therefore, GVFS users
> > will still want to use --am wherever possible, to avoid "hydrating"
> > many objects during their rebase.
> 
> What is it that makes rebase --am need fewer Git objects than rebase
> --merge or rebase -i?

Multiple things. The most obvious thing: --cherry-pick. When you call `git
rebase --am`, it does not try to exclude the patches by looking at
`rev-list --cherry-pick --right-only <upstream>..<head>`

This really bit us in GVFS Git because we have several thousand developers
working on the same code base, and you probably read the numbers elsewhere
how many commits are introduced while a developer goes on so much as a
week-long vacation.

Next, rename detection.

I think that bit us even harder because it tries to look at all the files
that have been changed in upstream in the meantime, which can be *a lot*.
And if you know that moving files outside of your cozy little sparse
checkout is unlikely (or moving from the outside into your checkout), then
all this hydration is pretty wasteful. That's why we had to switch off
rename detection in GVFS Git.

> My guess at what objects are needed by each type:
> 
> At a high level, rebase --am for each commit will need to compare the
> commit to its parent to generate a diff (which thus involves walking
> over the objects in both the commit and its parent, though it should
> be able to skip over subtrees that are equal), and then will need to
> look at all the objects in the target commit on which it needs to
> apply the patch (in order to properly fill the index for a starting
> point, and used later when creating a new commit).

No, in --am mode, it does not even need to look at all the objects in the
target commits. Only those objects that correspond to the files that are
touched by the diff.

In a massive code base, such as Windows', this makes a huge difference.
Even comparing the number of files touched by the patches that are to be
rebased to the number of files that were touched in upstream since you
rebased last is ridiculous. And a three-way merge needs to consider that
latter set of files.

> If the application of the diff fails, it falls back to a three-way
> merge, though the three-way merge shouldn't need any additional objects.

The three-way merge needs to reconcile the diff between branch point and
your changes to the diff between branch point and upstream's changes. The
latter can be a lot bigger than the former, in which case --am's
complexity (O(former)) is much nicer than --merge's (O(former u latter)).

> So, to summarize, rebase--am needs objects from the commit being
> rebased, its parent, and the target commit onto which it is applying,
> though it can short circuit some objects when the commit and its parent
> have matching subtree(s).
> 
> rebase -i, if I understand correctly, does a three-way merge between
> the commit, its parent, and the target commit.  Thus, we again walk
> over objects in those three commits; I think unpack_trees() does not
> take advantage of matching trees to avoid descending into subtrees,
> but if so that's an optimization that we may be able to implement
> (though it would require diving into unpack_trees() code, which is
> never easy...).
> 
> (Side notes: (1) rebase --merge is basically the same as rebase -i
> here; it's path to reaching the recursive merge machinery is a bit
> different but the resulting arguments are the same; (2) a real merge
> between branches would require more objects because it would have to
> do some revision walking to find a merge base, and a real merge base
> is likely to differ more than just the parent commit.  But finding
> merge bases isn't relevant to rebase -m or rebase -i)
> 
> Is there something else I'm missing that fundamentally makes rebase -i
> need more objects?
> 
> > As to speed: that might be harder. But then, the performance might already
> > be good enough. I do not have numbers (nor the time to generate them) to
> > back up my hunch that --am is substantially faster than --merge.
> 
> I too have a hunch that --am is faster than --merge, on big enough
> repos or repos with enough renames.  I can partially back it up with
> an indirect number: at [1], it was reported that cherry-picks could be
> sped up by a factor of 20-30 on some repos with lots of renames.  I
> believe there are other performance improvements possible too, for the
> --merge or -i cases.
> 
> I'm also curious now whether your comment on hydrating objects might
> uncover additional areas where performance improvements could be made
> for non-am-based rebases of large-enough repos.
> 
> Elijah
> 
> [1] https://public-inbox.org/git/CABPp-BH4LLzeJjE5cvwWQJ8xTj3m9oC-41Tu8BM8c7R0gQTjWw@xxxxxxxxxxxxxx/
> (see also Peter's last reply in that thread, and compare to his first
> post)

I am too unfamiliar with the rename detection to understand the details,
but you are familiar with it, so I trust your judgement!

Ciao,
Dscho