Re: git rebase chokes on directory -> symlink -> directory

Junio C Hamano <junkio@xxxxxxx> · Thu, 10 May 2007 23:04:05 -0700

"Shawn O. Pearce" <spearce@xxxxxxxxxxx> writes:

> Junio C Hamano <junkio@xxxxxxx> wrote:
>>  * git-rebase with -m is dog slow.  There were people who
>>    advocated to make it the default, but they probably are
>>    either working in a very small project, or working on a
>>    filesystem so slow that even git-apply is slow, so the
>>    speed difference does not matter to them.
> ...
> But that's not the situation everyone else has, so it's reasonable
> that -m ain't the default.  ;-)

Well, that is not the conclusion you should be drawing from
this. If rebase -m is 10x slower than without -m in cases where
the rename handling does not matter, there is something wrong.

And what is wrong in this case is that the unpack-trees tree
merging code, which is used everywhere in git to do branch
switching and merges, is way too inefficient.

When merge-recursive is instructed to merge another tree with
the current tree using an ancestor, while taking the index into
account, it basically does the three-way tree-level merge one
path at a time, even when a subdirectory at quite a high level
is identical across all three trees.
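
To illustrate what skipping identical subdirectories could look
like, here is a minimal sketch in Python.  It is not how
merge-recursive or unpack-trees are written; Tree, oid_of and
merge_entry are made-up names, the index is ignored entirely,
and a conflict is simply recorded and resolved to "ours".

```python
import hashlib

class Tree:
    """Hypothetical in-memory tree: maps a name to a blob id (str) or a Tree."""
    def __init__(self, entries):
        self.entries = dict(entries)
        h = hashlib.sha1()
        for name in sorted(self.entries):
            h.update(f"{name}\0{oid_of(self.entries[name])}\n".encode())
        self.oid = h.hexdigest()  # identical contents give an identical id

def oid_of(entry):
    """Object id of a blob (str), a Tree, or a missing entry (None)."""
    return entry.oid if isinstance(entry, Tree) else entry

def merge_entry(base, ours, theirs, path, conflicts, visited):
    """Three-way merge of one entry; descends only when both sides changed."""
    visited.append(path)
    b, o, t = oid_of(base), oid_of(ours), oid_of(theirs)
    if o == t:   # both sides agree, e.g. an untouched subdirectory
        return ours
    if b == o:   # only theirs changed: take it wholesale
        return theirs
    if b == t:   # only ours changed: keep it wholesale
        return ours
    if isinstance(ours, Tree) and isinstance(theirs, Tree):
        base_entries = base.entries if isinstance(base, Tree) else {}
        merged = {}
        names = set(base_entries) | set(ours.entries) | set(theirs.entries)
        for name in sorted(names):
            child = merge_entry(base_entries.get(name),
                                ours.entries.get(name),
                                theirs.entries.get(name),
                                f"{path}/{name}" if path else name,
                                conflicts, visited)
            if child is not None:  # None means the path was removed
                merged[name] = child
        return Tree(merged)
    conflicts.append(path)  # both sides changed a blob differently
    return ours

drivers = {"a.c": "a1", "b.c": "b1"}
base = Tree({"Makefile": "m1", "drivers": Tree(drivers), "fs": Tree({"x.c": "x1"})})
ours = Tree({"Makefile": "m2", "drivers": Tree(drivers), "fs": Tree({"x.c": "x1"})})
theirs = Tree({"Makefile": "m1", "drivers": Tree(drivers), "fs": Tree({"x.c": "x2"})})
conflicts, visited = [], []
merged = merge_entry(base, ours, theirs, "", conflicts, visited)
```

In this toy run the walk looks at the toplevel and its three
entries, and never descends into drivers/ or fs/, because each
of them can be decided from the tree ids alone.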

The situation is the same for switching branches.  If two
branches of the kernel project (22k files spread across 1300
directories) differ at a file at the toplevel (e.g. v2.6.21
which changes only Makefile), we still read the index, the
current tree, and the other branch, and match all 22k files one
by one to compute the resulting index entry, by first removing
the current index entry and then stuffing the result entry in
the index, all the while trashing the cache-tree.  Then we
recompute all 1300 tree objects and write them out, even though
we should be able to notice that none of the toplevel 17
subdirectories have changed, and all we have to do is to rehash
one blob and recompute only one tree object at the toplevel.  We
boast how lightweight git branches are and how fast switching
between two branches is, but that's a serious lie.  If done
properly, we should be able to switch branches in a time roughly
proportional to the number of paths different between the
branches.  Currently, the time is proportional to the size of
the tree, no matter how small the change between the trees is.
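
A rough sketch of a two-tree comparison whose work follows the
size of the difference rather than the size of the tree, again
in Python and not taken from git.  A node here is a hypothetical
(oid, children) pair with hand-assigned ids, where children is a
dict for a tree and None for a blob; changed_paths is a made-up
name.

```python
def changed_paths(old, new, prefix="", visited=None):
    """List paths that differ between two trees, skipping identical subtrees."""
    if visited is not None:
        visited.append(prefix or ".")
    old_id, old_children = old if old else (None, None)
    new_id, new_children = new if new else (None, None)
    if old_id == new_id:
        return []  # same blob or same whole subtree: nothing to look at
    changed = []
    old_is_blob = old is not None and old_children is None
    new_is_blob = new is not None and new_children is None
    if old_is_blob or new_is_blob:
        changed.append(prefix)  # a blob changed, appeared, or went away
    old_entries = old_children or {}
    new_entries = new_children or {}
    for name in sorted(set(old_entries) | set(new_entries)):
        child = f"{prefix}/{name}" if prefix else name
        changed += changed_paths(old_entries.get(name), new_entries.get(name),
                                 child, visited)
    return changed

# Two toplevel trees that differ only in Makefile.
drivers = ("D0", {"a.c": ("a1", None), "b.c": ("b1", None)})
fs = ("F0", {"x.c": ("x1", None)})
old = ("T0", {"Makefile": ("m1", None), "drivers": drivers, "fs": fs})
new = ("T1", {"Makefile": ("m2", None), "drivers": drivers, "fs": fs})
visited = []
result = changed_paths(old, new, visited=visited)
```

Here result is just the one changed path, and visited holds
only the toplevel and its three entries; the files below
drivers/ and fs/ are never examined.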

git-apply, which is used by rebase without -m, is optimized so
that its cost is proportional to the size of the change.  It
not only touches just the affected paths (obviously, because
the patch does not talk about unaffected paths) and leaves the
others intact, but also avoids expensive tree recomputation for
unaffected directories, by properly maintaining the cache-tree
data in the index.
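
The effect of keeping cache-tree data valid can be shown with a
toy model.  This is only a sketch under my own assumptions, not
git's index or cache-tree code: Index, update and write_tree are
hypothetical names, the index is a flat path-to-blob-id mapping,
and the cache maps a directory to its tree id.

```python
import hashlib

class Index:
    """Toy index: flat path -> blob id, plus a cache of directory -> tree id."""
    def __init__(self, entries):
        self.entries = dict(entries)
        self.cache_tree = {}  # "" stands for the toplevel directory
        self.recomputed = 0   # how many tree ids write_tree() had to compute

    def update(self, path, blob_id):
        self.entries[path] = blob_id
        # Invalidate only the directories leading down to this path.
        directory = path
        while directory:
            directory = directory.rpartition("/")[0]
            self.cache_tree.pop(directory, None)

    def write_tree(self, directory=""):
        if directory in self.cache_tree:
            return self.cache_tree[directory]  # still valid: no work at all
        self.recomputed += 1
        prefix = directory + "/" if directory else ""
        children = {}
        # A real index is sorted and would not scan every entry here;
        # the scan only keeps this sketch short.
        for path, blob_id in self.entries.items():
            if path.startswith(prefix):
                name, slash, _ = path[len(prefix):].partition("/")
                children[name] = None if slash else blob_id
        h = hashlib.sha1()
        for name in sorted(children):
            child_id = children[name]
            if child_id is None:  # subdirectory: reuse or recompute its id
                child_id = self.write_tree(prefix + name)
            h.update(f"{name}\0{child_id}\n".encode())
        self.cache_tree[directory] = h.hexdigest()
        return self.cache_tree[directory]

idx = Index({"Makefile": "m1", "drivers/net/e.c": "e1",
             "drivers/usb/u.c": "u1", "fs/x.c": "x1"})
first = idx.write_tree()   # cold cache: every directory is computed
cold = idx.recomputed
idx.recomputed = 0
idx.update("Makefile", "m2")
second = idx.write_tree()  # only the toplevel tree is recomputed
warm = idx.recomputed
```

With a cold cache all five directories are hashed; after
touching only Makefile, a single tree (the toplevel) is
recomputed and the rest come straight from the cache.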

IIRC, Linus said unpack-trees was beyond repair several months
ago, and I tend to agree with him.  Currently the first thing
unpack-trees does is to discard cache-tree from the index,
because the code does not properly invalidate affected paths,
and it is probably way too cumbersome to add invalidation to the
various places where the code modifies the index (I haven't
looked at it
recently, so maybe somebody can try it and prove me wrong).

My gut feeling is that we may be better off redoing the
tree-level merge infrastructure from scratch, and making a new
one that is optimized for trees with small differences.  There
is prototype code called test-para in 'pu' that implements such a
multi-tree walk, and also we've had its precursor (by Linus)
called git-merge-tree in 'master' for quite a long time, but
unfortunately neither has recently seen any activity.
