Re: git and larger trees, not so fast?

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Thu, 9 Aug 2007 10:38:53 -0700 (PDT)

On Thu, 9 Aug 2007, Linus Torvalds wrote:
> 
> Doing an ltrace on it shows tons and tons of:
> 
> 	...
> 	strlen("35")
> 	strlen("349")
> 	calloc(1, 72)
> 	memcpy(0x73034e, "10/", 3)
> 	memcpy(0x730351, "349", 4)
> 	memmove(0x2ab637f41e80, 0x2ab637f41e78, 781768)
> 	...
> 
> but I haven't looked at where they come from yet.

Ouch. It's the diffing between HEAD and the index, and it's all from 
"add_index_entry()", which sorts the index array using an insertion sort. 
So when the index array gets large, that sort spends all its time in huge 
memmove() calls.

The silly thing, of course, is that we don't even "need" to do that: both 
the index and the trees are really sorted already, so we could just 
interleave them. But since we read them separately, the thing just sucks.

We've fixed other similar cases of this we had (diffing trees against each 
other) by walking the trees together, but the "index vs tree" diff (and 
merge) is the one remaining place where we still use the original stupid 
algorithm. So you'll see this performance problem for 

 - diff tree against index ("git diff HEAD"
 - merge tree into index ("git read-tree -m HEAD")

which both do the stupid index/tree filling.

So this is all O(n**2), which is why we haven't reacted very much - it 
doesn't show up nearly as much with the kernel. Also, with a smaller set 
of files, it would tends to fit in the L2 cache of most competent CPU's. 
So not only is it n**2, you get the cache trashing behaviour too, and 
that, I think, is what really causes it to fall off the cliff edge!

Gaah. This shouldn't be *that* hard to fix, but I'm not entirely sure I'll 
have time today.

Diffing the index against the tree *should* be instantaneous. It should be 
no more costly than reading the tree itself (which is 0.191 seconds for 
me: test "git read-tree -m HEAD" vs "git read-tree HEAD") and reading the 
index (which is almost instantaneous - the only way I can test it is by 
doing something like "git update-index --refresh", and that's 0.131 
seconds, but that includes all the 100,000 "lstat()" calls).

So basically, we're spending several seconds just doing stupid 
make-believe work and moving the index array around. Ouch.

Anyway, the good news is that this is by no means fundamental. It's a 
small and stupid detail. The only thing that makes it at all painful is 
that this is in some low-level crud that we haven't touched in *ages*, so 
I've long since swapped out all my recollection of how we do it.

(We basically do:

	read_cache();

followed by

	unpack_trees();

and each of those *on*its*own* is pretty cheap, but when we unpack trees 
into an already populated index, the end result is ugly.

			Linus
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html