Re: absurdly slow git-diff

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Fri, 7 Nov 2008 13:37:29 -0800 (PST)

On Sat, 8 Nov 2008, Abhijit Menon-Sen wrote:
>
> If anyone's interested, the files are http://toroid.org/misc/1 and
> http://toroid.org/misc/2

Btw, you can see this by just doing

	git diff 1 2

without even doing "git init" or doing any actual git repository.

> Does anyone understand why this slowdown might happen or have
> suggestions about where I should look for it?

Sure. It's actually fairly simple. You're hitting a O(n^2) thing (possibly 
higher), and it's triggered by the fact that almost all your lines are 
identical, ie you have a file that basically has 40,000 lines of each of

	xxxx: xxx, xx xxx xxxx xx:xx:xx +xxxx
	xx: xxxx xxxxxxxxxxx <xxxx@xxxxxxxx>
	xxxx: xxxx xxxxxxxxxxx <xxxx@xxxxxxxx>

and 30,000 of 

	* xxxxx xxxxx (xxxxxx.xxxx xxx xxxxx (\Xxxxxx) xxxxxx.xxxxxx {xxx}

with a smattering of others. And this is a case where the internal git 
implementation does really badly. And nobody has really cared before, 
because nobody has ever had a case that mattered.

There's a number of different 'diff' algorithms, and it looks like GNU 
diff has one that avoids the O(n^2) case for this case.

I'm adding Davide as the original author of the diff library to the cc. 
I'm also adding Pierre, since he was talking about trying to implement
another diff algorithm (although I'm not at all sure that the patience 
diff really would help this case at all).

		Linus
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html