Re: FFmpeg considering GIT

Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> · Sun, 6 May 2007 09:38:48 -0700 (PDT)

On Sun, 6 May 2007, Karl Hasselström wrote:
> 
> OK, now I've tested it, and just as you said, it works (and is _very_
> useful) but looks like crap. :-)
> 
> Is there any fundamental reason why
> 
>   gitk -- some/path/name
> 
> generates a nice, connected graph, while
> 
>   gitk -S'some string'
> 
> generates disconnected spaghetti?

There is a reason, and it's fairly fundamental: the path limiting code is 
deeply embedded in the revision walking, and I've spent a fair amount of 
effort on making that work and efficient as hell (it's one of the few 
areas in git where I'm probably still the main author). Because it's 
literally what I do 90% of the time: for me, the path-limiting code is 
basically _the_ most important git feature, and I care very deeply.

In contrast, the "-S" thing is not actually part of the revision walking 
at all, and is a totally separate phase that is done when revisions are 
_shown_. I almost never use it myself, and it grew out of a totally 
separate effort by Junio. 
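
The difference between pruning during the walk and filtering at display time can be sketched roughly as below. This is only a toy model, not git's code: the commit names, the `interesting` set and both helper functions are hypothetical, and the history is linear for simplicity. It shows why a display-time filter leaves parent links pointing at commits that are never shown, while rewriting the parents keeps the graph connected.

```python
# Toy history: commit -> list of parents (hypothetical names, linear history).
parents = {"E": ["D"], "D": ["C"], "C": ["B"], "B": ["A"], "A": []}
interesting = {"E", "B", "A"}  # commits that pass the filter


def filter_only(parents, keep):
    """Drop uninteresting commits at display time; parent links are untouched."""
    return {c: ps for c, ps in parents.items() if c in keep}


def rewrite_parents(parents, keep):
    """Drop uninteresting commits and re-point every parent link at the
    nearest kept ancestor, so the remaining graph stays connected."""
    def nearest(commit):
        # Walk through dropped commits until kept ones are reached.
        if commit in keep:
            return [commit]
        found = []
        for p in parents[commit]:
            for n in nearest(p):
                if n not in found:
                    found.append(n)
        return found

    result = {}
    for commit, ps in parents.items():
        if commit not in keep:
            continue
        new_parents = []
        for p in ps:
            for n in nearest(p):
                if n not in new_parents:
                    new_parents.append(n)
        result[commit] = new_parents
    return result


shown = filter_only(parents, interesting)
rewritten = rewrite_parents(parents, interesting)

# Commits whose parent is not among the shown commits: the graph breaks here.
dangling = [c for c, ps in shown.items() if any(p not in shown for p in ps)]
print("filter only:", shown, "dangling:", dangling)
print("rewritten:  ", rewritten)
```

In the filtered-only result, E still names D as its parent although D is not shown, which is the kind of gap a graph viewer has to draw as a disconnected line.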

> Or could the latter be made to use the same parent-rewriting logic as 
> the first?

It would probably be possible to make the -S logic be another part of the 
"prune_fn()" logic in revision.c, and it might even simplify some of the 
logic, but I suspect it would actually suck really really badly from a 
performance standpoint.

Why? Because the prune_fn() logic is done when we generate the revision 
graph, which is generally something that a lot of the operations have to 
do up-front before they can do _anything_ else. E.g., any revision limiter 
(and that's a very common case) like "v2.6.21.." will cause the revision 
pruning to happen synchronously and early on.

And the path-limiting is *fast*. It's so incredibly fast that people don't 
really realize how fast it is. And it absolutely needs to be fast, because 
when you do something like "gitk v2.6.18.. drivers/" on the kernel you end 
up doing a _lot_ of tree comparisons. It's why I'm pretty sure nobody else 
can ever do what git does - it takes full advantage of how git can tell 
that a whole subdirectory hasn't changed without even recursing into it.
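
A minimal sketch of that short-cut, assuming a simplified content-addressed tree in which a directory's hash covers the names and hashes of its entries. The `blob`, `tree` and `changed_paths` helpers and the file names are made up for illustration and are not git's object format or code. Two subtrees with the same hash are skipped without being opened.

```python
import hashlib


def blob(data):
    """A file, identified by the hash of its contents."""
    return ("blob", hashlib.sha1(data.encode()).hexdigest(), None)


def tree(entries):
    """A directory; its hash covers the names and hashes of its entries."""
    h = hashlib.sha1()
    for name in sorted(entries):
        h.update(f"{name}:{entries[name][1]}\n".encode())
    return ("tree", h.hexdigest(), entries)


def changed_paths(old, new, prefix="", stats=None):
    """List the paths that differ between two trees, never recursing into
    a subtree whose hash is the same on both sides."""
    if stats is not None:
        stats["trees_opened"] += 1
    out = []
    old_entries, new_entries = old[2], new[2]
    for name in sorted(set(old_entries) | set(new_entries)):
        a, b = old_entries.get(name), new_entries.get(name)
        path = prefix + name
        if a is not None and b is not None and a[1] == b[1]:
            continue  # same hash: nothing below here changed
        if a is not None and b is not None and a[0] == b[0] == "tree":
            out += changed_paths(a, b, path + "/", stats)
        else:
            out.append(path)  # added, removed, or modified entry
    return out


fs = tree({"ext3": tree({"inode.c": blob("v1")})})
old_root = tree({
    "drivers": tree({"net": tree({"e1000.c": blob("v1")}),
                     "usb": tree({"hub.c": blob("v1")})}),
    "fs": fs,
})
new_root = tree({
    "drivers": tree({"net": tree({"e1000.c": blob("v2")}),
                     "usb": tree({"hub.c": blob("v1")})}),
    "fs": fs,
})

stats = {"trees_opened": 0}
result = changed_paths(old_root, new_root, stats=stats)
print(result, stats)
```

Of the six directories in this toy tree, only the root, drivers and drivers/net are opened; fs and drivers/usb are dismissed by a single hash comparison each.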

In contrast, "-S" is _slow_. It's a really really expensive operation. Git 
makes generating diffs faster than just about anything else, but it's 
still really expensive. This is a really unfair comparison, but:

	time git log drivers/net/ > /dev/null

	real    0m1.488s
	user    0m1.444s
	sys     0m0.040s

i.e. we can do the log pruning for the whole kernel git history on a 
subdirectory in less than two seconds. 

Try to compare it with

	time git log -Sdrivers/net/ > /dev/null

and I suspect you won't have the patience to wait for the end result.

And yeah, the operations are fundamentally very very different, and yes, 
the latter operation is really really expensive (which is why I said it's 
a really unfair comparison). But the point is that the expense comes from 
how git has been designed: seeing differences in the paths is cheap by 
design (it's how the data structures are laid out), but seeing differences 
in actual diffs means that we have to fully generate each diff for each 
revision!
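
For contrast, here is a rough sketch of what a string search over history has to do. It models the documented behaviour of -S (report a commit when the number of occurrences of the string changes in some file) and is not git's implementation; the `pickaxe` function and the snapshot layout are assumptions made for the example. The point is that every changed file has to be read in full on both sides, so hashes alone cannot answer the question.

```python
def pickaxe(history, needle):
    """history: list of (commit_id, {path: contents}) snapshots, oldest first.

    Returns (hits, chars_scanned), where hits are the commits in which the
    number of occurrences of needle changed in at least one file."""
    hits = []
    chars_scanned = 0
    prev = {}
    for commit_id, files in history:
        for path in sorted(set(prev) | set(files)):
            before = prev.get(path, "")
            after = files.get(path, "")
            if before == after:
                continue  # unchanged file: nothing to search
            # Both versions of every changed file must be scanned in full.
            chars_scanned += len(before) + len(after)
            if before.count(needle) != after.count(needle):
                hits.append(commit_id)
                break
        prev = files
    return hits, chars_scanned


history = [
    ("c1", {"a.c": "int foo;\n"}),
    ("c2", {"a.c": "int foo;\nint bar;\n"}),
    ("c3", {"a.c": "int foo;\nint bar;\n", "b.c": "foo();\n"}),
]

bar_hits, bar_cost = pickaxe(history, "bar")
foo_hits, foo_cost = pickaxe(history, "foo")
print(bar_hits, bar_cost)
print(foo_hits, foo_cost)
```

The cost here grows with the amount of changed file content in every commit, whereas the path comparison only depends on how many tree hashes differ.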

A different approach to the underlying data structures could change the 
equation. For example, if the fundamental data representation was the 
"diff" (rather than the "whole tree") maybe -S would be as fast as path 
limiting. But you'd *really* suck for other things.

To summarize a long story: the path limiting is simply more fundamental in 
git. Both by design, and then - obviously partly _due_ to that - by the 
sheer effort we've spent on it. It's something very deep and very important. 
In comparison, the -S thing is a cute extra feature, nothing really "deep".

		Linus