On Mon, 28 May 2007, linux@xxxxxxxxxxx wrote:
>
> Even losing a constant factor of 2, it still seems like it might offer a
> factor-of-2 speedup for large repositories.

Well, not so much according to the numbers I had. Yes, SHA-1's are very
nicely distributed on a larger scale, but on a _small_ scale they aren't.
So you end up being able to get very good initial guesses for the first
few iterations, but once you get "close enough", you can't do anything at
all, and then you're actually worse off.

Also, please do realize that the binary search is actually a *smart*
binary search, which does a radix-256 fan-out at the beginning, getting
rid of the first 8 levels of searching for free. The Newton-Raphson thing
can do that too (and in my trial patch, did), but it means that you
actually get rid of just the initial guess (the one that worked the
best!), and you still have the problem that once you get close enough,
the distribution at a local level is not at all nice and linear.

So what I did with my patch (you can find it in the archives - search for
Newton-Raphson, I'd say about 3 months ago) was to do three cycles of
approximating, and then a linear search. The linear search has good cache
behaviour, so it's not as horrid as it might sound, but for some numbers
I did, my approximate thing would hit on the exact first entry about half
the time, but would have to walk up to ~20 entries occasionally.

Which meant that the binary search (with the initial radix-256 fan-out)
actually mostly outperformed the Newton-Raphson thing.

Also, object lookup is certainly noticeable, but the biggest cost of it
is the cache misses, and even then it's not really _that_ noticeable. And
it's really neither slow nor stupid as it is.

So I'd love to see somebody try to do a _proper_ approximation of
Newton-Raphson, not the quick hack I did. But I think factor-of-two is
actually optimistic, even for pretty large repos. And it would have to be
no worse than what we have now for average-sized ones.

(And object lookup time is generally not the dominant cost, so while it's
noticeable, cutting it down by even a whole 50% isn't going to speed up
any normal git operations by more than a couple of percent. Generating
diffs in particular is a much more costly op for things like "git blame")

		Linus
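
[For anyone curious what the scheme described above might look like, here
is a rough, untested sketch of the interpolation idea: radix-256 fan-out,
a few rounds of interpolation, then a short linear walk. It is not Linus's
actual patch; the flat sha1s table, the fanout convention, and the name
find_sha1_interp() are made up for illustration, and git's real pack-index
layout differs.]

	#include <stdint.h>
	#include <string.h>

	/* Bytes 1..4 of a SHA-1 as a 32-bit key (byte 0 is handled by the
	 * radix-256 fanout table). */
	static uint32_t sha1_key(const unsigned char *sha1)
	{
		return ((uint32_t)sha1[1] << 24) | (sha1[2] << 16) |
		       (sha1[3] << 8) | sha1[4];
	}

	/*
	 * sha1s:  nr entries of 20 bytes each, sorted.
	 * fanout: fanout[i] = number of entries whose first byte is <= i.
	 * Returns the index of the matching entry, or -1.
	 */
	static int find_sha1_interp(const unsigned char *sha1s, uint32_t nr,
				    const uint32_t *fanout,
				    const unsigned char *want)
	{
		uint32_t lo = want[0] ? fanout[want[0] - 1] : 0;
		uint32_t hi = fanout[want[0]];
		uint32_t key = sha1_key(want);
		int pass;

		if (lo >= hi)
			return -1;

		/* A few cycles of approximating to narrow [lo, hi),
		 * assuming the hashes are roughly uniformly spread. */
		for (pass = 0; pass < 3 && hi - lo > 16; pass++) {
			uint32_t lo_key = sha1_key(sha1s + 20 * lo);
			uint32_t hi_key = sha1_key(sha1s + 20 * (hi - 1));
			uint32_t k = key, guess;
			int cmp;

			if (hi_key <= lo_key)
				break;
			if (k < lo_key)
				k = lo_key;
			if (k > hi_key)
				k = hi_key;
			guess = lo + (uint32_t)((uint64_t)(k - lo_key) *
						(hi - 1 - lo) /
						(hi_key - lo_key));
			cmp = memcmp(sha1s + 20 * guess, want, 20);
			if (!cmp)
				return guess;
			if (cmp < 0)
				lo = guess + 1;
			else
				hi = guess;
		}

		/* Finish with a cache-friendly linear walk. */
		for (; lo < hi; lo++)
			if (!memcmp(sha1s + 20 * lo, want, 20))
				return lo;
		return -1;
	}

[The clamping of the key and the early break on a flat key range are just
defensive choices for the sketch; the point is the shape of the algorithm,
not the exact constants (3 passes, 16-entry cutoff), which here simply
echo the "three cycles then a linear search" description above.]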