On Wed, 3 Oct 2007, Jeff King wrote:
>
> Try profiling the code, and you will see that the creation of the hashes
> is totally dwarfed by the comparisons. So yes, you might be able to
> speed up the creation code, but it's going to have a minimal impact on
> the overall run time.

Yeah. Oprofile is your friend.

The biggest win would be to avoid calling diffcore_count_changes() in the
first place, and we actually do that in the caller by looking at the size
of the files. However, while that prunes out the *really* obvious cases,
it's not nearly enough, since file sizes tend to fall within a fairly
limited range anyway.

What we could also do is pass in the minimum similarity score, and use
that to at least exit early. We currently pass in "delta_limit", which is
close, but we never use it, and we would probably be better off passing in
the minimum score itself so that diffcore_count_changes() can perhaps exit
early.

However, while I did think about doing that, I came to the conclusion that
we'd still always end up having to look at at least *half* the hash data
(for the default 50% score), so while it would help, it wouldn't be a
matter of orders of magnitude, and it looked like the end result would be
unnecessarily complex too.

The best option, of course, would be to use a similarity hash to make the
initial choice. I think we had one at one point, but it wasn't
fine-grained enough. Still, it might be interesting to use one as a filter
in *front* of the more expensive diffcore_count_changes() call.

We had some "similarity fingerprint" code at some point using a Rabin
polynomial. It wasn't reliable enough to be used as a direct score, but
maybe it could be used as a first-line "we know this isn't even worth
doing the expensive stuff on" check.

		Linus
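
[Editorial sketch] To make the early-exit idea above concrete, here is a
minimal sketch of what passing the minimum score down could look like.
None of this is git's actual code: the bucket structure, the function name
and the minimum_score/max_score parameters are stand-ins for whatever
diffcore-delta really keeps internally.

/*
 * Hypothetical sketch of the early-exit idea: instead of always counting
 * every matching byte, stop as soon as even a perfect match on the data
 * we have not looked at yet could no longer reach the minimum score.
 *
 * "bucket" is a stand-in for whatever per-chunk counts the real code
 * keeps; none of these names come from git itself.
 */
#include <stddef.h>

struct bucket {
	unsigned long src_bytes;	/* bytes of this chunk in the source */
	unsigned long dst_bytes;	/* bytes of this chunk in the destination */
};

/*
 * src_size must be the total of all src_bytes.  Returns the number of
 * source bytes also found in the destination, or -1 as soon as the
 * minimum score is provably unreachable.
 */
static long count_copied_with_cutoff(const struct bucket *b, size_t nr,
				     unsigned long src_size,
				     unsigned long minimum_score,
				     unsigned long max_score)
{
	/* Smallest number of copied bytes that still meets the threshold. */
	unsigned long needed =
		(unsigned long)(((double)minimum_score * src_size) / max_score);
	unsigned long copied = 0;
	unsigned long remaining = src_size;
	size_t i;

	for (i = 0; i < nr; i++) {
		unsigned long match = b[i].src_bytes < b[i].dst_bytes
			? b[i].src_bytes : b[i].dst_bytes;

		copied += match;
		remaining -= b[i].src_bytes;

		/*
		 * Even if every remaining source byte matched, we could
		 * no longer reach "needed": give up early.
		 */
		if (copied + remaining < needed)
			return -1;
	}
	return (long)copied;
}

Note that with the default 50% minimum score, "needed" is about half of
src_size, so in the worst case the loop cannot bail out before it has
walked roughly half of the source data, which is exactly why this only
helps so much.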
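
[Editorial sketch] And here is an equally rough sketch of the "filter in
front" idea: a tiny fixed-size fingerprint per blob, compared with a few
popcounts before the expensive counting is even attempted. This is not the
old Rabin-polynomial code, just an illustration; every name in it is made
up.

/*
 * Hypothetical first-line similarity filter: a 256-bit set of hashed
 * chunks per file, compared cheaply before deciding whether the
 * expensive byte counting is worth doing at all.
 */
#include <stdint.h>
#include <string.h>

#define FP_WORDS 4	/* 4 x 64 bits = 256-bit fingerprint */

struct fingerprint {
	uint64_t bits[FP_WORDS];
};

/* FNV-1a over one chunk; any cheap rolling/Rabin hash would do here. */
static uint64_t chunk_hash(const unsigned char *p, size_t len)
{
	uint64_t h = 14695981039346656037ull;
	while (len--) {
		h ^= *p++;
		h *= 1099511628211ull;
	}
	return h;
}

static void fingerprint_data(struct fingerprint *fp,
			     const unsigned char *data, size_t size)
{
	const size_t chunk = 64;	/* fixed-size chunks keep the sketch simple */

	memset(fp, 0, sizeof(*fp));
	while (size) {
		size_t n = size < chunk ? size : chunk;
		uint64_t h = chunk_hash(data, n);

		/* Each chunk sets one of the 256 bits. */
		fp->bits[(h >> 6) % FP_WORDS] |= 1ull << (h & 63);
		data += n;
		size -= n;
	}
}

/* Rough overlap estimate: shared bits as a fraction of the smaller set. */
static int fingerprints_look_similar(const struct fingerprint *a,
				     const struct fingerprint *b,
				     int percent)
{
	unsigned shared = 0, a_bits = 0, b_bits = 0, smaller;
	int i;

	for (i = 0; i < FP_WORDS; i++) {
		shared += __builtin_popcountll(a->bits[i] & b->bits[i]);
		a_bits += __builtin_popcountll(a->bits[i]);
		b_bits += __builtin_popcountll(b->bits[i]);
	}
	smaller = a_bits < b_bits ? a_bits : b_bits;
	if (!smaller)
		return 1;	/* empty files: let the real check decide */
	return shared * 100 >= smaller * (unsigned)percent;
}

The caller would compute each fingerprint once per blob, run
fingerprints_look_similar() first with a deliberately conservative
threshold (well below the real minimum score), and only fall through to
diffcore_count_changes() for the pairs that survive. Anything this coarse
can only be trusted to throw away the obviously hopeless candidates, never
to accept a pair on its own.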