On Wed, 3 Oct 2007, Jeff King wrote:
>
> Try profiling the code, and you will see that the creation of the hashes
> is totally dwarfed by the comparisons. So yes, you might be able to
> speed up the creation code, but it's going to have a minimal impact on
> the overall run time.

Yeah. Oprofile is your friend.

The biggest win would be to avoid calling diffcore_count_changes() in the
first place, and we actually do that in the caller by looking at the size
of the files. However, while that prunes out the *really* obvious cases,
it's not nearly enough, since file sizes tend to fall within a fairly
limited range anyway.

What we could also do is pass in the minimum similarity score, and use
that to at least exit early. We currently pass in "delta_limit", which is
close, but we never use it, and we would probably be better off passing in
the minimum score itself so that diffcore_count_changes() can perhaps exit
early.

However, while I did think about doing that, I came to the conclusion that
we'd still always end up having to look at at least *half* the hash data
(for the default 50% score), so while it would help, it wouldn't be a
matter of orders of magnitude, and it looked like the end result would be
unnecessarily complex too.

The best option, of course, would be to use a similarity hash to make the
initial choice. I think we had one at one point, but it wasn't
fine-grained enough. Still, it might be interesting to use one as a filter
in *front* of the more expensive diffcore_count_changes() call.

We had some "similarity fingerprint" code at some point using a Rabin
polynomial. It wasn't reliable enough to be used as a direct score, but
maybe it could be used as a first-line "we know this isn't even worth
doing the expensive stuff on" check.

		Linus
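
[Editorial sketch] To make the early-exit idea above concrete, here is a
minimal sketch of what passing the minimum score down could look like.
None of this is git's actual code: the bucket structure, the function name
and the minimum_score/max_score parameters are stand-ins for whatever
diffcore-delta really keeps internally.

/*
 * Hypothetical sketch of the early-exit idea: instead of always counting
 * every matching byte, stop as soon as even a perfect match on the data
 * we have not looked at yet could no longer reach the minimum score.
 *
 * "bucket" is a stand-in for whatever per-chunk counts the real code
 * keeps; none of these names come from git itself.
 */
#include <stddef.h>

struct bucket {
	unsigned long src_bytes;	/* bytes of this chunk in the source */
	unsigned long dst_bytes;	/* bytes of this chunk in the destination */
};

/*
 * src_size must be the total of all src_bytes.  Returns the number of
 * source bytes also found in the destination, or -1 as soon as the
 * minimum score is provably unreachable.
 */
static long count_copied_with_cutoff(const struct bucket *b, size_t nr,
				     unsigned long src_size,
				     unsigned long minimum_score,
				     unsigned long max_score)
{
	/* Smallest number of copied bytes that still meets the threshold. */
	unsigned long needed =
		(unsigned long)(((double)minimum_score * src_size) / max_score);
	unsigned long copied = 0;
	unsigned long remaining = src_size;
	size_t i;

	for (i = 0; i < nr; i++) {
		unsigned long match = b[i].src_bytes < b[i].dst_bytes
			? b[i].src_bytes : b[i].dst_bytes;

		copied += match;
		remaining -= b[i].src_bytes;

		/*
		 * Even if every remaining source byte matched, we could
		 * no longer reach "needed": give up early.
		 */
		if (copied + remaining < needed)
			return -1;
	}
	return (long)copied;
}

Note that with the default 50% minimum score, "needed" is about half of
src_size, so in the worst case the loop cannot bail out before it has
walked roughly half of the source data, which is exactly why this only
helps so much.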
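
[Editorial sketch] And here is an equally rough sketch of the "filter in
front" idea: a tiny fixed-size fingerprint per blob, compared with a few
popcounts before the expensive counting is even attempted. This is not the
old Rabin-polynomial code, just an illustration; every name in it is made
up.

/*
 * Hypothetical first-line similarity filter: a 256-bit set of hashed
 * chunks per file, compared cheaply before deciding whether the
 * expensive byte counting is worth doing at all.
 */
#include <stdint.h>
#include <string.h>

#define FP_WORDS 4	/* 4 x 64 bits = 256-bit fingerprint */

struct fingerprint {
	uint64_t bits[FP_WORDS];
};

/* FNV-1a over one chunk; any cheap rolling/Rabin hash would do here. */
static uint64_t chunk_hash(const unsigned char *p, size_t len)
{
	uint64_t h = 14695981039346656037ull;
	while (len--) {
		h ^= *p++;
		h *= 1099511628211ull;
	}
	return h;
}

static void fingerprint_data(struct fingerprint *fp,
			     const unsigned char *data, size_t size)
{
	const size_t chunk = 64;	/* fixed-size chunks keep the sketch simple */

	memset(fp, 0, sizeof(*fp));
	while (size) {
		size_t n = size < chunk ? size : chunk;
		uint64_t h = chunk_hash(data, n);

		/* Each chunk sets one of the 256 bits. */
		fp->bits[(h >> 6) % FP_WORDS] |= 1ull << (h & 63);
		data += n;
		size -= n;
	}
}

/* Rough overlap estimate: shared bits as a fraction of the smaller set. */
static int fingerprints_look_similar(const struct fingerprint *a,
				     const struct fingerprint *b,
				     int percent)
{
	unsigned shared = 0, a_bits = 0, b_bits = 0, smaller;
	int i;

	for (i = 0; i < FP_WORDS; i++) {
		shared += __builtin_popcountll(a->bits[i] & b->bits[i]);
		a_bits += __builtin_popcountll(a->bits[i]);
		b_bits += __builtin_popcountll(b->bits[i]);
	}
	smaller = a_bits < b_bits ? a_bits : b_bits;
	if (!smaller)
		return 1;	/* empty files: let the real check decide */
	return shared * 100 >= smaller * (unsigned)percent;
}

The caller would compute each fingerprint once per blob, run
fingerprints_look_similar() first with a deliberately conservative
threshold (well below the real minimum score), and only fall through to
diffcore_count_changes() for the pairs that survive. Anything this coarse
can only be trusted to throw away the obviously hopeless candidates, never
to accept a pair on its own.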