On Tue, Jan 20, 2009 at 04:59:57PM +0100, Björn Steinbrink wrote: > In diffcore_rename, we assume that the blob contents in the filespec > aren't required anymore after estimate_similarity has been called and thus > we free it. But estimate_similarity might return early when the file sizes > differ too much. In that case, cnt_data is never set and the next call to > estimate_similarity will populate the filespec again, eventually rereading > the same blob over and over again. > > To fix that, we first get the blob sizes and only when the blob contents > are actually required, and when cnt_data will be set, the full filespec is > populated, once. I think this is a sane thing to do, and obviously your numbers show an impressive improvement. However, I found your explanation a little confusing. Yes, repeatedly loading those blobs is a problem. But what this is really about is that estimate_similarity has two levels of checks: cheap checks involving the size, and expensive checks involving the content. What is happening now is that we are doing extra work for the expensive checks, even if the cheap checks are going to let us fail early. And that's just stupid, and a waste of processing time even if we _don't_ ever look at the same filespec again. But what makes it even worse is that we have a system in place to cache the expensive work, but because we only do part of it, we don't bother to cache the result. And that's what your patch description is about. So I think your patch is absolutely the right thing to do. But I think from the commit message it isn't clear that it would not be equally correct to follow through on generating cnt_data instead of an early return (which _isn't_ right, because you might not need to generate cnt_data at all). > This actually affects the copy detection way more than the pure > rename detection, due to the larger number of candidates, but it's the > same code, and I only realized that when I reran the stuff to get some > numbers to show off. ;-) I have a pathological real-world rename test case from a long time ago. It's so awful on rename because almost the whole repo was reorganized, but almost every path got its content adjusted, too. I was hoping it would be better with your patch, but it isn't: as it turns out, it is _too_ uniform. The cheap size checks don't work because all of the files are almost the same size (they're all jpgs from a camera). But I think for non-pathological cases, your improvement makes sense. -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html