Re: [PATCH] Rename detection: Avoid repeated filespec population

Jeff King <peff@xxxxxxxx> · Tue, 20 Jan 2009 16:27:23 -0500

On Tue, Jan 20, 2009 at 04:59:57PM +0100, Björn Steinbrink wrote:

> In diffcore_rename, we assume that the blob contents in the filespec
> aren't required anymore after estimate_similarity has been called and thus
> we free it. But estimate_similarity might return early when the file sizes
> differ too much. In that case, cnt_data is never set and the next call to
> estimate_similarity will populate the filespec again, eventually rereading
> the same blob over and over again.
> 
> To fix that, we first get the blob sizes and only when the blob contents
> are actually required, and when cnt_data will be set, the full filespec is
> populated, once.

I think this is a sane thing to do, and obviously your numbers show an
impressive improvement.

However, I found your explanation a little confusing. Yes, repeatedly
loading those blobs is a problem. But what this is really about is that
estimate_similarity has two levels of checks: cheap checks involving the
size, and expensive checks involving the content. What is happening now
is that we are doing extra work for the expensive checks, even if the
cheap checks are going to let us fail early.

And that's just stupid, and a waste of processing time even if we
_don't_ ever look at the same filespec again.

But what makes it even worse is that we have a system in place to cache
the expensive work, but because we only do part of it, we don't bother
to cache the result. And that's what your patch description is about.

So I think your patch is absolutely the right thing to do. But I think
from the commit message it isn't clear that it would not be equally
correct to follow through on generating cnt_data instead of an early
return (which _isn't_ right, because you might not need to generate
cnt_data at all).

> This actually affects the copy detection way more than the pure
> rename detection, due to the larger number of candidates, but it's the
> same code, and I only realized that when I reran the stuff to get some
> numbers to show off. ;-)

I have a pathological real-world rename test case from a long time ago.
It's so awful on rename because almost the whole repo was reorganized,
but almost every path got its content adjusted, too. I was hoping it
would be better with your patch, but it isn't: as it turns out, it is
_too_ uniform. The cheap size checks don't work because all of the files
are almost the same size (they're all jpgs from a camera).

But I think for non-pathological cases, your improvement makes sense.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html