Linus Torvalds <torvalds@xxxxxxxx> writes:

> Instead of doing a fixed-chunk thing and saying that any copy is
> equivalent to any other copy. That's simply not true. It's _much_ better
> to have one 24-byte copy than it is to have three 8-byte copies, but the
> new faster diffcore-delta.c just can't see that.

Exactly.

You know what? Once we start counting to detect a 24-byte straight copy
and try distinguishing it from 3 separate 8-byte copies, it will
eventually lead us to what we have in diff-delta.c anyway. I avoided
counting runs of bytes on purpose because I wanted to see how far we
could go without it.

The primary reason I started the jc/diff topic branch was that we
_might_ want to replace what is in the current diff-delta.c with much
finer-grained comparison code, and when that happens, counting xdelta
output for rename detection purposes would have stopped making sense.
For now we have decided to postpone that for performance reasons, but we
still might want to do it when Nico comes back with a better
implementation.

Now, I know the current diff-delta based similarity estimator we have in
"main" seems to do a reasonable, if not perfect, job within a reasonable
amount of time, and it does know how to count copying of consecutive
bytes. In the worst case we could just fork the xdelta part of the code
when Nico comes back with an improved finer-grained delta, and keep
using the current diff-delta code for rename detection.

Knowing we have that fallback position, I wanted to pursue a different
avenue. Distinguishing a straight 24-byte run from three independent
8-byte runs, by using a hash to find the offset in the source and then
doing a maximum string match, is something we already know how to do,
because that is essentially what the current diff-delta code does.

By the way, the reason the diffcore-delta code in "next" does not do an
every-eight-bytes hash on the source material is to somewhat alleviate
the problem that comes from not detecting copies of consecutive byte
ranges.
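To illustrate what "use a hash to find the offset in the source and then
do a maximum string match" means, here is a minimal sketch. This is not
git's actual diff-delta.c code and the names are made up; a real
implementation would index the source windows in a hash table, while
this sketch brute-forces the seed lookup to stay short. The point is
that once an 8-byte seed is located, the match is extended as far as it
goes, so one 24-byte copy is reported as a single 24-byte match rather
than three unrelated 8-byte ones:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SEED 8  /* minimum window that identifies a copy candidate */

/*
 * Hypothetical helper: find the first position in src whose SEED bytes
 * match the start of dst, then greedily extend the match byte by byte.
 * Returns the length of the maximal match, or 0 if no seed matches.
 */
static size_t longest_copy_at(const char *src, size_t slen,
                              const char *dst, size_t dlen)
{
	size_t i, n;

	for (i = 0; i + SEED <= slen; i++) {
		if (memcmp(src + i, dst, SEED))
			continue;
		/* seed found; extend the match as far as it goes */
		n = SEED;
		while (i + n < slen && n < dlen && src[i + n] == dst[n])
			n++;
		return n;
	}
	return 0;
}
```

With this, a straight 24-byte run copied from the source scores as one
24-byte match, which is exactly the distinction a fixed-chunk counter
cannot see.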
If you have an 8-byte run that is copied from source to destination, we
give it one point (let's forget for now about false matches coming from
hash collisions). Since the source material is hashed at every byte
offset, if a 9-byte run is copied from source to destination, that is
awarded two points: we award one point for the first 8 bytes, and then
another point for the 8-byte sequence starting at the second byte (we
are talking about an overlapping range). That way, the code does reward
copying of consecutive bytes more heavily than copying things from
random places. At the other extreme, if you copy 7 bytes, throw in a
garbage byte, copy another 7 bytes, throw in another garbage byte, and
keep going, you do not get any points at all.

It's really a funky heuristic, and as you have seen, it sometimes gives
spectacularly phony matches. But in practice, with some tweaking, it
seems to do an OK job.
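The scoring scheme described above can be sketched as follows. This is a
hypothetical toy, not the actual diffcore-delta.c: like the simplified
description, it ignores hash collisions and duplicate windows. Every
8-byte window of the source is hashed at every byte offset, and the
destination earns one point per 8-byte window (again at every byte
offset) whose hash appears in the source:

```c
#include <stdint.h>
#include <stdlib.h>

#define WINDOW 8

/* Toy polynomial hash over one 8-byte window (made-up constants). */
static uint64_t hash_window(const char *p)
{
	uint64_t h = 0;
	int i;

	for (i = 0; i < WINDOW; i++)
		h = h * 131 + (unsigned char)p[i];
	return h;
}

/*
 * Score = number of 8-byte windows of dst, taken at every byte offset,
 * whose hash also appears among the 8-byte windows of src, also taken
 * at every byte offset. A 9-byte copied run therefore scores 2 (two
 * overlapping windows), while 7 copied bytes followed by a garbage
 * byte, repeated, scores 0.
 */
static int copy_score(const char *src, size_t slen,
		      const char *dst, size_t dlen)
{
	size_t nsrc = slen < WINDOW ? 0 : slen - WINDOW + 1;
	uint64_t *tab = malloc(nsrc * sizeof(*tab));
	size_t i, j;
	int score = 0;

	for (i = 0; i < nsrc; i++)
		tab[i] = hash_window(src + i);

	for (j = 0; dlen >= WINDOW && j <= dlen - WINDOW; j++) {
		uint64_t h = hash_window(dst + j);

		for (i = 0; i < nsrc; i++) {
			if (tab[i] == h) {
				score++;
				break;
			}
		}
	}
	free(tab);
	return score;
}
```

Note how the all-or-nothing 8-byte window is what makes the heuristic
blind to runs just under the window size, and conversely what makes it
over-reward long consecutive runs relative to scattered small copies.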