Linus Torvalds <torvalds@xxxxxxxx> writes:

> Instead of doing a fixed-chunk thing and saying that any copy is
> equivalent to any other copy. That's simply not true. It's _much_ better
> to have one 24-byte copy than it is to have three 8-byte copies, but the
> new faster diffcore-delta.c just can't see that.

Exactly.

You know what? Once we start counting to detect a 24-byte straight copy
and try distinguishing it from 3 separate 8-byte copies, it will
eventually lead us to what we have in diff-delta.c anyway. I avoided
counting runs of bytes on purpose because I wanted to see how far we
could go without it.

The primary reason I started the jc/diff topic branch was that we
_might_ want to replace what is in the current diff-delta.c with much
finer-grained comparison code, and when that happens, counting xdelta
output for rename detection purposes would have stopped making sense.
For now we have decided to postpone that for performance reasons, but we
still might want to do it when Nico comes back with a better
implementation.

Now, I know the current diff-delta based similarity estimator we have in
"main" seems to do a reasonable, if not perfect, job within a reasonable
amount of time, and it does know how to count copying of consecutive
bytes. In the worst case we could just fork the xdelta part of the code
when Nico comes back with an improved finer-grained delta, and keep
using the current diff-delta code for rename detection.

Knowing we have that fallback position, I wanted to pursue a different
avenue. Distinguishing a straight 24-byte run from three independent
8-byte runs, by using a hash to find the offset in the source and then
doing a maximum string match, is something we already know how to do,
because that is essentially what the current diff-delta code does.

By the way, the reason the diffcore-delta code in "next" does not do an
every-eight-bytes hash on the source material is to somewhat alleviate
the problem that comes from not detecting copies of consecutive byte
ranges.
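To illustrate what "use a hash to find the offset in the source and then
do a maximum string match" means, here is a minimal sketch. This is not
git's actual diff-delta.c code and the names are made up; a real
implementation would index the source windows in a hash table, while
this sketch brute-forces the seed lookup to stay short. The point is
that once an 8-byte seed is located, the match is extended as far as it
goes, so one 24-byte copy is reported as a single 24-byte match rather
than three unrelated 8-byte ones:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SEED 8  /* minimum window that identifies a copy candidate */

/*
 * Hypothetical helper: find the first position in src whose SEED bytes
 * match the start of dst, then greedily extend the match byte by byte.
 * Returns the length of the maximal match, or 0 if no seed matches.
 */
static size_t longest_copy_at(const char *src, size_t slen,
                              const char *dst, size_t dlen)
{
	size_t i, n;

	for (i = 0; i + SEED <= slen; i++) {
		if (memcmp(src + i, dst, SEED))
			continue;
		/* seed found; extend the match as far as it goes */
		n = SEED;
		while (i + n < slen && n < dlen && src[i + n] == dst[n])
			n++;
		return n;
	}
	return 0;
}
```

With this, a straight 24-byte run copied from the source scores as one
24-byte match, which is exactly the distinction a fixed-chunk counter
cannot see.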
If you have an 8-byte run that is copied from source to destination, we
give it one point (let's forget for now about false matches coming from
hash collisions). Since the source material is hashed at every byte
offset, if a 9-byte run is copied from source to destination, that is
awarded two points: we award one point for the first 8 bytes, and then
another point for the 8-byte sequence starting at the second byte (we
are talking about an overlapping range). That way, the code does reward
copying of consecutive bytes more heavily than copying things from
random places. At the other extreme, if you copy 7 bytes, throw in a
garbage byte, copy another 7 bytes, throw in another garbage byte, and
keep going, you do not get any points at all.

It's really a funky heuristic, and as you have seen, it sometimes gives
spectacularly phony matches. But in practice, with some tweaking, it
seems to do an OK job.
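The scoring scheme described above can be sketched as follows. This is a
hypothetical toy, not the actual diffcore-delta.c: like the simplified
description, it ignores hash collisions and duplicate windows. Every
8-byte window of the source is hashed at every byte offset, and the
destination earns one point per 8-byte window (again at every byte
offset) whose hash appears in the source:

```c
#include <stdint.h>
#include <stdlib.h>

#define WINDOW 8

/* Toy polynomial hash over one 8-byte window (made-up constants). */
static uint64_t hash_window(const char *p)
{
	uint64_t h = 0;
	int i;

	for (i = 0; i < WINDOW; i++)
		h = h * 131 + (unsigned char)p[i];
	return h;
}

/*
 * Score = number of 8-byte windows of dst, taken at every byte offset,
 * whose hash also appears among the 8-byte windows of src, also taken
 * at every byte offset. A 9-byte copied run therefore scores 2 (two
 * overlapping windows), while 7 copied bytes followed by a garbage
 * byte, repeated, scores 0.
 */
static int copy_score(const char *src, size_t slen,
		      const char *dst, size_t dlen)
{
	size_t nsrc = slen < WINDOW ? 0 : slen - WINDOW + 1;
	uint64_t *tab = malloc(nsrc * sizeof(*tab));
	size_t i, j;
	int score = 0;

	for (i = 0; i < nsrc; i++)
		tab[i] = hash_window(src + i);

	for (j = 0; dlen >= WINDOW && j <= dlen - WINDOW; j++) {
		uint64_t h = hash_window(dst + j);

		for (i = 0; i < nsrc; i++) {
			if (tab[i] == h) {
				score++;
				break;
			}
		}
	}
	free(tab);
	return score;
}
```

Note how the all-or-nothing 8-byte window is what makes the heuristic
blind to runs just under the window size, and conversely what makes it
over-reward long consecutive runs relative to scattered small copies.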