Re: Fix up diffcore-rename scoring

Linus Torvalds <torvalds@xxxxxxxx> · Mon, 13 Mar 2006 07:38:53 -0800 (PST)

On Mon, 13 Mar 2006, Junio C Hamano wrote:
> 
> By the way, the reason the diffcore-delta code in "next" does
> not do every-eight-bytes hash on the source material is to
> somewhat alleviate the problem that comes from not detecting
> copying of consecutive byte ranges.

Yes. However, there are better ways to do that in practice.

The most effective approach in general use is to not use a fixed 
chunk size, but a terminating character, together with a 
minimum/maximum chunk size.

There's a pretty natural terminating character that works well for 
sources: '\n'.

So the natural way to do similarity detection when most of the code is 
line-based is to do the hashing on chunks that follow the rule "minimum of 
<n> bytes, maximum of <2*n> bytes, try to begin/end at a \n".

So if you don't see any '\n' at all in your current window (or the only 
ones are less than <n> bytes into it), do the hash over a full <2*n>-byte 
chunk (this takes care of binaries and/or long lines).
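
As a rough illustration of the rule above, here is a minimal sketch in 
Python. It is not the diffcore-delta code: the names split_chunks and 
chunk_hashes are made up for the example, and CRC32 is only a stand-in 
for whatever hash would really be used. The last chunk of a file may be 
shorter than <n> bytes.

```python
import zlib


def split_chunks(data: bytes, n: int) -> list:
    """Split data into chunks of at least n and at most 2*n bytes,
    ending a chunk just after a newline whenever one is available."""
    chunks = []
    pos = 0
    while pos < len(data):
        window_end = min(pos + 2 * n, len(data))
        # First newline that still leaves the chunk at least n bytes long.
        nl = data.find(b"\n", pos + n - 1, window_end)
        # No usable newline: fall back to the full 2*n window
        # (binaries, very long lines, or the tail of the file).
        end = nl + 1 if nl != -1 else window_end
        chunks.append(data[pos:end])
        pos = end
    return chunks


def chunk_hashes(data: bytes, n: int) -> list:
    """Hash every chunk; CRC32 is an assumption made for this sketch."""
    return [zlib.crc32(chunk) for chunk in split_chunks(data, n)]
```

With n = 4, the input b"aaaa\nbbbb\ncccc\n" splits into its three 
lines, while 20 bytes without any newline split into chunks of 8, 8 and 
4 bytes.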

For source code, this lets you ignore trivial byte-offset shifts, because 
you have a character that is used for synchronization. So you don't need 
to do hashing at every byte offset in both files: in practice you end up 
doing the hashing only at line boundaries. And it still _works_ for 
binary files, although you effectively need bigger identical chunks to 
find similarities. For text files it finds similarities of size <n>; for 
binaries the similar regions effectively need to be of size 3*n, because 
you chunk it up at ~2*n, and only generate the hash at certain offsets in 
the source binary.
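
To make the synchronization point concrete, the following Python sketch 
compares newline-terminated chunks against fixed-size chunks after one 
line is inserted at the top of a text file. Everything in it is an 
assumption made for the illustration: the helper names, the CRC32 hash, 
the "fraction of shared chunk hashes" score and the synthetic test data. 
It is not how diffcore-rename actually scores similarity.

```python
import hashlib
import zlib
from collections import Counter


def newline_chunks(data: bytes, n: int) -> list:
    """Chunks of n..2*n bytes that end just after a newline when possible."""
    chunks, pos = [], 0
    while pos < len(data):
        window_end = min(pos + 2 * n, len(data))
        nl = data.find(b"\n", pos + n - 1, window_end)
        end = nl + 1 if nl != -1 else window_end
        chunks.append(data[pos:end])
        pos = end
    return chunks


def fixed_chunks(data: bytes, n: int) -> list:
    """Plain fixed-size chunks, with no synchronization character."""
    return [data[i:i + n] for i in range(0, len(data), n)]


def shared_fraction(src: list, dst: list) -> float:
    """Fraction of dst chunks whose hash also occurs in src."""
    src_hashes = Counter(zlib.crc32(c) for c in src)
    dst_hashes = Counter(zlib.crc32(c) for c in dst)
    common = sum((src_hashes & dst_hashes).values())
    return common / max(1, len(dst))


def make_line(i: int) -> bytes:
    """Synthetic, non-repeating line of 21 to 28 bytes including the newline."""
    digest = hashlib.sha1(str(i).encode()).hexdigest()
    return digest[:20 + i % 8].encode() + b"\n"


old = b"".join(make_line(i) for i in range(50))
new = b"an inserted line\n" + old  # shifts every later byte offset by 17

N = 16
line_score = shared_fraction(newline_chunks(old, N), newline_chunks(new, N))
fixed_score = shared_fraction(fixed_chunks(old, N), fixed_chunks(new, N))
print(f"newline-synchronized chunks: {line_score:.2f}")
print(f"fixed {N}-byte chunks:        {fixed_score:.2f}")
```

Because every line here is between <n> and <2*n> bytes long, each line 
becomes its own chunk, so nearly all chunks of the new file still match 
after the insertion. The fixed-size chunks are all shifted by 17 bytes 
and no longer line up with anything in the old file.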

		Linus