Re: Fix up diffcore-rename scoring

Junio C Hamano <junkio@xxxxxxx> · Tue, 11 Apr 2006 15:04:17 -0700

Geert Bosch <bosch@xxxxxxxxxxx> writes:

> On Mar 13, 2006, at 02:44, Linus Torvalds wrote:
>
>> It might be that the fast delta thing is a good way to ask
>> "is this even worth considering", to cut down the O(m*n)
>> rename/copy detection to something much smaller, and then use
>> xdelta() to actually figure out what is a good rename and
>> what isn't from a much smaller set of potential targets.
>
>
> Here's a possible way to do that first cut. Basically,
> compute a short (256-bit) fingerprint for each file, such
> that the Hamming distance between two fingerprints is a measure
> for their similarity. I'll include a draft write up below.

Thanks for starting this.

There are a few things I need to talk about the way "similarity"
is _used_ in the current algorithms.

Rename/copy detection outputs "similarity" but I suspect what
the algorithm wants is slightly different from what humans think
of "similarity".  It is somewhere between "similarity" and
"commonness".  When you are grading a 130-page report a student
submitted, you would want to notice that last 30 pages are
almost verbatim copy from somebody else's report.  The student
in question added 100-page original contents so maybe this is
not too bad, but if the report were a 30-page one, and the
entier 30 pages were borrowed from somebody else's 130-page
report, you would _really_ want to notice.

While reorganizaing a program, a nontrivial amount of text is
often removed from an existing file and moved to a newly created
file.  Right now, the way similarity score is calculated has a
heuristical cap to reject two files whose sizes are very
different, but to detect and show this kind of file split, the
sizes of files should matter less.

On the other hand, taking this "commonness matters" to its
extreme is not what we want.  We are producing "diff", so if a
30-line new file was created by moving these 30 lines from
originally 130-line file (which is now 100 lines long), showing
it as "copy from the-130-line-file" with a diff to remove
100-lines is usually not what we want.  That's why the size cap
makes sense in rename similarity estimator.

Another place we use "similarity" is to break a file that got
modified too much.  This is done for two independent purposes.

One is to detect a case like this:

	mv B C
        mv A B
        small-edit B

File B's content is not related to what it had originally, but
is derived from what was originally in A.  Usually rename/copy
detection tries to find rename/copy into files that _disappear_
from the result, but with the above sequence, B never
disappears.  By looking at how dissimilar the preimage and
postimage of B are, we tell the rename/copy detector that B,
although it does not disappear, might have been renamed/copied
from somewhere else.

Another is to present the final diff output as a complete
rewrite.  When -B (break) is used without -M (rename) or -C
(copy), or a file that got a lot of edit and got "broken" turned
out to be purely a total edit (i.e. not renamed/copied from
somewhere else), we would present it as diff output that has
only one hunk, with bunch of '-' (removal) to remove all
original contents first and then '+' (addition) to add all the
new contents, which is often easier to read than ordinary
unidiff between two unrelated contents that matches up lines
that happen to be the same.  Empirically, it seems to give
better result if the "similarity" threshold to "break" a file
(i.e. to consider it might have been renamed/copied from
somewhere else) is set lower than the threashold to show the
diff as a complete rewrite patch.

Also we can make commonness matter even more in the similarlity
used to "break" a file than rename detector, because if we are
going to break it, we will not have to worry about the issue of
showing an annoying diff that removes 100 lines after copying a
130-line file.  This implies that the break algorithm needs to
use two different kinds of similarity, one for breaking and then
another for deciding how to show the broken pieces as a diff.

Sorry if this write-up does not make much sense.  It ended up
being a lot more incoherent than I hoped it to be.

Anyway, sometime this week I'll find time to play with your code
myself.

-
: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html