Re: Rename similarity scoring for file pair

Jeff King <peff@xxxxxxxx> · Sat, 17 Apr 2010 07:03:56 -0400

On Mon, Apr 12, 2010 at 11:42:59PM +0530, Debayan Banerjee wrote:

> I was going through this thread
> <http://markmail.org/message/mge2spy7uqle5ghy#query:diffcore-rename.c%20algorithm+page:1+mid:ji7jkzypzqpml2xn+state:results>
> trying to understand how the similarity scoring for modified files
> works. I see that the algorithm uses two hashes, one for each file,
> and proceeds to compare the 2 hashes to determine percentage
> similarity.
> I think I have an algorithm which uses only one hash and has a worst
> case space complexity of O(N) where N is the number of lines
> inserted/deleted/moved.

That looks pretty similar to the algorithm that Andy Chup proposed. I
did some work towards integrating it, but never completed it. See these
threads for details:

  http://thread.gmane.org/gmane.comp.version-control.git/61975

  http://thread.gmane.org/gmane.comp.version-control.git/62655

So yes, it's an interesting way to go. But it needs somebody to:

  1. Implement it in C and integrate it into diffcore-rename (I did
     that, but there were some problems, see below).

  2. Confirm that the quality of results is similar to the current
     algorithm. In my case, the results were different, and I suggested
     some tweaks to improve them (see the second thread above). I don't
     remember if I ever followed up.

  3. Confirm that we really do get a speed improvement on the slow
     cases, and don't slow down or significantly bloat memory usage for
     the fast cases.

If you're interested in working on it, I'd be happy to help with
discussion and code review.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html