On Mon, Apr 12, 2010 at 11:42:59PM +0530, Debayan Banerjee wrote: > I was going through this thread > <http://markmail.org/message/mge2spy7uqle5ghy#query:diffcore-rename.c%20algorithm+page:1+mid:ji7jkzypzqpml2xn+state:results> > trying to understand how the similarity scoring for modified files > works. I see that the algorithm uses two hashes, one for each file, > and proceeds to compare the 2 hashes to determine percentage > similarity. > I think I have an algorithm which uses only one hash and has a worst > case space complexity of O(N) where N is the number of lines > inserted/deleted/moved. That looks pretty similar to the algorithm that Andy Chup proposed. I did some work towards integrating it, but never completed it. See these threads for details: http://thread.gmane.org/gmane.comp.version-control.git/61975 http://thread.gmane.org/gmane.comp.version-control.git/62655 So yes, it's an interesting way to go. But it needs somebody to: 1. Implement it in C and integrate it into diffcore-rename (I did that, but there were some problems, see below). 2. Confirm that the quality of results is similar to the current algorithm. In my case, the results were different, and I suggested some tweaks to improve them (see the second thread above). I don't remember if I ever followed up. 3. Confirm that we really do get a speed improvement on the slow cases, and don't slow down or significantly bloat memory usage for the fast cases. If you're interested in working on it, I'd be happy to help with discussion and code review. -Peff -- To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html