On Jan 29, 2008, at 8:40 PM, Junio C Hamano wrote:
When there are N deleted paths and M created paths, we used to allocate (N x M) "struct diff_score" entries that record how similar each pair is, picked the <src,dst> pair that gives the best match first, and then went on to process worse matches. This sorting is done so that when two new files in the postimage are similar to the same file deleted from the preimage, we can process the more similar one first, and when processing the second one it can notice "Ah, the source I was planning to say I am a copy of is already taken by somebody else" and go on to match itself with another file in the preimage with a lesser match. This matters to a change introduced between the 1.5.3.X series and 1.5.4-rc, which lets the code favor unused matches first and then fall back to using already used matches.

This instead allocates and keeps only a handful of rename source candidates per new file in the postimage, i.e. it reduces the memory requirement from O(N x M) to O(M). For each dst, we compute similarity with all sources (i.e. the number of similarity estimate computations is still O(N x M)), but we keep only a handful of the best src candidates for each dst.
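If I understand the new approach correctly, the per-destination bookkeeping is something like the following (just my own sketch, names and constants invented, not the actual diffcore-rename code; assumes the slots start out zeroed):

/* Keep only a few best-scoring rename sources per destination
 * instead of the full N x M score matrix. */

#define NUM_CANDIDATES 4   /* the "handful" of sources kept per dst */

struct candidate {
	int src;    /* index into the deleted-path list */
	int score;  /* similarity estimate, higher is better */
};

struct dst_slot {
	struct candidate best[NUM_CANDIDATES];
};

/* Offer (src, score) to dst; keep it only if it beats the current
 * worst of the handful we already hold. */
static void consider(struct dst_slot *dst, int src, int score)
{
	int i, worst = 0;

	for (i = 1; i < NUM_CANDIDATES; i++)
		if (dst->best[i].score < dst->best[worst].score)
			worst = i;
	if (score > dst->best[worst].score) {
		dst->best[worst].src = src;
		dst->best[worst].score = score;
	}
}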
I can think of cases where you'll throw away better candidates this way. How about using a priority queue of size max(N, M)?
I don't know the details of the current algorithm, but it seems to me that a naive Rabin-Karp fingerprinting approach would not use too much memory: say L is the total number of bytes of the created files, the fingerprint size is S, and each hash is 4 bytes. To keep track of M files you only need to keep about 8 * (L/S + M) bytes (each entry being a 4-byte hash plus an additional 4 bytes as an index into the file names; the M term covers the filenames), plus some overhead for the hash table, in memory. With one pass through D (the number of bytes of the deleted files) you can get the N x M scores. The score is defined as Wf * Mf + Wt, where Wf is the weight for the fingerprint match and Wt is the weight for the title match; Mf is the fingerprint match score = (number of matching fingerprints) / (number of fingerprints of the original (deleted) file). Wf and Wt can be tuned to boost exact basename matches.
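Roughly, the fingerprinting side would look something like this (window size S, 8 bytes per table entry; constants and names are made up for illustration, this isn't git code):

#include <stdint.h>
#include <stddef.h>

#define S 64           /* fingerprint window size in bytes */
#define RK_BASE 257u   /* polynomial base for the rolling hash */

struct fingerprint {
	uint32_t hash;     /* 4-byte fingerprint */
	uint32_t file_idx; /* 4-byte index into the created-file list */
};

/* Hash of one S-byte window: h = b[0]*B^(S-1) + ... + b[S-1], mod 2^32. */
static uint32_t rk_hash(const unsigned char *p)
{
	uint32_t h = 0;
	size_t i;

	for (i = 0; i < S; i++)
		h = h * RK_BASE + p[i];
	return h;
}

/* Roll the window one byte forward: drop 'out', append 'in'.
 * 'pow' must be RK_BASE^(S-1), precomputed mod 2^32. */
static uint32_t rk_roll(uint32_t h, unsigned char out, unsigned char in,
			uint32_t pow)
{
	return (h - out * pow) * RK_BASE + in;
}

/* Emit one fingerprint per non-overlapping S-byte chunk of a created
 * file; roughly len/S entries of 8 bytes each go into the table. */
static size_t fingerprint_file(const unsigned char *buf, size_t len,
			       uint32_t file_idx, struct fingerprint *out)
{
	size_t i, n = 0;

	for (i = 0; i + S <= len; i += S) {
		out[n].hash = rk_hash(buf + i);
		out[n].file_idx = file_idx;
		n++;
	}
	return n;
}

The scoring pass would then slide rk_roll() over each deleted file, look every hash up in a table keyed on fingerprint, and bump a per-(deleted, created) match counter; that is the single pass through D mentioned above.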
By pushing the scores into a priority queue of size K = max(N, M), you get the final top K scores at the end. The computational complexity is really O(D + L + (M x N) log K), and the memory requirement is O(L).
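The priority queue I have in mind is just a bounded min-heap keyed on score, roughly like this (again a sketch with invented names; the caller allocates entry[] with capacity K):

#include <stddef.h>

struct rename_score {
	int src, dst;
	int score;
};

struct topk_heap {
	struct rename_score *entry; /* array of capacity K */
	size_t len, cap;            /* cap == K */
};

static void sift_down(struct topk_heap *h, size_t i)
{
	for (;;) {
		size_t l = 2 * i + 1, r = l + 1, min = i;
		struct rename_score tmp;

		if (l < h->len && h->entry[l].score < h->entry[min].score)
			min = l;
		if (r < h->len && h->entry[r].score < h->entry[min].score)
			min = r;
		if (min == i)
			break;
		tmp = h->entry[i];
		h->entry[i] = h->entry[min];
		h->entry[min] = tmp;
		i = min;
	}
}

/* Push a score; once the heap is full, keep the new entry only if it
 * beats the current minimum.  Each push is O(log K). */
static void topk_push(struct topk_heap *h, struct rename_score s)
{
	if (h->len < h->cap) {
		size_t i = h->len++;

		h->entry[i] = s;
		while (i && h->entry[(i - 1) / 2].score > h->entry[i].score) {
			struct rename_score tmp = h->entry[i];

			h->entry[i] = h->entry[(i - 1) / 2];
			h->entry[(i - 1) / 2] = tmp;
			i = (i - 1) / 2;
		}
	} else if (s.score > h->entry[0].score) {
		h->entry[0] = s;
		sift_down(h, 0);
	}
}

When all (dst, src) scores have been pushed, the heap holds the K best matches, which can then be assigned in descending score order just as the current sort does.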
You can compute the renames for the entire Linux source tree (24K files, 260MB total uncompressed) this way using only about 92MB of memory in 18 seconds (limited by hash lookup speed, assuming 15M lookups per second based on my past experience).
--
Luke