Re: [PATCH] Optimize rename detection for a huge diff

On Jan 29, 2008, at 10:57 PM, Luke Lu wrote:
On Jan 29, 2008, at 8:40 PM, Junio C Hamano wrote:
When there are N deleted paths and M created paths, we used to
allocate (N x M) "struct diff_score" that record how similar
each <src,dst> pair is, pick the pair that gives the best match
first, and then go on to process worse matches.

This sorting is done so that when two new files in the postimage
are similar to the same file deleted from the preimage, we can
process the more similar one first, and when processing the
second one, it can notice "Ah, the source I was planning to say
I am a copy of is already taken by somebody else" and go on to
match itself with another file in the preimage with a lesser
match.  This matters to a change introduced between the 1.5.3.X
series and 1.5.4-rc, which lets the code favor unused matches
first and then fall back to using already used matches.

This instead allocates and keeps only a handful of rename source
candidates per new file in the postimage.  I.e., it reduces the
memory requirement from O(N x M) to O(M).

For each dst, we compute similarity with all sources (i.e. the
number of similarity estimate computations is still O(N x M)),
but we keep only a handful of the best src candidates for each dst.
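
Roughly, the per-dst bookkeeping could look something like this
(a minimal sketch; the candidate limit and the struct/function
names are made up for illustration, not the actual diffcore-rename
code):

#define NUM_CANDIDATES 4  /* hypothetical per-dst limit, not git's actual value */

struct candidate {
        int src;    /* index of a deleted (preimage) path; slots start as { -1, 0 } */
        int score;  /* similarity estimate; higher is better */
};

/*
 * Keep, for a single dst, only the NUM_CANDIDATES best sources seen
 * so far.  cand[] is kept sorted, best first, so the table over all
 * dsts stays O(M) in size even though O(N x M) scores are computed.
 */
static void record_if_better(struct candidate *cand, int src, int score)
{
        int i;

        if (score <= cand[NUM_CANDIDATES - 1].score)
                return;                         /* worse than everything kept so far */
        for (i = NUM_CANDIDATES - 1; i > 0 && cand[i - 1].score < score; i--)
                cand[i] = cand[i - 1];          /* shift worse entries down */
        cand[i].src = src;
        cand[i].score = score;
}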

I can think of cases where you'll throw away better candidates this way. How about using a priority queue of size max(N, M)?

I don't know the details of the current algorithm, but it seems to me that a naive Rabin-Karp fingerprinting approach would not use too much memory. Say L is the total number of bytes in the created files, S is the fingerprint size, and the hash is 4 bytes. To keep track of M files you only need to keep 8 * (L/S + M) bytes (the extra 4 bytes per entry are an index into the file names, and the M entries cover the file names themselves), plus some overhead for the hash table. With one pass through D (the total number of bytes in the deleted files) you can get the N x M scores. The score is defined as Wf * Mf + Wt, where Wf is the weight for the fingerprint match and Wt is the weight for the title match; Mf is the fingerprint match score, i.e. (number of matching fingerprints) / (number of fingerprints of the original (deleted) file). Wf and Wt can be tuned to boost exact basename matches.
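
To make that concrete, a rough sketch of the two passes might look like the following (the names, hash function, and fixed table size are all made up for illustration; it assumes non-overlapping S-byte chunks on the created side and a rolling window over every byte offset on the deleted side):

#include <stdint.h>
#include <stdlib.h>

#define S 64                  /* fingerprint (chunk) size in bytes */
#define B 0x01000193u         /* hash base; arithmetic wraps mod 2^32 */
#define TABLE_SIZE (1u << 20) /* toy fixed-size table; real code would size it to L/S */

struct fp_entry {
        uint32_t fp;           /* 4-byte fingerprint of one S-byte chunk */
        uint32_t dst;          /* 4-byte index of the created file it came from */
        struct fp_entry *next;
};
static struct fp_entry *table[TABLE_SIZE];

/* Polynomial hash of one S-byte window. */
static uint32_t hash_chunk(const unsigned char *p)
{
        uint32_t h = 0;
        int i;
        for (i = 0; i < S; i++)
                h = h * B + p[i];
        return h;
}

/* Index every non-overlapping S-byte chunk of a created (postimage) file. */
static void index_created_file(const unsigned char *buf, size_t len, uint32_t dst)
{
        size_t off;
        for (off = 0; off + S <= len; off += S) {
                struct fp_entry *e = malloc(sizeof(*e));
                e->fp = hash_chunk(buf + off);
                e->dst = dst;
                e->next = table[e->fp % TABLE_SIZE];
                table[e->fp % TABLE_SIZE] = e;
        }
}

/*
 * One pass over a deleted (preimage) file: slide a rolling hash over
 * every byte offset and credit each created file whose indexed chunk
 * matches.  score[] has one counter per created file; calling this
 * once per deleted file yields the match counts for its M pairs.
 */
static void scan_deleted_file(const unsigned char *buf, size_t len,
                              unsigned long *score)
{
        uint32_t h, bpow = 1;
        size_t i;

        if (len < S)
                return;
        for (i = 0; i < S - 1; i++)
                bpow *= B;                      /* B^(S-1), used to drop the oldest byte */
        h = hash_chunk(buf);
        for (i = 0; ; i++) {
                struct fp_entry *e;
                for (e = table[h % TABLE_SIZE]; e; e = e->next)
                        if (e->fp == h)
                                score[e->dst]++;        /* numerator of Mf for this <src,dst> */
                if (i + S >= len)
                        break;
                h = (h - buf[i] * bpow) * B + buf[i + S];  /* roll the window by one byte */
        }
}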

By pushing the scores into a priority queue you'll get the top K = max(N, M) scores in the end. The computational complexity is O(D + L + (M x N) log K) and the memory requirement is O(L).
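
A bounded min-heap is one way to do that: the root is the worst entry kept, so a candidate only enters the heap if it beats the root. A generic sketch (names made up for illustration):

struct pair_score {
        int src, dst;
        double score;
};

struct topk {
        struct pair_score *heap;  /* min-heap on score; heap[0] is the worst kept */
        int size, cap;            /* current size and K */
};

/* Offer one <src,dst> score; keeps only the K best seen so far, O(log K) each. */
static void topk_push(struct topk *t, struct pair_score s)
{
        int i;

        if (t->size == t->cap) {
                if (s.score <= t->heap[0].score)
                        return;                 /* not better than the worst kept entry */
                t->heap[0] = s;                 /* replace the root and sift down */
                i = 0;
                for (;;) {
                        int l = 2 * i + 1, r = l + 1, min = i;
                        if (l < t->size && t->heap[l].score < t->heap[min].score)
                                min = l;
                        if (r < t->size && t->heap[r].score < t->heap[min].score)
                                min = r;
                        if (min == i)
                                break;
                        struct pair_score tmp = t->heap[i];
                        t->heap[i] = t->heap[min];
                        t->heap[min] = tmp;
                        i = min;
                }
                return;
        }
        i = t->size++;                          /* heap not full yet: append and sift up */
        t->heap[i] = s;
        while (i && t->heap[(i - 1) / 2].score > t->heap[i].score) {
                struct pair_score tmp = t->heap[i];
                t->heap[i] = t->heap[(i - 1) / 2];
                t->heap[(i - 1) / 2] = tmp;
                i = (i - 1) / 2;
        }
}

Pushing all M x N scores through it costs O((M x N) log K), and the queue itself only needs O(K) memory, which matches the bound above.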

You can compute the renames for the entire Linux source tree (24K files, 260MB total uncompressed) this way using only about 92MB of memory in 18 seconds (limited by hash lookup speed, assuming 15M lookups per second based on my past experience).

The estimate is based on a fingerprint size of 64 bytes and a 2.4GHz Core 2 Duo class Intel CPU; YMMV. One can trade accuracy for less memory by using a larger fingerprint size, and vice versa.
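
(Back-of-the-envelope: one rolling-hash lookup per byte of the deleted side is roughly 260M lookups, and 260M / 15M lookups per second is about 17-18 seconds. On the memory side, 260MB / 64 is roughly 4M fingerprints at 8 bytes each, about 33MB of raw entries, with the rest of the ~92MB going to hash table and other bookkeeping overhead.)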

__Luke

