Re: [PATCH] Optimize rename detection for a huge diff

On Jan 29, 2008, at 10:57 PM, Luke Lu wrote:
On Jan 29, 2008, at 8:40 PM, Junio C Hamano wrote:
When there are N deleted paths and M created paths, we used to
allocate (N x M) "struct diff_score" that record how similar
each <src,dst> pair is, pick the pair that gives the best match
first, and then go on to process worse matches.

This sorting is done so that when two new files in the postimage
are similar to the same file deleted from the preimage, we can
process the more similar one first, and when processing the
second one, it can notice "Ah, the source I was planning to say
I am a copy of is already taken by somebody else" and go on to
match itself with another file in the preimage with a lesser
match.  This matters to a change introduced between the 1.5.3.X
series and 1.5.4-rc, which lets the code favor unused matches
first and then fall back to using already used matches.

This instead allocates and keeps only a handful of rename source
candidates per new file in the postimage.  I.e., it reduces the
memory requirement from O(N x M) to O(M).

For each dst, we compute similarity with all sources (i.e. the
number of similarity estimate computations is still O(N x M)),
but we keep only a handful of the best src candidates for each dst.
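
Roughly, the per-dst bookkeeping could look something like this
(a minimal sketch; the candidate limit and the struct/function
names are made up for illustration, not the actual diffcore-rename
code):

#define NUM_CANDIDATES 4  /* hypothetical per-dst limit, not git's actual value */

struct candidate {
        int src;    /* index of a deleted (preimage) path; slots start as { -1, 0 } */
        int score;  /* similarity estimate; higher is better */
};

/*
 * Keep, for a single dst, only the NUM_CANDIDATES best sources seen
 * so far.  cand[] is kept sorted, best first, so the table over all
 * dsts stays O(M) in size even though O(N x M) scores are computed.
 */
static void record_if_better(struct candidate *cand, int src, int score)
{
        int i;

        if (score <= cand[NUM_CANDIDATES - 1].score)
                return;                         /* worse than everything kept so far */
        for (i = NUM_CANDIDATES - 1; i > 0 && cand[i - 1].score < score; i--)
                cand[i] = cand[i - 1];          /* shift worse entries down */
        cand[i].src = src;
        cand[i].score = score;
}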

I can think of cases where you'll throw away better candidates this way. How about using a priority queue of size max(N, M)?

I don't know the details of the current algorithm, but it seems to me that a naive Rabin-Karp fingerprinting approach would not use too much memory. Say L is the total number of bytes in the created files, S is the fingerprint size, and the hash is 4 bytes. To keep track of M files you only need to keep 8 * (L/S + M) bytes (the extra 4 bytes per entry are an index into the file names, and the M entries cover the file names themselves), plus some overhead for the hash table. With one pass through D (the total number of bytes in the deleted files) you can get the N x M scores. The score is defined as Wf * Mf + Wt, where Wf is the weight for the fingerprint match and Wt is the weight for the title match; Mf is the fingerprint match score, i.e. (number of matching fingerprints) / (number of fingerprints of the original (deleted) file). Wf and Wt can be tuned to boost exact basename matches.
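
To make that concrete, a rough sketch of the two passes might look like the following (the names, hash function, and fixed table size are all made up for illustration; it assumes non-overlapping S-byte chunks on the created side and a rolling window over every byte offset on the deleted side):

#include <stdint.h>
#include <stdlib.h>

#define S 64                  /* fingerprint (chunk) size in bytes */
#define B 0x01000193u         /* hash base; arithmetic wraps mod 2^32 */
#define TABLE_SIZE (1u << 20) /* toy fixed-size table; real code would size it to L/S */

struct fp_entry {
        uint32_t fp;           /* 4-byte fingerprint of one S-byte chunk */
        uint32_t dst;          /* 4-byte index of the created file it came from */
        struct fp_entry *next;
};
static struct fp_entry *table[TABLE_SIZE];

/* Polynomial hash of one S-byte window. */
static uint32_t hash_chunk(const unsigned char *p)
{
        uint32_t h = 0;
        int i;
        for (i = 0; i < S; i++)
                h = h * B + p[i];
        return h;
}

/* Index every non-overlapping S-byte chunk of a created (postimage) file. */
static void index_created_file(const unsigned char *buf, size_t len, uint32_t dst)
{
        size_t off;
        for (off = 0; off + S <= len; off += S) {
                struct fp_entry *e = malloc(sizeof(*e));
                e->fp = hash_chunk(buf + off);
                e->dst = dst;
                e->next = table[e->fp % TABLE_SIZE];
                table[e->fp % TABLE_SIZE] = e;
        }
}

/*
 * One pass over a deleted (preimage) file: slide a rolling hash over
 * every byte offset and credit each created file whose indexed chunk
 * matches.  score[] has one counter per created file; calling this
 * once per deleted file yields the match counts for its M pairs.
 */
static void scan_deleted_file(const unsigned char *buf, size_t len,
                              unsigned long *score)
{
        uint32_t h, bpow = 1;
        size_t i;

        if (len < S)
                return;
        for (i = 0; i < S - 1; i++)
                bpow *= B;                      /* B^(S-1), used to drop the oldest byte */
        h = hash_chunk(buf);
        for (i = 0; ; i++) {
                struct fp_entry *e;
                for (e = table[h % TABLE_SIZE]; e; e = e->next)
                        if (e->fp == h)
                                score[e->dst]++;        /* numerator of Mf for this <src,dst> */
                if (i + S >= len)
                        break;
                h = (h - buf[i] * bpow) * B + buf[i + S];  /* roll the window by one byte */
        }
}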

By pushing the scores into a priority queue you'll get the top K = max(N, M) scores in the end. The computational complexity is O(D + L + (M x N) log K) and the memory requirement is O(L).
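
A bounded min-heap is one way to do that: the root is the worst entry kept, so a candidate only enters the heap if it beats the root. A generic sketch (names made up for illustration):

struct pair_score {
        int src, dst;
        double score;
};

struct topk {
        struct pair_score *heap;  /* min-heap on score; heap[0] is the worst kept */
        int size, cap;            /* current size and K */
};

/* Offer one <src,dst> score; keeps only the K best seen so far, O(log K) each. */
static void topk_push(struct topk *t, struct pair_score s)
{
        int i;

        if (t->size == t->cap) {
                if (s.score <= t->heap[0].score)
                        return;                 /* not better than the worst kept entry */
                t->heap[0] = s;                 /* replace the root and sift down */
                i = 0;
                for (;;) {
                        int l = 2 * i + 1, r = l + 1, min = i;
                        if (l < t->size && t->heap[l].score < t->heap[min].score)
                                min = l;
                        if (r < t->size && t->heap[r].score < t->heap[min].score)
                                min = r;
                        if (min == i)
                                break;
                        struct pair_score tmp = t->heap[i];
                        t->heap[i] = t->heap[min];
                        t->heap[min] = tmp;
                        i = min;
                }
                return;
        }
        i = t->size++;                          /* heap not full yet: append and sift up */
        t->heap[i] = s;
        while (i && t->heap[(i - 1) / 2].score > t->heap[i].score) {
                struct pair_score tmp = t->heap[i];
                t->heap[i] = t->heap[(i - 1) / 2];
                t->heap[(i - 1) / 2] = tmp;
                i = (i - 1) / 2;
        }
}

Pushing all M x N scores through it costs O((M x N) log K), and the queue itself only needs O(K) memory, which matches the bound above.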

You can compute the renames for the entire Linux source tree (24K files, 260MB total uncompressed) this way using only about 92MB of memory in 18 seconds (limited by hash lookup speed, assuming 15M lookups per second based on my past experience).

The estimate is based on a fingerprint size of 64 bytes and a 2.4GHz Core 2 Duo class Intel CPU; YMMV. One can trade accuracy for less memory by using a larger fingerprint size, and vice versa.
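
(Back-of-the-envelope: one rolling-hash lookup per byte of the deleted side is roughly 260M lookups, and 260M / 15M lookups per second is about 17-18 seconds. On the memory side, 260MB / 64 is roughly 4M fingerprints at 8 bytes each, about 33MB of raw entries, with the rest of the ~92MB going to hash table and other bookkeeping overhead.)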

__Luke

