On Jan 29, 2008, at 8:40 PM, Junio C Hamano wrote:
When there are N deleted paths and M created paths, we used to allocate (N x M) "struct diff_score" entries that record how similar each pair is, picked the <src,dst> pair that gives the best match first, and then went on to process worse matches. This sorting is done so that when two new files in the postimage are similar to the same file deleted from the preimage, we can process the more similar one first, and when processing the second one it can notice "Ah, the source I was planning to say I am a copy of is already taken by somebody else" and go on to match itself with another file in the preimage with a lesser match. This matters to a change introduced between the 1.5.3.X series and 1.5.4-rc, which lets the code favor unused matches first and then fall back to using already used matches.

This instead allocates and keeps only a handful of rename source candidates per new file in the postimage, i.e. it reduces the memory requirement from O(N x M) to O(M). For each dst, we compute similarity with all sources (i.e. the number of similarity estimate computations is still O(N x M)), but we keep only a handful of the best src candidates for each dst.
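If I understand the new approach correctly, the per-destination bookkeeping is something like the following (just my own sketch, names and constants invented, not the actual diffcore-rename code; assumes the slots start out zeroed):

/* Keep only a few best-scoring rename sources per destination
 * instead of the full N x M score matrix. */

#define NUM_CANDIDATES 4   /* the "handful" of sources kept per dst */

struct candidate {
	int src;    /* index into the deleted-path list */
	int score;  /* similarity estimate, higher is better */
};

struct dst_slot {
	struct candidate best[NUM_CANDIDATES];
};

/* Offer (src, score) to dst; keep it only if it beats the current
 * worst of the handful we already hold. */
static void consider(struct dst_slot *dst, int src, int score)
{
	int i, worst = 0;

	for (i = 1; i < NUM_CANDIDATES; i++)
		if (dst->best[i].score < dst->best[worst].score)
			worst = i;
	if (score > dst->best[worst].score) {
		dst->best[worst].src = src;
		dst->best[worst].score = score;
	}
}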
I can think of cases where you'll throw away better candidates this way. How about using a priority queue of size max(N, M)?
I don't know the details of the current algorithm, but it seems to me that a naive Rabin-Karp fingerprinting approach would not use too much memory: say L is the total number of bytes of the created files, the fingerprint size is S, and each hash is 4 bytes. To keep track of M files you only need to keep about 8 * (L/S + M) bytes (each entry being a 4-byte hash plus an additional 4 bytes as an index into the file names; the M term covers the filenames), plus some overhead for the hash table, in memory. With one pass through D (the number of bytes of the deleted files) you can get the N x M scores. The score is defined as Wf * Mf + Wt, where Wf is the weight for the fingerprint match and Wt is the weight for the title match; Mf is the fingerprint match score = (number of matching fingerprints) / (number of fingerprints of the original (deleted) file). Wf and Wt can be tuned to boost exact basename matches.
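Roughly, the fingerprinting side would look something like this (window size S, 8 bytes per table entry; constants and names are made up for illustration, this isn't git code):

#include <stdint.h>
#include <stddef.h>

#define S 64           /* fingerprint window size in bytes */
#define RK_BASE 257u   /* polynomial base for the rolling hash */

struct fingerprint {
	uint32_t hash;     /* 4-byte fingerprint */
	uint32_t file_idx; /* 4-byte index into the created-file list */
};

/* Hash of one S-byte window: h = b[0]*B^(S-1) + ... + b[S-1], mod 2^32. */
static uint32_t rk_hash(const unsigned char *p)
{
	uint32_t h = 0;
	size_t i;

	for (i = 0; i < S; i++)
		h = h * RK_BASE + p[i];
	return h;
}

/* Roll the window one byte forward: drop 'out', append 'in'.
 * 'pow' must be RK_BASE^(S-1), precomputed mod 2^32. */
static uint32_t rk_roll(uint32_t h, unsigned char out, unsigned char in,
			uint32_t pow)
{
	return (h - out * pow) * RK_BASE + in;
}

/* Emit one fingerprint per non-overlapping S-byte chunk of a created
 * file; roughly len/S entries of 8 bytes each go into the table. */
static size_t fingerprint_file(const unsigned char *buf, size_t len,
			       uint32_t file_idx, struct fingerprint *out)
{
	size_t i, n = 0;

	for (i = 0; i + S <= len; i += S) {
		out[n].hash = rk_hash(buf + i);
		out[n].file_idx = file_idx;
		n++;
	}
	return n;
}

The scoring pass would then slide rk_roll() over each deleted file, look every hash up in a table keyed on fingerprint, and bump a per-(deleted, created) match counter; that is the single pass through D mentioned above.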
By pushing the scores into a priority queue of size K = max(N, M), you get the final top K scores at the end. The computational complexity is really O(D + L + (M x N) log K), and the memory requirement is O(L).
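The priority queue I have in mind is just a bounded min-heap keyed on score, roughly like this (again a sketch with invented names; the caller allocates entry[] with capacity K):

#include <stddef.h>

struct rename_score {
	int src, dst;
	int score;
};

struct topk_heap {
	struct rename_score *entry; /* array of capacity K */
	size_t len, cap;            /* cap == K */
};

static void sift_down(struct topk_heap *h, size_t i)
{
	for (;;) {
		size_t l = 2 * i + 1, r = l + 1, min = i;
		struct rename_score tmp;

		if (l < h->len && h->entry[l].score < h->entry[min].score)
			min = l;
		if (r < h->len && h->entry[r].score < h->entry[min].score)
			min = r;
		if (min == i)
			break;
		tmp = h->entry[i];
		h->entry[i] = h->entry[min];
		h->entry[min] = tmp;
		i = min;
	}
}

/* Push a score; once the heap is full, keep the new entry only if it
 * beats the current minimum.  Each push is O(log K). */
static void topk_push(struct topk_heap *h, struct rename_score s)
{
	if (h->len < h->cap) {
		size_t i = h->len++;

		h->entry[i] = s;
		while (i && h->entry[(i - 1) / 2].score > h->entry[i].score) {
			struct rename_score tmp = h->entry[i];

			h->entry[i] = h->entry[(i - 1) / 2];
			h->entry[(i - 1) / 2] = tmp;
			i = (i - 1) / 2;
		}
	} else if (s.score > h->entry[0].score) {
		h->entry[0] = s;
		sift_down(h, 0);
	}
}

When all (dst, src) scores have been pushed, the heap holds the K best matches, which can then be assigned in descending score order just as the current sort does.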
You can compute the renames for the entire Linux source tree (24K files, 260MB total uncompressed) this way using only about 92MB of memory in 18 seconds (limited by hash lookup speed, assuming 15M lookups per second based on my past experience).
--
Luke