On Jan 29, 2008, at 10:57 PM, Luke Lu wrote:
On Jan 29, 2008, at 8:40 PM, Junio C Hamano wrote:
When there are N deleted paths and M created paths, we used to
allocate (N x M) "struct diff_score" that record how similar
each of the pair is, and picked the <src,dst> pair that gives
the best match first, and then went on to process worse matches.
This sorting is done so that when two new files in the postimage
that are similar to the same file deleted from the preimage, we
can process the more similar one first, and when processing the
second one, it can notice "Ah, the source I was planning to say
I am a copy of is already taken by somebody else" and continue
on to match itself with another file in the preimage with a
lesser match. This matters because of a change introduced between
the 1.5.3.X series and 1.5.4-rc, which lets the code favor unused
matches first and then fall back to using already used matches.
This instead allocates and keeps only a handful of rename source
candidates per new file in the postimage, i.e. it reduces the
memory requirement from O(N x M) to O(M).
For each dst, we compute similarity with all sources (i.e. the
number of similarity estimate computations is still O(N x M)),
but we keep only a handful of the best src candidates for each dst.
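If I read the new code right, the bookkeeping amounts to something
like the following (a rough sketch of my own, not the actual
diffcore-rename code): all N x M similarities are still computed
and offered, but only the best few per destination are remembered,
so the memory stays O(M).

#include <string.h>

#define NUM_CANDIDATE_PER_DST 4	/* "a handful" of sources kept per dst */

struct candidate {
	int src;	/* index of the deleted (preimage) path */
	int score;	/* similarity estimate, higher is better */
};

struct dst_slot {	/* one per created (postimage) path, zero-initialized */
	struct candidate best[NUM_CANDIDATE_PER_DST];
	int nr;
};

static void record_candidate(struct dst_slot *slot, int src, int score)
{
	int i;

	/* find where this score would rank among the kept candidates */
	for (i = 0; i < slot->nr; i++)
		if (score > slot->best[i].score)
			break;
	if (i >= NUM_CANDIDATE_PER_DST)
		return;		/* worse than everything we already keep */
	if (slot->nr < NUM_CANDIDATE_PER_DST)
		slot->nr++;
	/* shift the worse entries down, dropping the worst one if full */
	memmove(slot->best + i + 1, slot->best + i,
		(slot->nr - i - 1) * sizeof(struct candidate));
	slot->best[i].src = src;
	slot->best[i].score = score;
}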
I can think of cases where you'll throw away better candidates this
way. How about using a priority queue of size max(N, M)?
I don't know the details of the current algorithm, but it seems
to me that a naive Rabin-Karp fingerprinting approach would not
use too much memory. Say L is the total number of bytes of the
created files, S is the fingerprint size, and the hash is 4 bytes.
To keep track of M files you only need about 8 bytes per entry
(the 4-byte hash plus 4 more bytes as an index into the file
names) times (L/S + M, the extra M covering the file names
themselves), plus some overhead for the hash table in memory. In
one pass through D (the number of bytes of the deleted files) you
can get the N x M scores. The score is defined as Wf * Mf + Wt,
where Wf is the weight of the fingerprint match, Wt is the weight
of the title (basename) match, and Mf is the fingerprint match
score = (number of matching fingerprints) / (number of
fingerprints of the original (deleted) file). Wf and Wt can be
tuned to boost exact basename matches.
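Glossing over the rolling-hash details, the two passes might look
roughly like this. The sketch below is only an illustration of the
shape, not the real thing: it uses fixed non-overlapping 64-byte
chunks and an FNV-1a hash as a stand-in for a true Rabin-Karp
rolling fingerprint, and an arbitrarily sized chained hash table.

#include <stdint.h>
#include <stdlib.h>

#define CHUNK 64			/* fingerprint size S */

struct fp_entry {
	uint32_t hash;			/* 4-byte fingerprint hash */
	uint32_t dst;			/* 4-byte index of the created file */
	struct fp_entry *next;
};

#define TABLE_SIZE (1u << 20)
static struct fp_entry *table[TABLE_SIZE];

/* FNV-1a as a simple stand-in for a real rolling fingerprint */
static uint32_t fingerprint(const unsigned char *p, size_t len)
{
	uint32_t h = 2166136261u;
	while (len--)
		h = (h ^ *p++) * 16777619u;
	return h;
}

/* First pass: index every chunk of created file 'dst'. */
static void index_created(const unsigned char *buf, size_t len, uint32_t dst)
{
	size_t off;

	for (off = 0; off + CHUNK <= len; off += CHUNK) {
		struct fp_entry *e = malloc(sizeof(*e));
		uint32_t b;

		e->hash = fingerprint(buf + off, CHUNK);
		e->dst = dst;
		b = e->hash % TABLE_SIZE;
		e->next = table[b];
		table[b] = e;
	}
}

/*
 * Second pass: for one deleted file, bump matches[dst] for every
 * chunk whose fingerprint also appears in created file 'dst'; the
 * caller then computes Mf = matches[dst] / (len / CHUNK) and the
 * final score Wf * Mf + Wt.
 */
static void score_deleted(const unsigned char *buf, size_t len,
			  unsigned *matches /* one counter per dst */)
{
	size_t off;

	for (off = 0; off + CHUNK <= len; off += CHUNK) {
		uint32_t h = fingerprint(buf + off, CHUNK);
		struct fp_entry *e;

		for (e = table[h % TABLE_SIZE]; e; e = e->next)
			if (e->hash == h)
				matches[e->dst]++;
	}
}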
By pushing the scores into a priority queue you'll get the final
top K (K = max(N, M)) scores in the end. The computational
complexity is O(D + L + (M x N) log K) and the memory requirement
is O(L).
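A bounded min-heap of size K is what gives the log K factor per
pushed score; roughly like this (again only a sketch; the caller
allocates heap[] with alloc = K and nr = 0):

struct scored_pair {
	int src, dst;
	int score;
};

struct topk {
	struct scored_pair *heap;	/* min-heap on score; heap[0] is the worst kept */
	int nr, alloc;			/* alloc == K */
};

static void sift_down(struct topk *t, int i)
{
	for (;;) {
		int l = 2 * i + 1, r = 2 * i + 2, min = i;
		struct scored_pair tmp;

		if (l < t->nr && t->heap[l].score < t->heap[min].score)
			min = l;
		if (r < t->nr && t->heap[r].score < t->heap[min].score)
			min = r;
		if (min == i)
			return;
		tmp = t->heap[i];
		t->heap[i] = t->heap[min];
		t->heap[min] = tmp;
		i = min;
	}
}

static void topk_push(struct topk *t, struct scored_pair p)
{
	if (t->nr < t->alloc) {
		/* heap not yet full: append and sift the new entry up */
		int i = t->nr++;

		t->heap[i] = p;
		while (i && t->heap[(i - 1) / 2].score > t->heap[i].score) {
			struct scored_pair tmp = t->heap[i];

			t->heap[i] = t->heap[(i - 1) / 2];
			t->heap[(i - 1) / 2] = tmp;
			i = (i - 1) / 2;
		}
	} else if (p.score > t->heap[0].score) {
		/* better than the worst of the kept K: replace the root */
		t->heap[0] = p;
		sift_down(t, 0);
	}
}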
You can compute the renames for the entire Linux source tree
(24K files, 260MB total uncompressed) this way using only about
92MB of memory in 18 seconds (limited by hash lookup speed,
assuming 15M lookups per second based on my past experience).
The estimate assumes a fingerprint size of 64 bytes and a 2.4GHz
Core 2 Duo class Intel CPU, YMMV. One can trade accuracy for less
memory by using a larger fingerprint size, and vice versa.
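To spell out the arithmetic (roughly, and assuming the rolling
fingerprint is looked up at every byte offset of the deleted
data): 260MB / 64B is about 4M fingerprints, or about 32MB of
8-byte entries; chaining pointers and the bucket array can
roughly triple that, which is where the ~92MB comes from. On the
time side, ~260M lookups at 15M lookups per second is about
17-18 seconds.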
__Luke