Re: Linear time/space rename logic for *inexact* case

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 10/22/07, Andy C <andychup@xxxxxxxxx> wrote:
> So the algorithm is:

I think I can make this a lot clearer than I did, while glossing over
some details and the line_threshold parameter.

1) Make a "left index" and a "right index" out of the 2 sets of files,
{ line => [list of docs] }.

2) Remove any lines that appear in more than one doc from the left
index.  Do the same for the right index.  (this corresponds to
line_threshold=1 case)

3) For all lines, if the line appears in *both* the left index and the
right index, increment the count of the (row=doc from left set,
column=doc from right set) entry in the similarity matrix by 1.  The
matrix is represented by a hash of 2-tuples => counts.

After this is done for all lines, then the matrix is sparsely filled
with the count of common lines between every pair of files in the 2
sets.  The vast majority of cells in the matrix are implicitly 0 and
thus consume neither memory nor CPU with the hash table representation
of matrix.

4) Then you can use this to compute similarity scores.

Hopefully that is more clear... though I guess it might not be obvious
that it works for the problem that git has.  I am fairly sure it does,
but the demo should allow us to evaluate that.

Andy
-
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux