On 10/22/07, Andy C <andychup@xxxxxxxxx> wrote: > So the algorithm is: I think I can make this a lot clearer than I did, while glossing over some details and the line_threshold parameter. 1) Make a "left index" and a "right index" out of the 2 sets of files, { line => [list of docs] }. 2) Remove any lines that appear in more than one doc from the left index. Do the same for the right index. (this corresponds to line_threshold=1 case) 3) For all lines, if the line appears in *both* the left index and the right index, increment the count of the (row=doc from left set, column=doc from right set) entry in the similarity matrix by 1. The matrix is represented by a hash of 2-tuples => counts. After this is done for all lines, then the matrix is sparsely filled with the count of common lines between every pair of files in the 2 sets. The vast majority of cells in the matrix are implicitly 0 and thus consume neither memory nor CPU with the hash table representation of matrix. 4) Then you can use this to compute similarity scores. Hopefully that is more clear... though I guess it might not be obvious that it works for the problem that git has. I am fairly sure it does, but the demo should allow us to evaluate that. Andy - To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html