Elijah Newren <newren@xxxxxxxxx> writes:

> 3) It uses a similarity measure that diverges from what researchers
> used for MinHash and other fancy algorithms. In particular,
>
> size(A intersect B) / size(A union B) != size(A intersect B) /
> max(size(A), size(B))
>
> The formula on the right hand side would mean that if file A is a
> subset of file B, say the first 10% of file B, then it will be treated
> as 100% similar when most humans would look at it and say it is only
> 10% similar.

If you are talking about "if you start from a 100-line file and
append 900 lines of your own, you still have 100% of the original
material remaining in the file", it is quite deliberate that we used
it as an indication that the original "100 lines" file is a good
candidate to have been renamed to the resulting "1000 lines" file.
It is a "what you have kept from the original" measure.

Of course, taken to the extreme, this means that rename detection
does not have to be symmetrical. "diff A B" may find that the
original 100-line file in A has grown into a 1000-line file
elsewhere in B, but "diff B A" or "diff -R A B" would not necessarily
pair these two blobs as matching.

> Maybe the performance gains I'm adding elsewhere will offset possible
> grumpy users?

Users being what they are, that will never happen. When they have
something to complain about, they will, regardless of what else you
do.
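
To make the asymmetry discussed above concrete, here is a minimal sketch in Python. It is not git's actual implementation: it treats a file as a set of distinct lines, and the function names (jaccard, over_max, kept_from_original) are made up for illustration. The third function is only one possible reading of the "what you have kept from the original" measure, assuming it means the fraction of the original's lines that survive in the result.

```python
def jaccard(a, b):
    """size(A intersect B) / size(A union B), the left-hand formula."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def over_max(a, b):
    """size(A intersect B) / max(size(A), size(B)), the right-hand formula."""
    a, b = set(a), set(b)
    biggest = max(len(a), len(b))
    return len(a & b) / biggest if biggest else 1.0


def kept_from_original(original, result):
    """Assumed reading of "what you have kept from the original":
    the fraction of the original's lines still present in the result."""
    original, result = set(original), set(result)
    return len(original & result) / len(original) if original else 1.0


# A 100-line file, and the same file with 900 new lines appended.
small = [f"line {i}" for i in range(100)]
large = small + [f"new line {i}" for i in range(900)]

print("jaccard            :", jaccard(small, large))
print("over_max           :", over_max(small, large))
print("kept (small->large):", kept_from_original(small, large))
print("kept (large->small):", kept_from_original(large, small))
```

In this toy setup the two symmetric formulas give the same answer whichever file comes first, while kept_from_original depends on the direction: everything from the 100-line file is still in the 1000-line file, but only a tenth of the 1000-line file is in the 100-line one. That directional dependence is the kind of asymmetry between "diff A B" and "diff -R A B" described above.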