Re: [PATCH v2 0/2] Optimization batch 6: make full use of exact renames

Junio C Hamano <gitster@xxxxxxxxx> · Wed, 03 Feb 2021 15:26:13 -0800

Elijah Newren <newren@xxxxxxxxx> writes:

> 3) It uses a similarity measure that diverges from what researches
> used for MinHash and other fancy algorithms.  In particular,
>
>    size(A intersect B) / size(A union B)  != size(A intersect B) /
> max(size(A), size(B))
>
> The formula on the right hand side would mean that if file A is a
> subset of file B, say the first 10% of file B, then it will be treated
> as 100% similar when most humans would look at it and say it is only
> 10% similar.

If you are talking about "you start from 100 lines file and appended
900 lines of your own, then you still have 100% of the original
material remaining in the file", it is quite deliberate that we used
it as an indication that the original "100 lines" file is a good
candidate to have been renamed to the resulting "1000 lines" file.
It is "what you have kept from the original" measure.

Of course, taken to the extreme, this means that rename does not
have to be symmetrical.  "diff A B" may find that the original
100-line file in A has grown into 1000-line file in B elsewhere, but
"diff B A" or "diff -R A B" would not necessarily pair these two
blobs as matching.

> Maybe the performance gains I'm adding elsewhere will offset possible
> grumpy users?

Users, as they are, it would never happen.  When they have something
to complain about, they will, regardless of what else you do.