Re: similarity index vs. whitespaces

Chris Torek <chris.torek@xxxxxxxxx> · Wed, 12 Apr 2023 19:35:58 -0700

On Wed, Apr 12, 2023 at 7:01 PM Mike Hommey <mh@xxxxxxxxxxxx> wrote:
[example of Python script diff snipped]
> From a human perspective 33% similarity feels way too low here. I know
> it's essentially counting lines in the diff, but that feels limited.

Technically, the similarity index isn't based on lines.  Instead
it's based on byte-by-byte matching, broken into segments (these
do use LF or CRLF to break up segments as appropriate but very
*long* lines get broken up without such line terminators).  This
shares some code with the delta compression algorithm used to
pack objects.  One of the goals here is to consider a lot of
*moved* lines to be "very similar", despite such moves generally
producing large-ish diffs.
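
As a rough illustration of the idea above (not Git's actual code), here
is a minimal Python sketch.  The names `segment_weights`, `similarity`
and `MAX_CHUNK`, the 64-byte cap, and the "common bytes over larger
size" score are all assumptions made for the example.

```python
from collections import Counter

MAX_CHUNK = 64  # assumed cap on segment length, for illustration only


def segment_weights(data: bytes) -> Counter:
    """Map each segment to the total number of bytes it accounts for."""
    weights = Counter()
    start = 0
    while start < len(data):
        nl = data.find(b"\n", start)
        end = len(data) if nl == -1 else nl + 1
        line = data[start:end]
        start = end
        # Treat CRLF like LF, so the CR does not count as a difference.
        if line.endswith(b"\r\n"):
            line = line[:-2] + b"\n"
        # Very long lines are cut into pieces without a terminator.
        for i in range(0, len(line), MAX_CHUNK):
            piece = line[i:i + MAX_CHUNK]
            weights[piece] += len(piece)
    return weights


def similarity(old: bytes, new: bytes) -> float:
    """Bytes in segments common to both sides, over the larger size."""
    a, b = segment_weights(old), segment_weights(new)
    common = sum((a & b).values())  # & keeps the per-segment minimum
    largest = max(sum(a.values()), sum(b.values()))
    return 1.0 if largest == 0 else common / largest


old = b"a = 1\nb = 2\nc = 3\n"
moved = b"c = 3\na = 1\nb = 2\n"  # same lines, different order
score = similarity(old, moved)
```

Because segments are matched as an unordered collection, reordering
lines does not lower the score in this sketch, even though a line-based
diff of the two inputs would not be empty.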

CRs are, if I recall correctly, discarded during similarity index
computation.  It would be somewhat easy to discard leading white
space -- and even easier to discard *all* white space, but I'd
suggest that would be wrong -- due to the special case already
in place for line terminators here.
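
To make the effect of discarding leading white space concrete, here is
a small, hypothetical Python sketch.  It is a simplification and not
Git's implementation: it splits on line terminators only, and the
`strip_leading_ws` flag and both function names are invented for the
example.

```python
from collections import Counter


def line_weights(data: bytes, strip_leading_ws: bool) -> Counter:
    """Map each line to the total number of bytes it accounts for."""
    weights = Counter()
    for line in data.splitlines(keepends=True):
        if strip_leading_ws:
            # Hypothetical option: ignore indentation before matching.
            line = line.lstrip(b" \t")
        weights[line] += len(line)
    return weights


def similarity(old: bytes, new: bytes, strip_leading_ws: bool = False) -> float:
    """Bytes in lines common to both sides, over the larger size."""
    a = line_weights(old, strip_leading_ws)
    b = line_weights(new, strip_leading_ws)
    common = sum((a & b).values())
    largest = max(sum(a.values()), sum(b.values()))
    return 1.0 if largest == 0 else common / largest


# Two statements that get wrapped in a new "if" block.
old = b"x = 1\ny = 2\n"
new = b"if flag:\n    x = 1\n    y = 2\n"

plain = similarity(old, new)
stripped = similarity(old, new, strip_leading_ws=True)
```

With exact matching, none of the re-indented lines match their
originals, so `plain` comes out as zero; with indentation stripped,
both original lines match and `stripped` is well above that.  The same
stripping would of course also hide indentation changes that carry
meaning.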

It's not clear to me that discarding leading white space would
be correct in *all* cases, but it does seem very appropriate for
Python code.  Whether the earlier discussion about diff algorithms
being adjusted based on .gitattributes entries might apply here,
I'll leave for others to argue about. :-)  (That is, I'm just
tossing it out as an idea for the mix.)

Chris