On Wed, Apr 12, 2023 at 7:01 PM Mike Hommey <mh@xxxxxxxxxxxx> wrote:

[example of Python script diff snipped]

> From a human perspective 33% similarity feels way too low here. I know
> it's essentially counting lines in the diff, but that feels limited.

Technically, the similarity index isn't based on lines. Instead it's based on byte-by-byte matching, broken into segments (these do use LF or CRLF to break up segments as appropriate, but very *long* lines get broken up even without such line terminators). This shares some code with the delta compression algorithm used to pack objects. One of the goals here is to consider a lot of *moved* lines to be "very similar", despite such moves generally producing large-ish diffs.

CRs are, if I recall correctly, discarded during similarity index computation. It would be somewhat easy to discard leading white space -- and even easier to discard *all* white space, but I'd suggest that would be wrong -- due to the special case already in place for line terminators here.

It's not clear to me that discarding leading white space would be correct in *all* cases, but it does seem very appropriate for Python code. Whether the earlier discussion about diff algorithms being adjusted based on .gitattributes entries might apply here, I'll leave for others to argue about. :-) (That is, I'm just tossing it out as an idea for the mix.)

Chris
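
P.S. For anyone who wants to poke at the idea, here is a rough Python model of the segment matching described above. It is a sketch under assumptions, not Git's actual code: the names `chunks` and `similarity` are made up for illustration, the 64-byte cap on segment length is my assumption, and the segments are compared directly rather than hashed.

```python
from collections import Counter

MAX_CHUNK = 64  # assumed cap on segment length for this sketch


def chunks(data: bytes, max_chunk: int = MAX_CHUNK) -> Counter:
    """Map each distinct segment to the number of bytes it accounts for.

    A segment ends at an LF, or once it reaches max_chunk bytes.
    A CR that is immediately followed by an LF is dropped.
    """
    counts = Counter()
    buf = bytearray()
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        i += 1
        if b == 0x0D and i < n and data[i] == 0x0A:
            continue  # discard the CR of a CRLF pair
        buf.append(b)
        if b == 0x0A or len(buf) >= max_chunk:
            counts[bytes(buf)] += len(buf)
            buf.clear()
    if buf:  # trailing segment without a line terminator
        counts[bytes(buf)] += len(buf)
    return counts


def similarity(old: bytes, new: bytes) -> float:
    """Bytes in segments common to both sides, over the larger side's size."""
    a, b = chunks(old), chunks(new)
    shared = sum(min(a[k], b[k]) for k in a.keys() & b.keys())
    biggest = max(sum(a.values()), sum(b.values()))
    return shared / biggest if biggest else 1.0


old = b"a = 1\nb = 2\nc = 3\n"
moved = b"c = 3\na = 1\nb = 2\n"
indented = b"    a = 1\n    b = 2\n    c = 3\n"

print(similarity(old, moved))     # segment order does not matter
print(similarity(old, indented))  # every segment differs once indented
```

In this model a pure reordering of lines scores as fully similar, while re-indenting every line leaves no segment in common. That is the Python case above in miniature, and it is why stripping leading white space before segmenting would change the result so much.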