Re: [PATCH 3/3] diffcore-rename: guide inexact rename detection based on basenames

Junio C Hamano <gitster@xxxxxxxxx> · Sun, 07 Feb 2021 11:51:05 -0800

Derrick Stolee <stolee@xxxxxxxxx> writes:

> Before I get too deep into reviewing these patches, I do want
> to make it clear that the speed-up is coming at the cost of
> a behavior change. We are restricting the "best match" search
> to be first among files with common base name (although maybe
> I would use 'suffix'?). If we search for a rename among all
> additions and deletions ending the ".txt" we might find a
> similarity match that is 60% and declare that a rename, even
> if there is a ".txt" -> ".md" pair that has a 70% match.

Yes, my initial reaction to the idea was that "yuck, our rename
detection lost its purity".  diffcore-rename strived to base its
decision purely on content similarity, primarily because it is one
of the oldest part of Git where the guiding principle has always
been that the content is the king.  I think my aversion to the "all
of my neighbors are relocating, so should I move to the same place"
(aka "directory rename") comes from a similar place, but in a sense
this was worse.

At least, until I got over the initial bump.  I do not think the
suffix match is necessarily a bad idea, but it adds more "magically
doing a wrong thing" failure modes (e.g. the ".txt" to ".md" example
would probably have more variants that impact the real life
projects; ".C" vs ".cc" vs ".cxx" vs ".cpp" immediately comes to
mind), and a tool that silently does a wrong thing because it uses
more magic would be a tool that is hard to explain why it did the
wrong thing when it does.

> This could be documented in a test case, to demonstrate that
> we are making this choice explicitly.

Yes.  I wonder if we can solve it by requiring a lot better than
minimum match when trying the "suffix match" first, or something?

Provided if we agree that it is a good idea to insert this between
"exact contents match" and "full matrix", I have one question to
Elijah on what the code does.

To me, it seems that the "full matrix" part still uses the remaining
src and dst candidates fully.  But if "A.txt" and "B.txt" are still
surviving in the src/dst at that stage, shouldn't we be saying that
"no way these can be similar enough---we've checked in the middle
stage where only the ones with the same suffix are considered and
this pair didn't turn into a rename"?

Thanks.