Re: [PATCH 3/3] diffcore-rename: guide inexact rename detection based on basenames

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On Sun, Feb 7, 2021 at 11:51 AM Junio C Hamano <gitster@xxxxxxxxx> wrote:
>
> Derrick Stolee <stolee@xxxxxxxxx> writes:
>
> > Before I get too deep into reviewing these patches, I do want
> > to make it clear that the speed-up is coming at the cost of
> > a behavior change. We are restricting the "best match" search
> > to be first among files with common base name (although maybe
> > I would use 'suffix'?). If we search for a rename among all
> > additions and deletions ending the ".txt" we might find a
> > similarity match that is 60% and declare that a rename, even
> > if there is a ".txt" -> ".md" pair that has a 70% match.
>
> Yes, my initial reaction to the idea was that "yuck, our rename
> detection lost its purity".  diffcore-rename strived to base its
> decision purely on content similarity, primarily because it is one
> of the oldest part of Git where the guiding principle has always
> been that the content is the king.  I think my aversion to the "all
> of my neighbors are relocating, so should I move to the same place"
> (aka "directory rename") comes from a similar place, but in a sense
> this was worse.
>
> At least, until I got over the initial bump.  I do not think the
> suffix match is necessarily a bad idea, but it adds more "magically
> doing a wrong thing" failure modes (e.g. the ".txt" to ".md" example
> would probably have more variants that impact the real life
> projects; ".C" vs ".cc" vs ".cxx" vs ".cpp" immediately comes to
> mind), and a tool that silently does a wrong thing because it uses
> more magic would be a tool that is hard to explain why it did the
> wrong thing when it does.

Stolee explained a new algorithm different than what I have proposed,
I think based on the apparent different meanings of "basename" that
exist.  I tried to clarify that in response to his email, but I wanted
to clarify one additional thing here too:

diffcore-rename has not in the past based its decision solely on
content similarity.  It only does that when break detection is on.
Otherwise, 0% content similarity is trumped by sufficient filename
similarity (namely, with a filename similarity of 100%).  If the
filename similarity wasn't sufficiently high (anything less than an
exact match), then it completely ignored filename similarity and
looked only at content similarity.  It thus jumped from one extreme to
another.

My optimization is adding an in-between state.  When the basename (the
part of the path excluding the leading directory) matches the basename
of another file (and those basenames are unique on each side), then
compare content similarity and mark the files as a rename if the two
are sufficiently similar.  It is thus a position that considers both
filename similarity (basename match) and content similarity together.

> > This could be documented in a test case, to demonstrate that
> > we are making this choice explicitly.
>
> Yes.  I wonder if we can solve it by requiring a lot better than
> minimum match when trying the "suffix match" first, or something?

This may still be a useful idea, and was something I had considered,
but more in the context of more generic filename similarity
comparisons.  We could still discuss it even when basenames match, but
basenames matching seems strong enough to me that I wasn't sure extra
configuration knobs were warranted.

> Provided if we agree that it is a good idea to insert this between
> "exact contents match" and "full matrix", I have one question to
> Elijah on what the code does.
>
> To me, it seems that the "full matrix" part still uses the remaining
> src and dst candidates fully.  But if "A.txt" and "B.txt" are still
> surviving in the src/dst at that stage, shouldn't we be saying that
> "no way these can be similar enough---we've checked in the middle
> stage where only the ones with the same suffix are considered and
> this pair didn't turn into a rename"?

This is a very good point.  A.txt and B.txt will not have been
compared previously since their basenames do not match, but the basic
idea is still possible.  For example, A.txt could have been compared
to source/some-module/A.txt.  And I don't do anything in the final
"full matrix" stage to avoid re-comparing those two files again.
However, it is worth noting that A.txt will have been compared to at
most one other file, not N files.  And thus while we are wasting some
re-comparisons, it is at most O(N) duplicated comparisons, not O(N^2).
I thought about that, but decided to not bother, based on the
following thinking:

1) The most expensive comparison is the first one, because when we do
that one, we first have to populate the list of integers that lines in
the file hash to.  Subsequent comparisons are relatively cheap since
this list of integers has already been computed.

2) This would only save us from at most N comparisons in the N x M
matrix (since no file in this optimization is compared to more than
one other)

3) Checking if two files have previously been compared requires more
code, in what is already a tight nested loop.  My experience
attempting to modify that tight loop for extra conditions (e.g. don't
compare files that are too large), is that it's easy to accidentally
make the code slower.  In fact, this is in part what led to the
addition of the remove_unneed_paths_from_src() function.

4) There were plenty of other interesting ideas and maybe I was a tad lazy.  :-)

I think removing these already-compared cases could be done, but I
just avoided it.  If we were to do the "attempt to match files with
the same extension" optimization that Stolee outlines/invents above,
then we'd definitely need to consider it.  Otherwise, it's just a
minor additional optimization that someone could add to my patches.



[Index of Archives]     [Linux Kernel Development]     [Gcc Help]     [IETF Annouce]     [DCCP]     [Netdev]     [Networking]     [Security]     [V4L]     [Bugtraq]     [Yosemite]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux SCSI]     [Fedora Users]

  Powered by Linux