Re: [PATCH 3/3] diffcore-rename: guide inexact rename detection based on basenames

Elijah Newren <newren@xxxxxxxxx> · Mon, 8 Feb 2021 08:25:00 -0800

On Mon, Feb 8, 2021 at 3:44 AM Derrick Stolee <stolee@xxxxxxxxx> wrote:
>
> On 2/8/2021 3:38 AM, Elijah Newren wrote:
> > On Sun, Feb 7, 2021 at 11:51 AM Junio C Hamano <gitster@xxxxxxxxx> wrote:
> >>
> >> Derrick Stolee <stolee@xxxxxxxxx> writes:
> >>
> >>> Before I get too deep into reviewing these patches, I do want
> >>> to make it clear that the speed-up is coming at the cost of
> >>> a behavior change. We are restricting the "best match" search
> >>> to be first among files with common base name (although maybe
> >>> I would use 'suffix'?). If we search for a rename among all
> >>> additions and deletions ending the ".txt" we might find a
> >>> similarity match that is 60% and declare that a rename, even
> >>> if there is a ".txt" -> ".md" pair that has a 70% match.
> >>
> >> Yes, my initial reaction to the idea was that "yuck, our rename
> >> detection lost its purity".  diffcore-rename strived to base its
> >> decision purely on content similarity, primarily because it is one
> >> of the oldest part of Git where the guiding principle has always
> >> been that the content is the king.  I think my aversion to the "all
> >> of my neighbors are relocating, so should I move to the same place"
> >> (aka "directory rename") comes from a similar place, but in a sense
> >> this was worse.
> >>
> >> At least, until I got over the initial bump.  I do not think the
> >> suffix match is necessarily a bad idea, but it adds more "magically
> >> doing a wrong thing" failure modes (e.g. the ".txt" to ".md" example
> >> would probably have more variants that impact the real life
> >> projects; ".C" vs ".cc" vs ".cxx" vs ".cpp" immediately comes to
> >> mind), and a tool that silently does a wrong thing because it uses
> >> more magic would be a tool that is hard to explain why it did the
> >> wrong thing when it does.
> >
> > Stolee explained a new algorithm different than what I have proposed,
>
> Yes, sorry for adding noise. The point stands that we are changing
> the behavior in some cases, so that must be agreed upon. What you
> are _actually_ proposing is a much smaller change than I thought,
> but it is still worth pointing out the behavior change.

Again, no need to apologize; if my commit messages weren't clear, they
need to be fixed.  I am much more surprised here by your repeated
point that it's worth pointing out the behavior change.  I totally
agree with that, and it's why I spent two paragraphs on the second
commit message explicitly covering this and listing four enumerated
reasons to argue for the change.  So, I thought I had done that, but
it apparently isn't very clear...and I'm left wondering how to clarify
it further.  So, a question for you: How should I change it?  Should I
just modify the third commit message to re-highlight that it does
change behavior and refer to the second commit message for details?
Because repeating the point is the only way I can think of to make it
clearer.  Is there anything else I can or should do?

> > I think based on the apparent different meanings of "basename" that
> > exist.  I tried to clarify that in response to his email, but I wanted
> > to clarify one additional thing here too:
> >> diffcore-rename has not in the past based its decision solely on
> > content similarity.  It only does that when break detection is on.
> > Otherwise, 0% content similarity is trumped by sufficient filename
> > similarity (namely, with a filename similarity of 100%).  If the
> > filename similarity wasn't sufficiently high (anything less than an
> > exact match), then it completely ignored filename similarity and
> > looked only at content similarity.  It thus jumped from one extreme to
> > another.
>
> This idea of optimizing first for 100% filename similarity is a
> good perspective on Git's rename detection algorithm. The canonical
> example of this 100% filename similarity is a rename cycle:
>
>         A -> B
>         B -> C
>         C -> A
>
> Even if the OIDs are distinct and exactly match across these renames,
> we see that there are no adds or deletes, so we do not even trigger
> rename detection and report A, B, and C as edited instead.
>
> A "rename path" (not cycle) such as:
>
>         A -> B
>         B -> C
>
> does trigger rename detection, but B will never be considered. Instead,
> "A -> C" will be checked for similarity to see if it is within the
> threshold.
>
> Of course, I am _not_ advocating that we change this behavior. These
> situations are incredibly rare and we should not sacrifice performance
> in the typical case to handle them.
>
> > My optimization is adding an in-between state.  When the basename (the
> > part of the path excluding the leading directory) matches the basename
> > of another file (and those basenames are unique on each side), then
> > compare content similarity and mark the files as a rename if the two
> > are sufficiently similar.  It is thus a position that considers both
> > filename similarity (basename match) and content similarity together.
> >
> >>> This could be documented in a test case, to demonstrate that
> >>> we are making this choice explicitly.
> >>
> >> Yes.  I wonder if we can solve it by requiring a lot better than
> >> minimum match when trying the "suffix match" first, or something?
> >
> > This may still be a useful idea, and was something I had considered,
> > but more in the context of more generic filename similarity
> > comparisons.  We could still discuss it even when basenames match, but
> > basenames matching seems strong enough to me that I wasn't sure extra
> > configuration knobs were warranted.
>
> I think this is a complication that we might not want to add to the
> heuristic, at least not at first. We might want to have a follow-up
> that adjusts that value to be higher. A natural way would be through
> a config option, so users can select something incredibly high like
> 99%. Another option would be to take a minimum that is halfway between
> the existing similarity minimum and 100%.
>
> >> Provided if we agree that it is a good idea to insert this between
> >> "exact contents match" and "full matrix", I have one question to
> >> Elijah on what the code does.
> >>
> >> To me, it seems that the "full matrix" part still uses the remaining
> >> src and dst candidates fully.  But if "A.txt" and "B.txt" are still
> >> surviving in the src/dst at that stage, shouldn't we be saying that
> >> "no way these can be similar enough---we've checked in the middle
> >> stage where only the ones with the same suffix are considered and
> >> this pair didn't turn into a rename"?
> >
> > This is a very good point.  A.txt and B.txt will not have been
> > compared previously since their basenames do not match, but the basic
> > idea is still possible.  For example, A.txt could have been compared
> > to source/some-module/A.txt.  And I don't do anything in the final
> > "full matrix" stage to avoid re-comparing those two files again.
> > However, it is worth noting that A.txt will have been compared to at
> > most one other file, not N files.  And thus while we are wasting some
> > re-comparisons, it is at most O(N) duplicated comparisons, not O(N^2).
> > I thought about that, but decided to not bother, based on the
> > following thinking:
> >
> > 1) The most expensive comparison is the first one, because when we do
> > that one, we first have to populate the list of integers that lines in
> > the file hash to.  Subsequent comparisons are relatively cheap since
> > this list of integers has already been computed.
> >
> > 2) This would only save us from at most N comparisons in the N x M
> > matrix (since no file in this optimization is compared to more than
> > one other)
> >
> > 3) Checking if two files have previously been compared requires more
> > code, in what is already a tight nested loop.  My experience
> > attempting to modify that tight loop for extra conditions (e.g. don't
> > compare files that are too large), is that it's easy to accidentally
> > make the code slower.  In fact, this is in part what led to the
> > addition of the remove_unneed_paths_from_src() function.
>
> Even storing a single bit to say "these were already compared" takes
> quadratic space. The hope is to not have quadratic behavior if it
> can be avoided.
>
> > 4) There were plenty of other interesting ideas and maybe I was a tad lazy.  :-)
> >
> > I think removing these already-compared cases could be done, but I
> > just avoided it.  If we were to do the "attempt to match files with
> > the same extension" optimization that Stolee outlines/invents above,
> > then we'd definitely need to consider it.  Otherwise, it's just a
> > minor additional optimization that someone could add to my patches.
>
> The more I think about it, the less my idea makes sense. I'm sorry
> for adding noise to the thread.
>
> Thanks,
> -Stolee