Re: [PATCH 3/3] diffcore-rename: guide inexact rename detection based on basenames

On 2/8/2021 3:38 AM, Elijah Newren wrote:
> On Sun, Feb 7, 2021 at 11:51 AM Junio C Hamano <gitster@xxxxxxxxx> wrote:
>>
>> Derrick Stolee <stolee@xxxxxxxxx> writes:
>>
>>> Before I get too deep into reviewing these patches, I do want
>>> to make it clear that the speed-up is coming at the cost of
>>> a behavior change. We are restricting the "best match" search
>>> to be first among files with common base name (although maybe
>>> I would use 'suffix'?). If we search for a rename among all
>>> additions and deletions ending in ".txt" we might find a
>>> similarity match that is 60% and declare that a rename, even
>>> if there is a ".txt" -> ".md" pair that has a 70% match.
>>
>> Yes, my initial reaction to the idea was that "yuck, our rename
>> detection lost its purity".  diffcore-rename strived to base its
>> decision purely on content similarity, primarily because it is one
>> of the oldest part of Git where the guiding principle has always
>> been that the content is the king.  I think my aversion to the "all
>> of my neighbors are relocating, so should I move to the same place"
>> (aka "directory rename") comes from a similar place, but in a sense
>> this was worse.
>>
>> At least, until I got over the initial bump.  I do not think the
>> suffix match is necessarily a bad idea, but it adds more "magically
>> doing a wrong thing" failure modes (e.g. the ".txt" to ".md" example
>> would probably have more variants that impact the real life
>> projects; ".C" vs ".cc" vs ".cxx" vs ".cpp" immediately comes to
>> mind), and a tool that silently does a wrong thing because it uses
>> more magic would be a tool that is hard to explain why it did the
>> wrong thing when it does.
> 
> Stolee explained a new algorithm different than what I have proposed,

Yes, sorry for adding noise. The point stands that we are changing
the behavior in some cases, so that must be agreed upon. What you
are _actually_ proposing is a much smaller change than I thought,
but it is still worth pointing out the behavior change.

> I think based on the apparent different meanings of "basename" that
> exist.  I tried to clarify that in response to his email, but I wanted
> to clarify one additional thing here too:
> diffcore-rename has not in the past based its decision solely on
> content similarity.  It only does that when break detection is on.
> Otherwise, 0% content similarity is trumped by sufficient filename
> similarity (namely, with a filename similarity of 100%).  If the
> filename similarity wasn't sufficiently high (anything less than an
> exact match), then it completely ignored filename similarity and
> looked only at content similarity.  It thus jumped from one extreme to
> another.

This idea of optimizing first for 100% filename similarity is a
good perspective on Git's rename detection algorithm. The canonical
example of this 100% filename similarity is a rename cycle:

	A -> B
	B -> C
	C -> A

Even though the OIDs are distinct and each file's content matches its
renamed counterpart exactly, there are no adds or deletes, so rename
detection is never triggered and A, B, and C are reported as edited
instead.
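
To see this concretely, here is a throwaway demo (the repository name
and file contents are invented for illustration):

```shell
# A rename cycle produces no adds or deletes, so rename detection
# (-M) never runs and every path is reported as modified (M).
git init -q cycle-demo
git -C cycle-demo config user.email demo@example.com
git -C cycle-demo config user.name demo
printf 'alpha\n' > cycle-demo/A
printf 'beta\n'  > cycle-demo/B
printf 'gamma\n' > cycle-demo/C
git -C cycle-demo add A B C
git -C cycle-demo commit -qm initial
# Rotate the contents: A -> B, B -> C, C -> A
printf 'gamma\n' > cycle-demo/A
printf 'alpha\n' > cycle-demo/B
printf 'beta\n'  > cycle-demo/C
git -C cycle-demo commit -aqm rotate
git -C cycle-demo diff -M --name-status HEAD~1 HEAD
```

This prints three `M` lines and no `R` (rename) lines, even though
every blob survived the commit intact.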

A "rename path" (not cycle) such as:

	A -> B
	B -> C

does trigger rename detection, but B will never be considered. Instead,
"A -> C" will be checked for similarity to see if it is within the
threshold.
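
A hedged shell sketch of this case (file names and contents invented;
`seq` is used so that A and B are mostly, but not exactly, similar):

```shell
# After A -> B and B -> C, path B exists on both sides, so rename
# detection only ever considers the net delete/add pair (A, C).
git init -q path-demo
git -C path-demo config user.email demo@example.com
git -C path-demo config user.name demo
seq 1 100 > path-demo/A
seq 1 90  > path-demo/B
git -C path-demo add A B
git -C path-demo commit -qm initial
git -C path-demo mv B C    # B -> C (first, so the name B is free)
git -C path-demo mv A B    # A -> B
git -C path-demo commit -qm 'shift names along the path'
git -C path-demo diff -M --name-status HEAD~1 HEAD
```

Here old A (lines 1..100) is compared against new C (lines 1..90,
i.e. old B's content), which is similar enough to produce an `R`
line, while B is reported as modified; the exact content match
between old A and new B is never even considered.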

Of course, I am _not_ advocating that we change this behavior. These
situations are incredibly rare and we should not sacrifice performance
in the typical case to handle them.

> My optimization is adding an in-between state.  When the basename (the
> part of the path excluding the leading directory) matches the basename
> of another file (and those basenames are unique on each side), then
> compare content similarity and mark the files as a rename if the two
> are sufficiently similar.  It is thus a position that considers both
> filename similarity (basename match) and content similarity together.
> 
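For readers following along, that in-between state can be illustrated
with a toy sketch (the path lists are invented, and this is not Git's
actual code, which additionally requires each basename to be unique
among the deletes and unique among the adds):

```shell
# Toy illustration: pair up deleted/added paths whose basenames match
# and send only those pairs to the content-similarity check.
deleted='docs/readme.txt src/old/util.c notes.txt'
added='Documentation/readme.txt lib/util.c summary.md'
for d in $deleted; do
  for a in $added; do
    if [ "$(basename "$d")" = "$(basename "$a")" ]; then
      echo "compare content: $d -> $a"
    fi
  done
done
```

Only the two matching-basename pairs are compared; notes.txt and
summary.md fall through to the full N x M matrix as before.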
>>> This could be documented in a test case, to demonstrate that
>>> we are making this choice explicitly.
>>
>> Yes.  I wonder if we can solve it by requiring a lot better than
>> minimum match when trying the "suffix match" first, or something?
> 
> This may still be a useful idea, and was something I had considered,
> but more in the context of more generic filename similarity
> comparisons.  We could still discuss it even when basenames match, but
> basenames matching seems strong enough to me that I wasn't sure extra
> configuration knobs were warranted.

I think this is a complication that we might not want to add to the
heuristic, at least not at first. We might want to have a follow-up
that adjusts that value to be higher. A natural way would be through
a config option, so users can select something incredibly high like
99%. Another option would be to take a minimum that is halfway between
the existing similarity minimum and 100%.
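
For concreteness, in Git's internal scoring (diffcore.h scales 100%
to MAX_SCORE, 60000, and the default rename threshold
DEFAULT_RENAME_SCORE is 30000, i.e. 50%), that halfway point works
out to:

```shell
# Halfway between the default rename minimum (50%) and 100%,
# in Git's internal score units:
min=30000   # DEFAULT_RENAME_SCORE, i.e. 50%
max=60000   # MAX_SCORE, i.e. 100%
echo $(( (min + max) / 2 ))   # 45000, i.e. 75%
```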

>> Provided if we agree that it is a good idea to insert this between
>> "exact contents match" and "full matrix", I have one question to
>> Elijah on what the code does.
>>
>> To me, it seems that the "full matrix" part still uses the remaining
>> src and dst candidates fully.  But if "A.txt" and "B.txt" are still
>> surviving in the src/dst at that stage, shouldn't we be saying that
>> "no way these can be similar enough---we've checked in the middle
>> stage where only the ones with the same suffix are considered and
>> this pair didn't turn into a rename"?
> 
> This is a very good point.  A.txt and B.txt will not have been
> compared previously since their basenames do not match, but the basic
> idea is still possible.  For example, A.txt could have been compared
> to source/some-module/A.txt.  And I don't do anything in the final
> "full matrix" stage to avoid re-comparing those two files again.
> However, it is worth noting that A.txt will have been compared to at
> most one other file, not N files.  And thus while we are wasting some
> re-comparisons, it is at most O(N) duplicated comparisons, not O(N^2).
> I thought about that, but decided to not bother, based on the
> following thinking:
> 
> 1) The most expensive comparison is the first one, because when we do
> that one, we first have to populate the list of integers that lines in
> the file hash to.  Subsequent comparisons are relatively cheap since
> this list of integers has already been computed.
> 
> 2) This would only save us from at most N comparisons in the N x M
> matrix (since no file in this optimization is compared to more than
> one other)
> 
> 3) Checking if two files have previously been compared requires more
> code, in what is already a tight nested loop.  My experience
> attempting to modify that tight loop for extra conditions (e.g. don't
> compare files that are too large), is that it's easy to accidentally
> make the code slower.  In fact, this is in part what led to the
> addition of the remove_unneed_paths_from_src() function.

Even storing a single bit to say "these were already compared" takes
quadratic space. The hope is to avoid quadratic behavior whenever it
can be avoided.
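
To put a number on it (the repository size is invented for
illustration): with 30,000 deletes and 30,000 adds, a one-bit
"already compared" matrix alone costs:

```shell
# 30000 x 30000 pairs, one bit each, expressed in bytes:
echo $(( 30000 * 30000 / 8 ))   # 112500000 bytes, ~107 MiB
```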

> 4) There were plenty of other interesting ideas and maybe I was a tad lazy.  :-)
> 
> I think removing these already-compared cases could be done, but I
> just avoided it.  If we were to do the "attempt to match files with
> the same extension" optimization that Stolee outlines/invents above,
> then we'd definitely need to consider it.  Otherwise, it's just a
> minor additional optimization that someone could add to my patches.

The more I think about it, the less my idea makes sense. I'm sorry
for adding noise to the thread.

Thanks,
-Stolee


