On Mon, Feb 8, 2021 at 3:44 AM Derrick Stolee <stolee@xxxxxxxxx> wrote: > > On 2/8/2021 3:38 AM, Elijah Newren wrote: > > On Sun, Feb 7, 2021 at 11:51 AM Junio C Hamano <gitster@xxxxxxxxx> wrote: > >> > >> Derrick Stolee <stolee@xxxxxxxxx> writes: > >> > >>> Before I get too deep into reviewing these patches, I do want > >>> to make it clear that the speed-up is coming at the cost of > >>> a behavior change. We are restricting the "best match" search > >>> to be first among files with common base name (although maybe > >>> I would use 'suffix'?). If we search for a rename among all > >>> additions and deletions ending the ".txt" we might find a > >>> similarity match that is 60% and declare that a rename, even > >>> if there is a ".txt" -> ".md" pair that has a 70% match. > >> > >> Yes, my initial reaction to the idea was that "yuck, our rename > >> detection lost its purity". diffcore-rename strived to base its > >> decision purely on content similarity, primarily because it is one > >> of the oldest part of Git where the guiding principle has always > >> been that the content is the king. I think my aversion to the "all > >> of my neighbors are relocating, so should I move to the same place" > >> (aka "directory rename") comes from a similar place, but in a sense > >> this was worse. > >> > >> At least, until I got over the initial bump. I do not think the > >> suffix match is necessarily a bad idea, but it adds more "magically > >> doing a wrong thing" failure modes (e.g. the ".txt" to ".md" example > >> would probably have more variants that impact the real life > >> projects; ".C" vs ".cc" vs ".cxx" vs ".cpp" immediately comes to > >> mind), and a tool that silently does a wrong thing because it uses > >> more magic would be a tool that is hard to explain why it did the > >> wrong thing when it does. > > > > Stolee explained a new algorithm different than what I have proposed, > > Yes, sorry for adding noise. The point stands that we are changing > the behavior in some cases, so that must be agreed upon. What you > are _actually_ proposing is a much smaller change than I thought, > but it is still worth pointing out the behavior change. Again, no need to apologize; if my commit messages weren't clear, they need to be fixed. I am much more surprised here by your repeated point that it's worth pointing out the behavior change. I totally agree with that, and it's why I spent two paragraphs on the second commit message explicitly covering this and listing four enumerated reasons to argue for the change. So, I thought I had done that, but it apparently isn't very clear...and I'm left wondering how to clarify it further. So, a question for you: How should I change it? Should I just modify the third commit message to re-highlight that it does change behavior and refer to the second commit message for details? Because repeating the point is the only way I can think of to make it clearer. Is there anything else I can or should do? > > I think based on the apparent different meanings of "basename" that > > exist. I tried to clarify that in response to his email, but I wanted > > to clarify one additional thing here too: > >> diffcore-rename has not in the past based its decision solely on > > content similarity. It only does that when break detection is on. > > Otherwise, 0% content similarity is trumped by sufficient filename > > similarity (namely, with a filename similarity of 100%). If the > > filename similarity wasn't sufficiently high (anything less than an > > exact match), then it completely ignored filename similarity and > > looked only at content similarity. It thus jumped from one extreme to > > another. > > This idea of optimizing first for 100% filename similarity is a > good perspective on Git's rename detection algorithm. The canonical > example of this 100% filename similarity is a rename cycle: > > A -> B > B -> C > C -> A > > Even if the OIDs are distinct and exactly match across these renames, > we see that there are no adds or deletes, so we do not even trigger > rename detection and report A, B, and C as edited instead. > > A "rename path" (not cycle) such as: > > A -> B > B -> C > > does trigger rename detection, but B will never be considered. Instead, > "A -> C" will be checked for similarity to see if it is within the > threshold. > > Of course, I am _not_ advocating that we change this behavior. These > situations are incredibly rare and we should not sacrifice performance > in the typical case to handle them. > > > My optimization is adding an in-between state. When the basename (the > > part of the path excluding the leading directory) matches the basename > > of another file (and those basenames are unique on each side), then > > compare content similarity and mark the files as a rename if the two > > are sufficiently similar. It is thus a position that considers both > > filename similarity (basename match) and content similarity together. > > > >>> This could be documented in a test case, to demonstrate that > >>> we are making this choice explicitly. > >> > >> Yes. I wonder if we can solve it by requiring a lot better than > >> minimum match when trying the "suffix match" first, or something? > > > > This may still be a useful idea, and was something I had considered, > > but more in the context of more generic filename similarity > > comparisons. We could still discuss it even when basenames match, but > > basenames matching seems strong enough to me that I wasn't sure extra > > configuration knobs were warranted. > > I think this is a complication that we might not want to add to the > heuristic, at least not at first. We might want to have a follow-up > that adjusts that value to be higher. A natural way would be through > a config option, so users can select something incredibly high like > 99%. Another option would be to take a minimum that is halfway between > the existing similarity minimum and 100%. > > >> Provided if we agree that it is a good idea to insert this between > >> "exact contents match" and "full matrix", I have one question to > >> Elijah on what the code does. > >> > >> To me, it seems that the "full matrix" part still uses the remaining > >> src and dst candidates fully. But if "A.txt" and "B.txt" are still > >> surviving in the src/dst at that stage, shouldn't we be saying that > >> "no way these can be similar enough---we've checked in the middle > >> stage where only the ones with the same suffix are considered and > >> this pair didn't turn into a rename"? > > > > This is a very good point. A.txt and B.txt will not have been > > compared previously since their basenames do not match, but the basic > > idea is still possible. For example, A.txt could have been compared > > to source/some-module/A.txt. And I don't do anything in the final > > "full matrix" stage to avoid re-comparing those two files again. > > However, it is worth noting that A.txt will have been compared to at > > most one other file, not N files. And thus while we are wasting some > > re-comparisons, it is at most O(N) duplicated comparisons, not O(N^2). > > I thought about that, but decided to not bother, based on the > > following thinking: > > > > 1) The most expensive comparison is the first one, because when we do > > that one, we first have to populate the list of integers that lines in > > the file hash to. Subsequent comparisons are relatively cheap since > > this list of integers has already been computed. > > > > 2) This would only save us from at most N comparisons in the N x M > > matrix (since no file in this optimization is compared to more than > > one other) > > > > 3) Checking if two files have previously been compared requires more > > code, in what is already a tight nested loop. My experience > > attempting to modify that tight loop for extra conditions (e.g. don't > > compare files that are too large), is that it's easy to accidentally > > make the code slower. In fact, this is in part what led to the > > addition of the remove_unneed_paths_from_src() function. > > Even storing a single bit to say "these were already compared" takes > quadratic space. The hope is to not have quadratic behavior if it > can be avoided. > > > 4) There were plenty of other interesting ideas and maybe I was a tad lazy. :-) > > > > I think removing these already-compared cases could be done, but I > > just avoided it. If we were to do the "attempt to match files with > > the same extension" optimization that Stolee outlines/invents above, > > then we'd definitely need to consider it. Otherwise, it's just a > > minor additional optimization that someone could add to my patches. > > The more I think about it, the less my idea makes sense. I'm sorry > for adding noise to the thread. > > Thanks, > -Stolee