Elijah Newren <newren@xxxxxxxxx> writes: > I can change that. I can also simplify it further to > > if (0 <= (dst_index = (strintmap_get(&dests, base)))) { > > since dests uses a default value of -1. That will decrease the number > of strmap lookups here from 2 to 1. Which would be a real win, unlike what I said in the message you are responding to. >> It feels incompatible with the spirit of these two steps aim for >> (i.e. only use this optimization on a pair of src/dst with UNIQUE >> basenames). For the purpose of "we only handle unique ones", the >> paths that already have matched should participate in deciding if >> the files that survived "exact" phase have unique basename among >> the original inpu? > > Yeah, I should have been more careful with my wording. Stated a > different way, what confidence can we associate with an exact rename? Suppose you start with a/Makefile, b/Makefile and c/Makefile and then they all disappeared while a1/Makefile, b1/Makefile and c1/Makefile now are in the tree. The contents a/Makefile used to have appears without any difference in a1/Makefile, the same for b and b1, but c/Makefile and c1/Makefile are different. The c vs c1 pair may worth investigating, so it goes through the "same basename" phase. Now, in a slightly different situation, a vs a1 are still identical, but b vs b1 have only one blank line removal but without any other change. It looks odd that such a change has to pessimize c vs c1 optimization opportunity, but an interesting part of the story is that we can only say "such a change", not "such a miniscule change", because we have just finished the "exact" phase, and we do not know how big a difference b vs b1 pair actually had. That makes me feel that this whole "we must treat unique one that remains specially" is being incoherent. If "because we have only small number of removed and added Makefiles spread across the trees, first full-matrix matching among them without anything else with higher bar may be worth an optimization" were the optimization, then I would understand and support the design to omit those that have already been matched in the "exact" phase. But IIRC, limiting this "same basename" phase to unique add/del pair was sold as a way to make it less likely for the heuristics to make mistakes, yet the definition of "unique", as shown above, is not all that solid. That I find it rather unsatisfactory. In other words, it is not "what confidence do we have in exact phase?" "exact" matching may have found perfect matching pair. But the found pair should be happy just between themselves, and should not have undue effect on how _other_ pairs are compared. Stopping the "exact" pair from participating in the "uniqueness" definition is placing "exact" phase too much weight to affect how other filepairs are found. > By the exact same argument, you > could take this a step further and say that we should calculate the > basenames of *all* files in the tree, not just add/delete pairs, and > only match up the ones via basename that are *truly* unique. After > all, break detection exists, so perhaps we don't have full confidence > that files with an unchanged fullname are actually related. Sorry, but you are not making sense. These optimizations are done only when we are not using copies and breaks, no? What _other_ changes that kept the paths the same, or modified in place, have any effect on matching added and deleted pairs?