On Mon, Feb 8, 2021 at 3:31 AM Derrick Stolee <stolee@xxxxxxxxx> wrote: > > On 2/8/2021 3:27 AM, Elijah Newren wrote: > > Hi, > > > > On Sun, Feb 7, 2021 at 6:38 AM Derrick Stolee <stolee@xxxxxxxxx> wrote: > >> > >> On 2/6/21 5:52 PM, Elijah Newren via GitGitGadget wrote: > >>> From: Elijah Newren <newren@xxxxxxxxx> > >>> > >>> Make use of the new find_basename_matches() function added in the last > >>> two patches, to find renames more rapidly in cases where we can match up > >>> files based on basenames. > >> > >> This is a valuable heuristic. > >> > >>> For the testcases mentioned in commit 557ac0350d ("merge-ort: begin > >>> performance work; instrument with trace2_region_* calls", 2020-10-28), > >>> this change improves the performance as follows: > >>> > >>> Before After > >>> no-renames: 13.815 s ± 0.062 s 13.138 s ± 0.086 s > >>> mega-renames: 1799.937 s ± 0.493 s 169.488 s ± 0.494 s > >>> just-one-mega: 51.289 s ± 0.019 s 5.061 s ± 0.017 s > >> > >> These numbers are very impressive. > >> > >> Before I get too deep into reviewing these patches, I do want > >> to make it clear that the speed-up is coming at the cost of > >> a behavior change. We are restricting the "best match" search > >> to be first among files with common base name (although maybe > >> I would use 'suffix'?). If we search for a rename among all > >> additions and deletions ending the ".txt" we might find a > >> similarity match that is 60% and declare that a rename, even > >> if there is a ".txt" -> ".md" pair that has a 70% match. > > > > I'm glad you all are open to possible behavioral changes, but I was > > proposing a much smaller behavioral change that is quite different > > than what you have suggested here. Perhaps my wording was poor; I > > apologize for forgetting that "basename" has different meanings in > > different contexts. Let me try again; I am not treating the filename > > extension as special in any manner here; by "basename" I just mean the > > portion of the path ignoring any leading directories. Thus > > src/foo.txt > > might be a good match against > > source/foo.txt > > but this optimization as a preliminary step would not consider > > matching src/foo.txt against any of > > source/bar.txt > > source/foo.md > > since the basenames ('bar.txt' and 'foo.md') do not match our original > > file's basename ('foo.txt'). > > > > Of course, if this preliminary optimization step fails to find another > > "foo.txt" to match src/foo.txt against (or finds more than one and > > thus doesn't compare against any of them), then the fallback inexact > > rename detection matrix might match it against either of those two > > latter paths, as it always has. > > Thank you for making it clear that I had misunderstood what the > optimization is actually doing. A much more narrow scope makes > more sense, and avoids the quadratic problem even when many files > of the same suffix are renamed. > > >> This could be documented in a test case, to demonstrate that > >> we are making this choice explicitly. > > My test is thus bogus, but you could have a similar one for > your actual optimization. Yes, good point. > >> So, in this way, we are changing the optimization function > >> that is used to determine the "best" rename available. It > >> might be good to update documentation for how we choose > >> renames: > > > > Seems reasonable; I'll add some commentary below on the rules... > > Your commentary is helpful. I look forward to reading your > carefully-written docs in the next version ;). :-) > >> i. among files with the same basename (trailer > >> after final '.') select pairs with highest > >> similarity. > > > > This is an interesting idea, but is not what I implemented. > > That's what I get for reading the commit messages quickly and > commenting on what I _think_ is going on instead of actually > reading the code carefully. Sorry about that. There's absolutely no need to apologize. If you read all three commit messages and you didn't understand the idea, then clearly there's a bug in my commit messages. Thanks for highlighting it; I'll figure out how to reword or add extra verbiage to make it clear. Something in these follow-up emails seemed to work, so I'll try to incorporate stuff from them. > > It is > > possible that your suggestion is also a useful optimization; it'd be > > hard to know without trying. However, as noted in optimization batch > > 8 that I'll be submitting later, I'm worried about having any > > optimization pre-steps doing more than O(1) comparisons per path (and > > here you suggest comparing each .txt file with all other .txt files); > > doing that can interact badly with optimization batch 9. > > Additionally, unless we do something to avoid re-comparing files again > > when doing the later all-unmatched-files-against-each-other check, > > then worst case behavior can approach twice as slow as the original > > code. > > Right. If Git decides to reorganize all of its *.c files in one > commit, we would still get quadratic behavior in rename detection. > Maybe it's not _that_ much of an improvement. > > Thanks, > -Stolee