On 2/8/2021 3:27 AM, Elijah Newren wrote: > Hi, > > On Sun, Feb 7, 2021 at 6:38 AM Derrick Stolee <stolee@xxxxxxxxx> wrote: >> >> On 2/6/21 5:52 PM, Elijah Newren via GitGitGadget wrote: >>> From: Elijah Newren <newren@xxxxxxxxx> >>> >>> Make use of the new find_basename_matches() function added in the last >>> two patches, to find renames more rapidly in cases where we can match up >>> files based on basenames. >> >> This is a valuable heuristic. >> >>> For the testcases mentioned in commit 557ac0350d ("merge-ort: begin >>> performance work; instrument with trace2_region_* calls", 2020-10-28), >>> this change improves the performance as follows: >>> >>> Before After >>> no-renames: 13.815 s ± 0.062 s 13.138 s ± 0.086 s >>> mega-renames: 1799.937 s ± 0.493 s 169.488 s ± 0.494 s >>> just-one-mega: 51.289 s ± 0.019 s 5.061 s ± 0.017 s >> >> These numbers are very impressive. >> >> Before I get too deep into reviewing these patches, I do want >> to make it clear that the speed-up is coming at the cost of >> a behavior change. We are restricting the "best match" search >> to be first among files with common base name (although maybe >> I would use 'suffix'?). If we search for a rename among all >> additions and deletions ending the ".txt" we might find a >> similarity match that is 60% and declare that a rename, even >> if there is a ".txt" -> ".md" pair that has a 70% match. > > I'm glad you all are open to possible behavioral changes, but I was > proposing a much smaller behavioral change that is quite different > than what you have suggested here. Perhaps my wording was poor; I > apologize for forgetting that "basename" has different meanings in > different contexts. Let me try again; I am not treating the filename > extension as special in any manner here; by "basename" I just mean the > portion of the path ignoring any leading directories. Thus > src/foo.txt > might be a good match against > source/foo.txt > but this optimization as a preliminary step would not consider > matching src/foo.txt against any of > source/bar.txt > source/foo.md > since the basenames ('bar.txt' and 'foo.md') do not match our original > file's basename ('foo.txt'). > > Of course, if this preliminary optimization step fails to find another > "foo.txt" to match src/foo.txt against (or finds more than one and > thus doesn't compare against any of them), then the fallback inexact > rename detection matrix might match it against either of those two > latter paths, as it always has. Thank you for making it clear that I had misunderstood what the optimization is actually doing. A much more narrow scope makes more sense, and avoids the quadratic problem even when many files of the same suffix are renamed. >> This could be documented in a test case, to demonstrate that >> we are making this choice explicitly. My test is thus bogus, but you could have a similar one for your actual optimization. >> So, in this way, we are changing the optimization function >> that is used to determine the "best" rename available. It >> might be good to update documentation for how we choose >> renames: > > Seems reasonable; I'll add some commentary below on the rules... Your commentary is helpful. I look forward to reading your carefully-written docs in the next version ;). >> i. among files with the same basename (trailer >> after final '.') select pairs with highest >> similarity. > > This is an interesting idea, but is not what I implemented. That's what I get for reading the commit messages quickly and commenting on what I _think_ is going on instead of actually reading the code carefully. Sorry about that. > It is > possible that your suggestion is also a useful optimization; it'd be > hard to know without trying. However, as noted in optimization batch > 8 that I'll be submitting later, I'm worried about having any > optimization pre-steps doing more than O(1) comparisons per path (and > here you suggest comparing each .txt file with all other .txt files); > doing that can interact badly with optimization batch 9. > Additionally, unless we do something to avoid re-comparing files again > when doing the later all-unmatched-files-against-each-other check, > then worst case behavior can approach twice as slow as the original > code. Right. If Git decides to reorganize all of its *.c files in one commit, we would still get quadratic behavior in rename detection. Maybe it's not _that_ much of an improvement. Thanks, -Stolee