On Wed, Jun 23, 2021 at 1:11 AM Elijah Newren <newren@xxxxxxxxx> wrote:
>
> On Tue, Jun 22, 2021 at 7:14 PM Derrick Stolee <stolee@xxxxxxxxx> wrote:
> >
> > On 6/22/2021 2:45 PM, Elijah Newren wrote:
> > > On Tue, Jun 22, 2021 at 9:10 AM Derrick Stolee <stolee@xxxxxxxxx> wrote:
> >
> > I want to focus on this item:
> >
> >> 2. I watched for the partial clone logic to kick in and download blobs.
> >> Some of these were inevitable: we need the blobs to resolve edit/edit
> >> conflicts. Most cases none were downloaded at all, so this series is
> >> working as advertised. There _was_ a case where the inexact rename
> >> detection requested a large list of files (~2900 in three batches) but
> >> _then_ said "inexact rename detection was skipped due to too many
> >> files". This is a case that would be nice to resolve in this series. I
> >> will try to find exactly where in the code this is being triggered and
> >> report back.
> > >
> > > This suggests perhaps that EITHER there was a real modify/delete
> > > conflict (because you have to do full rename detection to rule out
> > > that the modify/delete was part of some rename), OR that there was a
> > > renamed file modified on both sides that did not keep its original
> > > basename (because that combination is needed to bypass the various
> > > optimizations and make it fall back to full inexact rename detection).
> > > Further, in either case, there were enough adds/deletes that full
> > > inexact detection is still a bit expensive. It'd be interesting to
> > > know which case it was. What happens if you set merge.renameLimit to
> > > something higher (the default is surprisingly small)?
> >
> > The behavior I'd like to see is that the partial clone logic is not
> > run if we are going to download more than merge.renameLimit files.
> > Whatever is getting these missing blobs is earlier than the limit
> > check, but it should be after instead.
> >
> > It's particularly problematic that Git does all the work to get the
> > blobs, but then gives up and doesn't even use them for rename
> > detection.
>
> I agree with what should happen, but I'm surprised it's not already
> happening. The give up check comes from too_many_rename_candidates(),
> which is called before dpf_options.missing_object_cb is even set to
> inexact_prefetch. So I'm not sure how the fetching comes first. Is
> there a microsoft specific patch that changes the order somehow? Is
> there something I'm mis-reading in the code?

After thinking about it more, I wonder if the following is what is
happening:

* Some directory was renamed (and likely not a leaf), resulting in a
  large pile of renames (or some other change that gives lots of
  renames without changing basename)

* basename_prefetch() kicks in to download files whose basenames have
  changed (even for "irrelevant" renames)

* Basename matching is performed (which is linear in the number of
  renames, and not subject to the merge.renameLimit)

* After basename matching, there are still unmatched destinations and
  relevant sources, which would need to be handled by the quadratic
  inexact matching algorithm.

* Since there are enough renames left to trigger the renameLimit, you
  get the "inexact rename detection was skipped due to too many files"
  warning.

...and then you assumed the prefetching was for the quadratic inexact
rename detection, when in reality it was just for the linear basename
matching.
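To make that hypothesis a bit more concrete, here is a heavily
simplified sketch of the ordering I have in mind in diffcore-rename.c.
This is from memory and not the real code; apart from
too_many_rename_candidates(), dpf_options.missing_object_cb,
basename_prefetch, and inexact_prefetch (already mentioned above),
treat the names and arguments as approximations:

    /* exact (hash-based) matching first; no blob contents needed */
    find_exact_renames(options);

    /*
     * Basename matching: linear in the number of paths, and NOT
     * guarded by merge.renameLimit. This is where basename_prefetch
     * can kick in and download blobs, even for sources that later
     * turn out to be irrelevant.
     */
    dpf_options.missing_object_cb = basename_prefetch;
    find_basename_matches(options, minimum_score /* ... */);

    /* only now do we check whether to bail on inexact detection */
    if (too_many_rename_candidates(num_sources, num_destinations, options))
        goto cleanup; /* "... skipped due to too many files" */

    /*
     * Quadratic inexact detection: its prefetch callback is only
     * installed here, after the limit check, so any prefetching seen
     * before the warning cannot have come from this step.
     */
    dpf_options.missing_object_cb = inexact_prefetch;
    /* O(num_sources * num_destinations) similarity scoring follows */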
If so, there are a couple of things we could do here:

* The easy change: modify the warning, perhaps to something like
  "quadratic inexact rename detection was skipped...", to make it
  better reflect its meaning (basename matching is also an inexact
  match).

* The more complicated change: be more aggressive about not detecting
  renames via basename match for "irrelevant" sources...perhaps
  avoiding it when it involves prefetching, or maybe just when we can
  somehow determine that we'd bail on the quadratic inexact rename
  detection anyway.

The second point perhaps deserves a bit more explanation. Basename
matching has the advantage of removing both sources and destinations
from the later quadratic (O(N*M)) inexact rename detection, so it was
generally beneficial to just always do the basename matching, even if
the path wasn't relevant for either content or location reasons. But
that presumed the later quadratic inexact rename detection was going
to run rather than be skipped; it was one of those optimization
tradeoffs.

But if the file hasn't been prefetched and we're likely to skip the
quadratic inexact rename detection, then trying to do basename
matching for an irrelevant rename is wasted work, as is prefetching
the blobs in order to be able to detect that irrelevant rename. But
how do we know whether we'll skip the quadratic rename detection if we
haven't yet matched the basenames that we can, given that every
matched basename removes both a source and a destination from the
unmatched pairs?

Maybe we should reorder the basename matching as a two-pass algorithm,
where we first detect basename-matching renames for all relevant
sources, and then repeat for non-relevant sources. That would also
allow us to insert a check before the second pass: if the first pass
removed all relevant sources (meaning both that no relevant sources
were left out of basename matching and that we found a match for every
relevant source that was included), then we can exit without doing the
second pass. That might even improve the performance in cases without
prefetching. But it would mean adding another prefetch step.

After doing the above splitting, we could then add extra conditions
before the second pass, such as bailing on it if we think we'd bail on
the quadratic inexact rename detection anyway (which may still be hard
to guess). Or maybe even just bailing on the second pass if
prefetching would be involved, because we'd rather lump those all into
the quadratic inexact rename detection anyway. (A rough sketch of this
two-pass idea is included below.)

> But I'm still curious what the particular shape is for the data in
> question. What does the error say merge.renameLimit would need to be
> set to? If it's set higher, do some of the files resolve nicely (i.e.
> they really were renames modified on both sides with a different
> basename), or are they modify/delete conflicts and we're paying the
> cost of rename detection to verify they were deleted and we really do
> have a conflict? I'm curious if there's more to learn and more
> optimization potential, basically. Your repository is bigger, so
> there may be more to learn from it than from the testcases I've tried
> so far. :-)

We might have just identified a couple of additional optimization
opportunities above, neither of which is included in my pending
optimization series.
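To make the two-pass idea a bit more concrete, here is a very rough
sketch. Nothing like this exists yet; find_basename_matches_pass(),
relevant_sources_remaining(), and the enum are names I just made up
for illustration:

    enum basename_pass {
        RELEVANT_SOURCES_ONLY,  /* content- or location-relevant sources */
        IRRELEVANT_SOURCES_TOO  /* everything else */
    };

    /* hypothetical replacement for the current single basename pass */
    static void find_basename_matches_two_pass(struct diff_options *options,
                                               int minimum_score)
    {
        /* Pass 1: relevant sources only; prefetch only what this needs */
        find_basename_matches_pass(options, minimum_score,
                                   RELEVANT_SOURCES_ONLY);

        /*
         * If pass 1 matched every relevant source (and no relevant
         * source was excluded from basename matching), the final
         * result cannot depend on the remaining sources, so skip
         * pass 2 entirely, including any prefetching it would do.
         */
        if (!relevant_sources_remaining(options))
            return;

        /*
         * Pass 2: irrelevant sources. This is where the extra
         * bail-out conditions discussed above could go, e.g. skip it
         * if we expect to bail on the quadratic inexact detection
         * anyway, or if it would require prefetching.
         */
        find_basename_matches_pass(options, minimum_score,
                                   IRRELEVANT_SOURCES_TOO);
    }

The interesting part is just the check between the two passes;
everything else is the existing basename-matching logic split by
source relevance.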
It would be helpful to get more details about how frequently these
kinds of cases occur, the particular renameLimit in use, the number of
paths involved (number of unmatched sources and destinations), how
many of the sources are relevant (perhaps even broken down into
content-relevant and location-relevant), etc., as these could all help
inform the implementation and whatever tests we want to add to the
testsuite. However, some of that info is currently hard to gather.

I could probably start by adding some trace2 statistics to
diffcore-rename to print out the original number of sources and
destinations, the number matched via exact matching, the number
matched via basename matching, the number of sources removed due to
being irrelevant, and anything else I might be overlooking at the
moment, to help gather the relevant data.
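For concreteness, I am picturing something along these lines. This is
only a sketch; the helper and the stat key names are made up, though
trace2_data_intmax() is the existing trace2 API I would expect to use:

    /* in diffcore-rename.c, which would need to #include "trace2.h" */

    /*
     * Hypothetical helper to dump rename-detection statistics; the
     * exact set of counters is still to be decided.
     */
    static void trace2_rename_stats(struct repository *r,
                                    int initial_sources,
                                    int initial_destinations,
                                    int exact_matches,
                                    int basename_matches,
                                    int irrelevant_sources_removed)
    {
        trace2_data_intmax("diff", r, "rename/initial_sources",
                           initial_sources);
        trace2_data_intmax("diff", r, "rename/initial_destinations",
                           initial_destinations);
        trace2_data_intmax("diff", r, "rename/exact_matches",
                           exact_matches);
        trace2_data_intmax("diff", r, "rename/basename_matches",
                           basename_matches);
        trace2_data_intmax("diff", r, "rename/irrelevant_sources_removed",
                           irrelevant_sources_removed);
    }

With something like that in place, re-running the merge with
GIT_TRACE2_PERF or GIT_TRACE2_EVENT pointing at a file should capture
most of the numbers above without any manual instrumentation on your
end.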