Re: [PATCH v3 0/5] Optimization batch 13: partial clone optimizations for merge-ort

On Tue, Jun 22, 2021 at 7:14 PM Derrick Stolee <stolee@xxxxxxxxx> wrote:
>
> On 6/22/2021 2:45 PM, Elijah Newren wrote:
> > On Tue, Jun 22, 2021 at 9:10 AM Derrick Stolee <stolee@xxxxxxxxx> wrote:
>
> I want to focus on this item:
>
> >> 2. I watched for the partial clone logic to kick in and download blobs.
> >>    Some of these were inevitable: we need the blobs to resolve edit/edit
> >>    conflicts. In most cases none were downloaded at all, so this series is
> >>    working as advertised. There _was_ a case where the inexact rename
> >>    detection requested a large list of files (~2900 in three batches) but
> >>    _then_ said "inexact rename detection was skipped due to too many
> >>    files". This is a case that would be nice to resolve in this series. I
> >>    will try to find exactly where in the code this is being triggered and
> >>    report back.
> >
> > This suggests perhaps that EITHER there was a real modify/delete
> > conflict (because you have to do full rename detection to rule out
> > that the modify/delete was part of some rename), OR that there was a
> > renamed file modified on both sides that did not keep its original
> > basename (because that combination is needed to bypass the various
> > optimizations and make it fall back to full inexact rename detection).
> > Further, in either case, there were enough adds/deletes that full
> > inexact detection is still a bit expensive.  It'd be interesting to
> > know which case it was.  What happens if you set merge.renameLimit to
> > something higher (the default is surprisingly small)?
>
> The behavior I'd like to see is that the partial clone logic is not
> run if we are going to download more than merge.renameLimit files.
> Whatever is getting these missing blobs is earlier than the limit
> check, but it should be after instead.
>
> It's particularly problematic that Git does all the work to get the
> blobs, but then gives up and doesn't even use them for rename
> detection.

I agree with what should happen, but I'm surprised it's not already
happening.  The give-up check comes from too_many_rename_candidates(),
which is called before dpf_options.missing_object_cb is even set to
inexact_prefetch.  So I'm not sure how the fetching comes first.  Is
there a Microsoft-specific patch that changes the order somehow?  Is
there something I'm misreading in the code?
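
For concreteness, here's roughly the ordering I'm thinking of in
diffcore_rename_extended(); this is a simplified paraphrase from memory
rather than the literal source, so the exact names and arguments may be
slightly off:

    /* exact renames come first; they need no blob contents at all */
    rename_count = find_exact_renames(options);
    ...
    /* the give-up check: if we bail out here, nothing should have
     * been prefetched yet */
    if (too_many_rename_candidates(num_destinations, num_sources, options))
        goto cleanup;
    ...
    /* only after surviving that check is the prefetch callback armed */
    if (options->repo == the_repository && has_promisor_remote()) {
        dpf_options.missing_object_cb = inexact_prefetch;
        dpf_options.missing_object_data = &prefetch_options;
    }
    /* ...and only the similarity estimation below actually reads blobs */

If the downloads really do happen before the "skipped due to too many
files" message, then either that ordering isn't what's actually running
in your build, or the blobs are being fetched from some path other than
inexact_prefetch.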

> I'm happy that we download necessary blobs when there are a few
> dozen files that need inexact renames. When it gets into the
> thousands, then we jump into a different category of user experience.
>
> Having a stop-gap of rename detection limits is an important way to
> avoid huge amounts of file downloads in these huge repo cases. Users
> can always opt into a larger limit if they really do want that rename
> detection to work at such a large scale, but we still need protections
> for the vast majority of cases where a user isn't willing to pay the
> cost of downloading these blobs.

Sure, that's fair.

But I'm still curious what the particular shape is for the data in
question.  What does the error say merge.renameLimit would need to be
set to?  If it's set higher, do some of the files resolve nicely (i.e.
they really were renames modified on both sides with a different
basename), or are they modify/delete conflicts and we're paying the
cost of rename detection to verify they were deleted and we really do
have a conflict?  I'm curious if there's more to learn and more
optimization potential, basically.  Your repository is bigger, so
there may be more to learn from it than from the testcases I've tried
so far.  :-)
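
In case it's useful: to answer the renameLimit question without
permanently touching your config, you can re-run the merge with the
limit raised just for that one command.  The 30000 below is only an
arbitrary large value for illustration; use whatever the warning
message suggests:

    git -c merge.renameLimit=30000 merge <branch>

Comparing how many of those ~2900 paths then come out as cleanly
detected renames versus modify/delete conflicts would answer most of
the questions above.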


