Re: [PATCH] nfsd: fallback to sync COPY if async not possible

"NeilBrown" <neilb@xxxxxxx> · Wed, 06 Nov 2024 08:06:40 +1100

On Wed, 06 Nov 2024, Chuck Lever wrote:
> On Tue, Nov 05, 2024 at 11:32:48AM +1100, NeilBrown wrote:
> > On Tue, 05 Nov 2024, Chuck Lever wrote:
> > > On Tue, Nov 05, 2024 at 07:30:03AM +1100, NeilBrown wrote:
> > > > On Tue, 05 Nov 2024, Chuck Lever wrote:
> > > > > On Mon, Nov 04, 2024 at 03:47:42PM +1100, NeilBrown wrote:
> > > > > > 
> > > > > > An NFSv4.2 COPY request can explicitly request a synchronous copy.  If
> > > > > > that is not requested then the server is free to perform the copy
> > > > > > synchronously or asynchronously.
> > > > > > 
> > > > > > In the Linux implementation an async copy requires more resources than a
> > > > > > sync copy.  If nfsd cannot allocate these resources, the best response
> > > > > > is to simply perform the copy (or the first 4MB of it) synchronously.
> > > > > > 
> > > > > > This choice may be debatable if the unavailable resource was due to
> > > > > > memory allocation failure - when memalloc fails it might be best to
> > > > > > simply give up as the server is clearly under load.  However in the case
> > > > > > that policy prevents another kthread being created there is no benefit
> > > > > > and much cost is failing with NFS4ERR_DELAY.  In that case it seems
> > > > > > reasonable to avoid that error in all circumstances.
> > > > > > 
> > > > > > So change the out_err case to retry as a sync copy.
> > > > > > 
> > > > > > Fixes: aadc3bbea163 ("NFSD: Limit the number of concurrent async COPY operations")
> > > > > 
> > > > > Hi Neil,
> > > > > 
> > > > > Why is a Fixes: tag necessary?
> > > > > 
> > > > > And why that commit? async copies can fail due to lack of resources
> > > > > on kernels that don't have aadc3bbea163, AFAICT.
> > > > 
> > > > I had hoped my commit message would have explained that, though I accept
> > > > it was not as explicit as it could be.
> > > 
> > > The problem might be that you and I have different understandings of
> > > what exactly aadc3bbea163 does.
> > 
> > It might be.
> > My understanding is that it limits the number of concurrent async
> > COPY requests to ->sp_nrthreads and once that limit in reached
> > any further COPY requests that don't explicitly request "synchronous"
> > are refused with NFS4ERR_DELAY.
> > 
> > > > kmalloc(GFP_KERNEL) allocation failures aren't interesting.  They never
> > > > happen for smallish sizes, and if they do then the server is so borked
> > > > that it hardly matter what we do.
> > > > 
> > > > The fixed commit introduces a new failure mode that COULD easily be hit
> > > > in practice.  It causes the N+1st COPY to wait indefinitely until at
> > > > least one other copy completes which, as you observed in that commit,
> > > > could "run for a long time".  I don't think that behaviour is necessary
> > > > or appropriate.
> > > 
> > > The waiting happens on the client. An async COPY operation always
> > > completes quickly on the server, in this case with NFS4ERR_DELAY. It
> > > does not tie up an nfsd thread.
> > 
> > Agreed that it doesn't tie up an nfsd thread.  It does tie up a separate
> > kthread for which there is a limit matching the number of nfsd threads
> > (in the pool).
> > 
> > Agreed that the waiting happens on the client, but why should there be
> > any waiting at all?  The client doesn't know what it is waiting for, so
> > will typically wait a few seconds.  In that time many megabytes of sync
> > COPY could have been processed.
> 
> The Linux NFS client's usual delay is, IIRC, 100 msec with
> exponential backoff.

Yep, up to 15seconds.

> 
> It's possible that the number of async copies is large because they
> are running on a slow export. Adding more copy work is going to make
> the situation worse -- and by handling the work with a synchronous
> COPY, it will tie up threads that should be available for other
> work.

Should be available for "work" yes, but why "other work"?  Is COPY work
not as important as READ or WRITE or GETATTR work?
READ/WRITE are limited to 1MB, sync-COPY to 4MB so a small difference
that, but it doesn't seem substantial.

> 
> The feedback loop here should result in reducing server workload,
> not increasing it.

I agree with not increasing it.  I don't see the rational for reducing
workload, only for limiting it.

> 
> 
> > > By the way, there are two fixes in this area that now appear in
> > > v6.12-rc6 that you should check out.
> > 
> > I'll try to schedule time to have a look - thanks.
> > 
> > > > Changing the handling for kmalloc failure was just an irrelevant
> > > > side-effect for changing the behaviour when then number of COPY requests
> > > > exceeded the number of configured threads.
> > > 
> > > aadc3bbea163 checks the number of concurrent /async/ COPY requests,
> > > which do not tie up nfsd threads, and thus are not limited by the
> > > svc_thread count, as synchronous COPY operations are by definition.
> > 
> > They are PRECISELY limited by the svc_thread count.  ->sp_nrthreads.
> 
> I was describing the situation /before/ aadc3bbea163 , when there
> was no limit at all.
> 
> Note that is an arbitrary limit. We could pick something else if
> this limit interferes with the dynamic thread count changes.
> 
> 
> > My current thinking is that we should not start extra threads for
> > handling async copies.  We should create a queue of pending copies and
> > any nfsd thread can dequeue a copy and process 4MB each time through
> > "The main request loop" just like it calls nfsd_file_net_dispose() to do
> > a little bit of work.
> 
> Having nfsd threads handle this workload again invites a DoS vector.

Any more so that having nfsd thread handling a WRITE workload?

> 
> The 4MB chunk limit is there precisely to prevent synchronous COPY
> operations from tying up nfsd threads for too long. On a slow export,
> this is still not going to work, so I'd rather see a timer for this
> protection; say, 30ms, rather than a byte count. If more than 4MB
> can be handled quickly, that will be good for throughput.

That sounds like a good goal.  Ideally we would need a way to negotiate
a window with write-back throttling so that we don't bother reading
until we know that writing to the page-cache won't block.  Certainly
worth exploring.

> 
> Note that we still want to limit the number of background copy
> operations going on. I don't want a mechanism where a client can
> start an unbounded amount of work on the server.

This isn't obvious to me.  The server makes no promise concerning the
throughput it will provide.  Having a list of COPY requests that add up
to many terabytes isn't intrinsically a problem.  Having millions of
COPY requests in the list *might* be a little bit of a burden.  Using
__GFP_NORETRY might help put a natural limit on that.

> 
> 
> > > > This came up because CVE-2024-49974 was created so I had to do something
> > > > about the theoretical DoS vector in SLE kernels.  I didn't like the
> > > > patch so I backported
> > > > 
> > > > Commit 8d915bbf3926 ("NFSD: Force all NFSv4.2 COPY requests to be synchronous")
> > > > 
> > > > instead (and wondered why it hadn't gone to stable).
> > > 
> > > I was conservative about requesting a backport here. However, if a
> > > CVE has been filed, and if there is no automation behind that
> > > process, you can explicitly request aadc3bbea163 be backported.
> > > 
> > > The problem, to me, was less about server resource depletion and
> > > more about client hangs.
> > 
> > And yet the patch that dealt with the less important server resource
> > depletion was marked for stable, and the patch that dealt with client
> > hangs wasn't??
> > 
> > The CVE was for that less important patch, probably because it contained
> > the magic word "DoS".
> 
> Quite likely. I wasn't consulted before the CVE was opened, nor was
> I notified that it had been created.
> 
> Note that distributions are encouraged to evaluate whether a CVE is
> serious enough to address, rather than simply backporting the fixes
> automatically. But I know some customers want every CVE handled, so
> that is sometimes difficult.

Yes, it needs to be handled.  Declaring it invalid is certainly an
option for handling it.  I didn't quite feel I could justify that in
this case.

> 
> 
> > I think 8d915bbf3926 should go to stable but I would like to understand
> > why you felt the need to be conservative.
> 
> First, I'm told that LTS generally avoids taking backports that
> overtly change user-visible behavior like disabling server-to-server
> copy (which requires async COPY to work). That was the main reason
> for my hesitance.

Why does server-to-server require async COPY?
RFC 7862 section 4.5.  Inter-Server Copy says
  The destination server may perform the copy synchronously or
  asynchronously.
but I see that nfsd4_copy() returns nfs4err_notsupp if the inter-server
copy_is_sync(), but it isn't immediately obvious to me why.  The patch
which landed this functionality doesn't explain the restriction.

I guess that with multiple 4MB sync COPY requests the server would need
to repeatedly mount and unmount the source server which could be
unnecessary work - or would need to cache the mount and know when to
unmount it....

On the other hand, changing the user-visible behaviour of the client
unexpected hanging waiting for a server-side copy completion
notification that it will never get seems like a user-visible change
that would be desirable.

I'm starting to see why this is awkward.

> 
> But more importantly, the problem with the automatic backport
> mechanism is that marked patches are taken /immediately/ into
> stable. They don't get the kind of soak time that a normally-merged
> unmarked patch gets. The only way to ensure they get any real-world
> test experience at all is to not mark them, and then come back to
> them later and explicitly request a backport.
> 
> And, generally, we want to know that a patch destined for LTS
> kernels has actually been applied to and tested on LTS first.
> Automatically backported patches don't get that verification at all.

I thought it was possible to mark patches to tell the stable team
exactly what you want.  Greg certainly seems eager to give maintainers as
much control as they ask for - without requiring them to do anything
they don't want to do.  If you have a clear idea of what you want, it
might be good to spell that out and ask how to achieve it.

> 
> My overall preference is that Fixed: patches should be ignored by
> the automation, and that we have a designated NFSD LTS maintainer
> who will test patches on each LTS kernel and request their backport.
> I haven't found anyone to do that work, so we are limping along with
> the current situation. I recognize, however, that this needs to
> improve somehow with only the maintainer resources we have.

:-)

Thanks,
NeilBrown

> 
> 
> > > > > > Signed-off-by: NeilBrown <neilb@xxxxxxx>
> > > > > > ---
> > > > > >  fs/nfsd/nfs4proc.c | 4 ++--
> > > > > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > > > > 
> > > > > > diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
> > > > > > index fea171ffed62..06e0d9153ca9 100644
> > > > > > --- a/fs/nfsd/nfs4proc.c
> > > > > > +++ b/fs/nfsd/nfs4proc.c
> > > > > > @@ -1972,6 +1972,7 @@ nfsd4_copy(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
> > > > > >  		wake_up_process(async_copy->copy_task);
> > > > > >  		status = nfs_ok;
> > > > > >  	} else {
> > > > > > +	retry_sync:
> > > > > >  		status = nfsd4_do_copy(copy, copy->nf_src->nf_file,
> > > > > >  				       copy->nf_dst->nf_file, true);
> > > > > >  	}
> > > > > > @@ -1990,8 +1991,7 @@ nfsd4_copy(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
> > > > > >  	}
> > > > > >  	if (async_copy)
> > > > > >  		cleanup_async_copy(async_copy);
> > > > > > -	status = nfserr_jukebox;
> > > > > > -	goto out;
> > > > > > +	goto retry_sync;
> > > > > >  }
> > > > > >  
> > > > > >  static struct nfsd4_copy *
> > > > > > 
> > > > > > base-commit: 26e6e693936986309c01e8bb80e318d63fda4a44
> > > > > > -- 
> > > > > > 2.47.0
> > > > > > 
> > > > > 
> > > > > -- 
> > > > > Chuck Lever
> > > > > 
> > > > 
> > > 
> > > -- 
> > > Chuck Lever
> > > 
> > 
> 
> -- 
> Chuck Lever
>