Re: [PATCH] nfsd: fallback to sync COPY if async not possible

Chuck Lever <chuck.lever@xxxxxxxxxx> · Tue, 5 Nov 2024 09:46:09 -0500

On Tue, Nov 05, 2024 at 11:32:48AM +1100, NeilBrown wrote:
> On Tue, 05 Nov 2024, Chuck Lever wrote:
> > On Tue, Nov 05, 2024 at 07:30:03AM +1100, NeilBrown wrote:
> > > On Tue, 05 Nov 2024, Chuck Lever wrote:
> > > > On Mon, Nov 04, 2024 at 03:47:42PM +1100, NeilBrown wrote:
> > > > > 
> > > > > An NFSv4.2 COPY request can explicitly request a synchronous copy.  If
> > > > > that is not requested then the server is free to perform the copy
> > > > > synchronously or asynchronously.
> > > > > 
> > > > > In the Linux implementation an async copy requires more resources than a
> > > > > sync copy.  If nfsd cannot allocate these resources, the best response
> > > > > is to simply perform the copy (or the first 4MB of it) synchronously.
> > > > > 
> > > > > This choice may be debatable if the unavailable resource was due to
> > > > > memory allocation failure - when memalloc fails it might be best to
> > > > > simply give up as the server is clearly under load.  However in the case
> > > > > that policy prevents another kthread being created there is no benefit
> > > > > and much cost is failing with NFS4ERR_DELAY.  In that case it seems
> > > > > reasonable to avoid that error in all circumstances.
> > > > > 
> > > > > So change the out_err case to retry as a sync copy.
> > > > > 
> > > > > Fixes: aadc3bbea163 ("NFSD: Limit the number of concurrent async COPY operations")
> > > > 
> > > > Hi Neil,
> > > > 
> > > > Why is a Fixes: tag necessary?
> > > > 
> > > > And why that commit? async copies can fail due to lack of resources
> > > > on kernels that don't have aadc3bbea163, AFAICT.
> > > 
> > > I had hoped my commit message would have explained that, though I accept
> > > it was not as explicit as it could be.
> > 
> > The problem might be that you and I have different understandings of
> > what exactly aadc3bbea163 does.
> 
> It might be.
> My understanding is that it limits the number of concurrent async
> COPY requests to ->sp_nrthreads and once that limit in reached
> any further COPY requests that don't explicitly request "synchronous"
> are refused with NFS4ERR_DELAY.
> 
> > > kmalloc(GFP_KERNEL) allocation failures aren't interesting.  They never
> > > happen for smallish sizes, and if they do then the server is so borked
> > > that it hardly matter what we do.
> > > 
> > > The fixed commit introduces a new failure mode that COULD easily be hit
> > > in practice.  It causes the N+1st COPY to wait indefinitely until at
> > > least one other copy completes which, as you observed in that commit,
> > > could "run for a long time".  I don't think that behaviour is necessary
> > > or appropriate.
> > 
> > The waiting happens on the client. An async COPY operation always
> > completes quickly on the server, in this case with NFS4ERR_DELAY. It
> > does not tie up an nfsd thread.
> 
> Agreed that it doesn't tie up an nfsd thread.  It does tie up a separate
> kthread for which there is a limit matching the number of nfsd threads
> (in the pool).
> 
> Agreed that the waiting happens on the client, but why should there be
> any waiting at all?  The client doesn't know what it is waiting for, so
> will typically wait a few seconds.  In that time many megabytes of sync
> COPY could have been processed.

The Linux NFS client's usual delay is, IIRC, 100 msec with
exponential backoff.

It's possible that the number of async copies is large because they
are running on a slow export. Adding more copy work is going to make
the situation worse -- and by handling the work with a synchronous
COPY, it will tie up threads that should be available for other
work.

The feedback loop here should result in reducing server workload,
not increasing it.

> > By the way, there are two fixes in this area that now appear in
> > v6.12-rc6 that you should check out.
> 
> I'll try to schedule time to have a look - thanks.
> 
> > > Changing the handling for kmalloc failure was just an irrelevant
> > > side-effect for changing the behaviour when then number of COPY requests
> > > exceeded the number of configured threads.
> > 
> > aadc3bbea163 checks the number of concurrent /async/ COPY requests,
> > which do not tie up nfsd threads, and thus are not limited by the
> > svc_thread count, as synchronous COPY operations are by definition.
> 
> They are PRECISELY limited by the svc_thread count.  ->sp_nrthreads.

I was describing the situation /before/ aadc3bbea163 , when there
was no limit at all.

Note that is an arbitrary limit. We could pick something else if
this limit interferes with the dynamic thread count changes.

> My current thinking is that we should not start extra threads for
> handling async copies.  We should create a queue of pending copies and
> any nfsd thread can dequeue a copy and process 4MB each time through
> "The main request loop" just like it calls nfsd_file_net_dispose() to do
> a little bit of work.

Having nfsd threads handle this workload again invites a DoS vector.

The 4MB chunk limit is there precisely to prevent synchronous COPY
operations from tying up nfsd threads for too long. On a slow export,
this is still not going to work, so I'd rather see a timer for this
protection; say, 30ms, rather than a byte count. If more than 4MB
can be handled quickly, that will be good for throughput.

Note that we still want to limit the number of background copy
operations going on. I don't want a mechanism where a client can
start an unbounded amount of work on the server.

> > > This came up because CVE-2024-49974 was created so I had to do something
> > > about the theoretical DoS vector in SLE kernels.  I didn't like the
> > > patch so I backported
> > > 
> > > Commit 8d915bbf3926 ("NFSD: Force all NFSv4.2 COPY requests to be synchronous")
> > > 
> > > instead (and wondered why it hadn't gone to stable).
> > 
> > I was conservative about requesting a backport here. However, if a
> > CVE has been filed, and if there is no automation behind that
> > process, you can explicitly request aadc3bbea163 be backported.
> > 
> > The problem, to me, was less about server resource depletion and
> > more about client hangs.
> 
> And yet the patch that dealt with the less important server resource
> depletion was marked for stable, and the patch that dealt with client
> hangs wasn't??
> 
> The CVE was for that less important patch, probably because it contained
> the magic word "DoS".

Quite likely. I wasn't consulted before the CVE was opened, nor was
I notified that it had been created.

Note that distributions are encouraged to evaluate whether a CVE is
serious enough to address, rather than simply backporting the fixes
automatically. But I know some customers want every CVE handled, so
that is sometimes difficult.

> I think 8d915bbf3926 should go to stable but I would like to understand
> why you felt the need to be conservative.

First, I'm told that LTS generally avoids taking backports that
overtly change user-visible behavior like disabling server-to-server
copy (which requires async COPY to work). That was the main reason
for my hesitance.

But more importantly, the problem with the automatic backport
mechanism is that marked patches are taken /immediately/ into
stable. They don't get the kind of soak time that a normally-merged
unmarked patch gets. The only way to ensure they get any real-world
test experience at all is to not mark them, and then come back to
them later and explicitly request a backport.

And, generally, we want to know that a patch destined for LTS
kernels has actually been applied to and tested on LTS first.
Automatically backported patches don't get that verification at all.

My overall preference is that Fixed: patches should be ignored by
the automation, and that we have a designated NFSD LTS maintainer
who will test patches on each LTS kernel and request their backport.
I haven't found anyone to do that work, so we are limping along with
the current situation. I recognize, however, that this needs to
improve somehow with only the maintainer resources we have.

> > > > > Signed-off-by: NeilBrown <neilb@xxxxxxx>
> > > > > ---
> > > > >  fs/nfsd/nfs4proc.c | 4 ++--
> > > > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > > > 
> > > > > diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
> > > > > index fea171ffed62..06e0d9153ca9 100644
> > > > > --- a/fs/nfsd/nfs4proc.c
> > > > > +++ b/fs/nfsd/nfs4proc.c
> > > > > @@ -1972,6 +1972,7 @@ nfsd4_copy(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
> > > > >  		wake_up_process(async_copy->copy_task);
> > > > >  		status = nfs_ok;
> > > > >  	} else {
> > > > > +	retry_sync:
> > > > >  		status = nfsd4_do_copy(copy, copy->nf_src->nf_file,
> > > > >  				       copy->nf_dst->nf_file, true);
> > > > >  	}
> > > > > @@ -1990,8 +1991,7 @@ nfsd4_copy(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
> > > > >  	}
> > > > >  	if (async_copy)
> > > > >  		cleanup_async_copy(async_copy);
> > > > > -	status = nfserr_jukebox;
> > > > > -	goto out;
> > > > > +	goto retry_sync;
> > > > >  }
> > > > >  
> > > > >  static struct nfsd4_copy *
> > > > > 
> > > > > base-commit: 26e6e693936986309c01e8bb80e318d63fda4a44
> > > > > -- 
> > > > > 2.47.0
> > > > > 
> > > > 
> > > > -- 
> > > > Chuck Lever
> > > > 
> > > 
> > 
> > -- 
> > Chuck Lever
> > 
> 

-- 
Chuck Lever