Re: [PATCH] nfsd: fallback to sync COPY if async not possible

"NeilBrown" <neilb@xxxxxxx> · Tue, 05 Nov 2024 07:30:03 +1100

On Tue, 05 Nov 2024, Chuck Lever wrote:
> On Mon, Nov 04, 2024 at 03:47:42PM +1100, NeilBrown wrote:
> > 
> > An NFSv4.2 COPY request can explicitly request a synchronous copy.  If
> > that is not requested then the server is free to perform the copy
> > synchronously or asynchronously.
> > 
> > In the Linux implementation an async copy requires more resources than a
> > sync copy.  If nfsd cannot allocate these resources, the best response
> > is to simply perform the copy (or the first 4MB of it) synchronously.
> > 
> > This choice may be debatable if the unavailable resource was due to
> > memory allocation failure - when memalloc fails it might be best to
> > simply give up as the server is clearly under load.  However in the case
> > that policy prevents another kthread being created there is no benefit
> > and much cost is failing with NFS4ERR_DELAY.  In that case it seems
> > reasonable to avoid that error in all circumstances.
> > 
> > So change the out_err case to retry as a sync copy.
> > 
> > Fixes: aadc3bbea163 ("NFSD: Limit the number of concurrent async COPY operations")
> 
> Hi Neil,
> 
> Why is a Fixes: tag necessary?
> 
> And why that commit? async copies can fail due to lack of resources
> on kernels that don't have aadc3bbea163, AFAICT.

I had hoped my commit message would have explained that, though I accept
it was not as explicit as it could be.

kmalloc(GFP_KERNEL) allocation failures aren't interesting.  They never
happen for smallish sizes, and if they do then the server is so borked
that it hardly matter what we do.

The fixed commit introduces a new failure mode that COULD easily be hit
in practice.  It causes the N+1st COPY to wait indefinitely until at
least one other copy completes which, as you observed in that commit,
could "run for a long time".  I don't think that behaviour is necessary
or appropriate.

Changing the handling for kmalloc failure was just an irrelevant
side-effect for changing the behaviour when then number of COPY requests
exceeded the number of configured threads.

This came up because CVE-2024-49974 was created so I had to do something
about the theoretical DoS vector in SLE kernels.  I didn't like the
patch so I backported

Commit 8d915bbf3926 ("NFSD: Force all NFSv4.2 COPY requests to be synchronous")

instead (and wondered why it hadn't gone to stable).

Thanks,
NeilBrown

> 
> 
> > Signed-off-by: NeilBrown <neilb@xxxxxxx>
> > ---
> >  fs/nfsd/nfs4proc.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
> > index fea171ffed62..06e0d9153ca9 100644
> > --- a/fs/nfsd/nfs4proc.c
> > +++ b/fs/nfsd/nfs4proc.c
> > @@ -1972,6 +1972,7 @@ nfsd4_copy(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
> >  		wake_up_process(async_copy->copy_task);
> >  		status = nfs_ok;
> >  	} else {
> > +	retry_sync:
> >  		status = nfsd4_do_copy(copy, copy->nf_src->nf_file,
> >  				       copy->nf_dst->nf_file, true);
> >  	}
> > @@ -1990,8 +1991,7 @@ nfsd4_copy(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
> >  	}
> >  	if (async_copy)
> >  		cleanup_async_copy(async_copy);
> > -	status = nfserr_jukebox;
> > -	goto out;
> > +	goto retry_sync;
> >  }
> >  
> >  static struct nfsd4_copy *
> > 
> > base-commit: 26e6e693936986309c01e8bb80e318d63fda4a44
> > -- 
> > 2.47.0
> > 
> 
> -- 
> Chuck Lever
>