On Wed, 06 Nov 2024, Chuck Lever wrote: > On Tue, Nov 05, 2024 at 11:32:48AM +1100, NeilBrown wrote: > > On Tue, 05 Nov 2024, Chuck Lever wrote: > > > On Tue, Nov 05, 2024 at 07:30:03AM +1100, NeilBrown wrote: > > > > On Tue, 05 Nov 2024, Chuck Lever wrote: > > > > > On Mon, Nov 04, 2024 at 03:47:42PM +1100, NeilBrown wrote: > > > > > > > > > > > > An NFSv4.2 COPY request can explicitly request a synchronous copy. If > > > > > > that is not requested then the server is free to perform the copy > > > > > > synchronously or asynchronously. > > > > > > > > > > > > In the Linux implementation an async copy requires more resources than a > > > > > > sync copy. If nfsd cannot allocate these resources, the best response > > > > > > is to simply perform the copy (or the first 4MB of it) synchronously. > > > > > > > > > > > > This choice may be debatable if the unavailable resource was due to > > > > > > memory allocation failure - when memalloc fails it might be best to > > > > > > simply give up as the server is clearly under load. However in the case > > > > > > that policy prevents another kthread being created there is no benefit > > > > > > and much cost is failing with NFS4ERR_DELAY. In that case it seems > > > > > > reasonable to avoid that error in all circumstances. > > > > > > > > > > > > So change the out_err case to retry as a sync copy. > > > > > > > > > > > > Fixes: aadc3bbea163 ("NFSD: Limit the number of concurrent async COPY operations") > > > > > > > > > > Hi Neil, > > > > > > > > > > Why is a Fixes: tag necessary? > > > > > > > > > > And why that commit? async copies can fail due to lack of resources > > > > > on kernels that don't have aadc3bbea163, AFAICT. > > > > > > > > I had hoped my commit message would have explained that, though I accept > > > > it was not as explicit as it could be. > > > > > > The problem might be that you and I have different understandings of > > > what exactly aadc3bbea163 does. > > > > It might be. > > My understanding is that it limits the number of concurrent async > > COPY requests to ->sp_nrthreads and once that limit in reached > > any further COPY requests that don't explicitly request "synchronous" > > are refused with NFS4ERR_DELAY. > > > > > > kmalloc(GFP_KERNEL) allocation failures aren't interesting. They never > > > > happen for smallish sizes, and if they do then the server is so borked > > > > that it hardly matter what we do. > > > > > > > > The fixed commit introduces a new failure mode that COULD easily be hit > > > > in practice. It causes the N+1st COPY to wait indefinitely until at > > > > least one other copy completes which, as you observed in that commit, > > > > could "run for a long time". I don't think that behaviour is necessary > > > > or appropriate. > > > > > > The waiting happens on the client. An async COPY operation always > > > completes quickly on the server, in this case with NFS4ERR_DELAY. It > > > does not tie up an nfsd thread. > > > > Agreed that it doesn't tie up an nfsd thread. It does tie up a separate > > kthread for which there is a limit matching the number of nfsd threads > > (in the pool). > > > > Agreed that the waiting happens on the client, but why should there be > > any waiting at all? The client doesn't know what it is waiting for, so > > will typically wait a few seconds. In that time many megabytes of sync > > COPY could have been processed. > > The Linux NFS client's usual delay is, IIRC, 100 msec with > exponential backoff. Yep, up to 15seconds. > > It's possible that the number of async copies is large because they > are running on a slow export. Adding more copy work is going to make > the situation worse -- and by handling the work with a synchronous > COPY, it will tie up threads that should be available for other > work. Should be available for "work" yes, but why "other work"? Is COPY work not as important as READ or WRITE or GETATTR work? READ/WRITE are limited to 1MB, sync-COPY to 4MB so a small difference that, but it doesn't seem substantial. > > The feedback loop here should result in reducing server workload, > not increasing it. I agree with not increasing it. I don't see the rational for reducing workload, only for limiting it. > > > > > By the way, there are two fixes in this area that now appear in > > > v6.12-rc6 that you should check out. > > > > I'll try to schedule time to have a look - thanks. > > > > > > Changing the handling for kmalloc failure was just an irrelevant > > > > side-effect for changing the behaviour when then number of COPY requests > > > > exceeded the number of configured threads. > > > > > > aadc3bbea163 checks the number of concurrent /async/ COPY requests, > > > which do not tie up nfsd threads, and thus are not limited by the > > > svc_thread count, as synchronous COPY operations are by definition. > > > > They are PRECISELY limited by the svc_thread count. ->sp_nrthreads. > > I was describing the situation /before/ aadc3bbea163 , when there > was no limit at all. > > Note that is an arbitrary limit. We could pick something else if > this limit interferes with the dynamic thread count changes. > > > > My current thinking is that we should not start extra threads for > > handling async copies. We should create a queue of pending copies and > > any nfsd thread can dequeue a copy and process 4MB each time through > > "The main request loop" just like it calls nfsd_file_net_dispose() to do > > a little bit of work. > > Having nfsd threads handle this workload again invites a DoS vector. Any more so that having nfsd thread handling a WRITE workload? > > The 4MB chunk limit is there precisely to prevent synchronous COPY > operations from tying up nfsd threads for too long. On a slow export, > this is still not going to work, so I'd rather see a timer for this > protection; say, 30ms, rather than a byte count. If more than 4MB > can be handled quickly, that will be good for throughput. That sounds like a good goal. Ideally we would need a way to negotiate a window with write-back throttling so that we don't bother reading until we know that writing to the page-cache won't block. Certainly worth exploring. > > Note that we still want to limit the number of background copy > operations going on. I don't want a mechanism where a client can > start an unbounded amount of work on the server. This isn't obvious to me. The server makes no promise concerning the throughput it will provide. Having a list of COPY requests that add up to many terabytes isn't intrinsically a problem. Having millions of COPY requests in the list *might* be a little bit of a burden. Using __GFP_NORETRY might help put a natural limit on that. > > > > > > This came up because CVE-2024-49974 was created so I had to do something > > > > about the theoretical DoS vector in SLE kernels. I didn't like the > > > > patch so I backported > > > > > > > > Commit 8d915bbf3926 ("NFSD: Force all NFSv4.2 COPY requests to be synchronous") > > > > > > > > instead (and wondered why it hadn't gone to stable). > > > > > > I was conservative about requesting a backport here. However, if a > > > CVE has been filed, and if there is no automation behind that > > > process, you can explicitly request aadc3bbea163 be backported. > > > > > > The problem, to me, was less about server resource depletion and > > > more about client hangs. > > > > And yet the patch that dealt with the less important server resource > > depletion was marked for stable, and the patch that dealt with client > > hangs wasn't?? > > > > The CVE was for that less important patch, probably because it contained > > the magic word "DoS". > > Quite likely. I wasn't consulted before the CVE was opened, nor was > I notified that it had been created. > > Note that distributions are encouraged to evaluate whether a CVE is > serious enough to address, rather than simply backporting the fixes > automatically. But I know some customers want every CVE handled, so > that is sometimes difficult. Yes, it needs to be handled. Declaring it invalid is certainly an option for handling it. I didn't quite feel I could justify that in this case. > > > > I think 8d915bbf3926 should go to stable but I would like to understand > > why you felt the need to be conservative. > > First, I'm told that LTS generally avoids taking backports that > overtly change user-visible behavior like disabling server-to-server > copy (which requires async COPY to work). That was the main reason > for my hesitance. Why does server-to-server require async COPY? RFC 7862 section 4.5. Inter-Server Copy says The destination server may perform the copy synchronously or asynchronously. but I see that nfsd4_copy() returns nfs4err_notsupp if the inter-server copy_is_sync(), but it isn't immediately obvious to me why. The patch which landed this functionality doesn't explain the restriction. I guess that with multiple 4MB sync COPY requests the server would need to repeatedly mount and unmount the source server which could be unnecessary work - or would need to cache the mount and know when to unmount it.... On the other hand, changing the user-visible behaviour of the client unexpected hanging waiting for a server-side copy completion notification that it will never get seems like a user-visible change that would be desirable. I'm starting to see why this is awkward. > > But more importantly, the problem with the automatic backport > mechanism is that marked patches are taken /immediately/ into > stable. They don't get the kind of soak time that a normally-merged > unmarked patch gets. The only way to ensure they get any real-world > test experience at all is to not mark them, and then come back to > them later and explicitly request a backport. > > And, generally, we want to know that a patch destined for LTS > kernels has actually been applied to and tested on LTS first. > Automatically backported patches don't get that verification at all. I thought it was possible to mark patches to tell the stable team exactly what you want. Greg certainly seems eager to give maintainers as much control as they ask for - without requiring them to do anything they don't want to do. If you have a clear idea of what you want, it might be good to spell that out and ask how to achieve it. > > My overall preference is that Fixed: patches should be ignored by > the automation, and that we have a designated NFSD LTS maintainer > who will test patches on each LTS kernel and request their backport. > I haven't found anyone to do that work, so we are limping along with > the current situation. I recognize, however, that this needs to > improve somehow with only the maintainer resources we have. :-) Thanks, NeilBrown > > > > > > > > Signed-off-by: NeilBrown <neilb@xxxxxxx> > > > > > > --- > > > > > > fs/nfsd/nfs4proc.c | 4 ++-- > > > > > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > > > > > > > > > diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c > > > > > > index fea171ffed62..06e0d9153ca9 100644 > > > > > > --- a/fs/nfsd/nfs4proc.c > > > > > > +++ b/fs/nfsd/nfs4proc.c > > > > > > @@ -1972,6 +1972,7 @@ nfsd4_copy(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate, > > > > > > wake_up_process(async_copy->copy_task); > > > > > > status = nfs_ok; > > > > > > } else { > > > > > > + retry_sync: > > > > > > status = nfsd4_do_copy(copy, copy->nf_src->nf_file, > > > > > > copy->nf_dst->nf_file, true); > > > > > > } > > > > > > @@ -1990,8 +1991,7 @@ nfsd4_copy(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate, > > > > > > } > > > > > > if (async_copy) > > > > > > cleanup_async_copy(async_copy); > > > > > > - status = nfserr_jukebox; > > > > > > - goto out; > > > > > > + goto retry_sync; > > > > > > } > > > > > > > > > > > > static struct nfsd4_copy * > > > > > > > > > > > > base-commit: 26e6e693936986309c01e8bb80e318d63fda4a44 > > > > > > -- > > > > > > 2.47.0 > > > > > > > > > > > > > > > > -- > > > > > Chuck Lever > > > > > > > > > > > > > > > -- > > > Chuck Lever > > > > > > > -- > Chuck Lever >