On Tue, Nov 5, 2024 at 4:06 PM NeilBrown <neilb@xxxxxxx> wrote:
>
> On Wed, 06 Nov 2024, Chuck Lever wrote:
> > On Tue, Nov 05, 2024 at 11:32:48AM +1100, NeilBrown wrote:
> > > On Tue, 05 Nov 2024, Chuck Lever wrote:
> > > > On Tue, Nov 05, 2024 at 07:30:03AM +1100, NeilBrown wrote:
> > > > > On Tue, 05 Nov 2024, Chuck Lever wrote:
> > > > > > On Mon, Nov 04, 2024 at 03:47:42PM +1100, NeilBrown wrote:
> > > > > > >
> > > > > > > An NFSv4.2 COPY request can explicitly request a synchronous copy.  If
> > > > > > > that is not requested then the server is free to perform the copy
> > > > > > > synchronously or asynchronously.
> > > > > > >
> > > > > > > In the Linux implementation an async copy requires more resources than a
> > > > > > > sync copy.  If nfsd cannot allocate these resources, the best response
> > > > > > > is to simply perform the copy (or the first 4MB of it) synchronously.
> > > > > > >
> > > > > > > This choice may be debatable if the unavailable resource was due to a
> > > > > > > memory allocation failure - when memalloc fails it might be best to
> > > > > > > simply give up, as the server is clearly under load.  However in the case
> > > > > > > that policy prevents another kthread from being created there is no benefit
> > > > > > > and much cost in failing with NFS4ERR_DELAY.  In that case it seems
> > > > > > > reasonable to avoid that error in all circumstances.
> > > > > > >
> > > > > > > So change the out_err case to retry as a sync copy.
> > > > > > >
> > > > > > > Fixes: aadc3bbea163 ("NFSD: Limit the number of concurrent async COPY operations")
> > > > > >
> > > > > > Hi Neil,
> > > > > >
> > > > > > Why is a Fixes: tag necessary?
> > > > > >
> > > > > > And why that commit?  Async copies can fail due to lack of resources
> > > > > > on kernels that don't have aadc3bbea163, AFAICT.
> > > > >
> > > > > I had hoped my commit message would have explained that, though I accept
> > > > > it was not as explicit as it could be.
> > > >
> > > > The problem might be that you and I have different understandings of
> > > > what exactly aadc3bbea163 does.
> > >
> > > It might be.
> > > My understanding is that it limits the number of concurrent async
> > > COPY requests to ->sp_nrthreads, and once that limit is reached
> > > any further COPY requests that don't explicitly request "synchronous"
> > > are refused with NFS4ERR_DELAY.
> > >
> > > > > kmalloc(GFP_KERNEL) allocation failures aren't interesting.  They never
> > > > > happen for smallish sizes, and if they do then the server is so borked
> > > > > that it hardly matters what we do.
> > > > >
> > > > > The fixed commit introduces a new failure mode that COULD easily be hit
> > > > > in practice.  It causes the N+1st COPY to wait indefinitely until at
> > > > > least one other copy completes which, as you observed in that commit,
> > > > > could "run for a long time".  I don't think that behaviour is necessary
> > > > > or appropriate.
> > > >
> > > > The waiting happens on the client.  An async COPY operation always
> > > > completes quickly on the server, in this case with NFS4ERR_DELAY.  It
> > > > does not tie up an nfsd thread.
> > >
> > > Agreed that it doesn't tie up an nfsd thread.  It does tie up a separate
> > > kthread, for which there is a limit matching the number of nfsd threads
> > > (in the pool).
> > >
> > > Agreed that the waiting happens on the client, but why should there be
> > > any waiting at all?  The client doesn't know what it is waiting for, so
> > > will typically wait a few seconds.  In that time many megabytes of sync
> > > COPY could have been processed.
> >
> > The Linux NFS client's usual delay is, IIRC, 100 msec with
> > exponential backoff.
>
> Yep, up to 15 seconds.
>
> >
> > It's possible that the number of async copies is large because they
> > are running on a slow export.
> > Adding more copy work is going to make
> > the situation worse -- and by handling the work with a synchronous
> > COPY, it will tie up threads that should be available for other
> > work.
>
> Should be available for "work", yes, but why "other work"?  Is COPY work
> not as important as READ or WRITE or GETATTR work?
> READ/WRITE are limited to 1MB, sync-COPY to 4MB, so a small difference
> there, but it doesn't seem substantial.
>
> >
> > The feedback loop here should result in reducing server workload,
> > not increasing it.
>
> I agree with not increasing it.  I don't see the rationale for reducing
> workload, only for limiting it.
>
> > > >
> > > > By the way, there are two fixes in this area that now appear in
> > > > v6.12-rc6 that you should check out.
> > >
> > > I'll try to schedule time to have a look - thanks.
> > >
> > > > > Changing the handling for kmalloc failure was just an irrelevant
> > > > > side-effect of changing the behaviour when the number of COPY requests
> > > > > exceeded the number of configured threads.
> > > >
> > > > aadc3bbea163 checks the number of concurrent /async/ COPY requests,
> > > > which do not tie up nfsd threads, and thus are not limited by the
> > > > svc_thread count, as synchronous COPY operations are by definition.
> > >
> > > They are PRECISELY limited by the svc_thread count.  ->sp_nrthreads.
> >
> > I was describing the situation /before/ aadc3bbea163, when there
> > was no limit at all.
> >
> > Note that this is an arbitrary limit.  We could pick something else if
> > this limit interferes with the dynamic thread count changes.
> >
> > > My current thinking is that we should not start extra threads for
> > > handling async copies.  We should create a queue of pending copies, and
> > > any nfsd thread can dequeue a copy and process 4MB each time through
> > > "The main request loop", just like it calls nfsd_file_net_dispose() to do
> > > a little bit of work.
> >
> > Having nfsd threads handle this workload again invites a DoS vector.
>
> Any more so than having nfsd threads handle a WRITE workload?

Isn't one difference between a COPY and a WRITE the fact that the server
has the ability to restrict the TCP window of the client sending the
bytes?  For simplicity's sake, if we assume the client and server have a
single TCP stream, then when the window size is limited other WRITEs are
also prevented from sending more data.  But a COPY operation doesn't
require much to be sent, and preventing the client from sending another
COPY can't be achieved through the TCP window size.

> > The 4MB chunk limit is there precisely to prevent synchronous COPY
> > operations from tying up nfsd threads for too long.  On a slow export,
> > this is still not going to work, so I'd rather see a timer for this
> > protection; say, 30ms, rather than a byte count.  If more than 4MB
> > can be handled quickly, that will be good for throughput.
>
> That sounds like a good goal.  Ideally we would need a way to negotiate
> a window with write-back throttling so that we don't bother reading
> until we know that writing to the page-cache won't block.  Certainly
> worth exploring.
>
> >
> > Note that we still want to limit the number of background copy
> > operations going on.  I don't want a mechanism where a client can
> > start an unbounded amount of work on the server.
>
> This isn't obvious to me.  The server makes no promise concerning the
> throughput it will provide.  Having a list of COPY requests that add up
> to many terabytes isn't intrinsically a problem.  Having millions of
> COPY requests in the list *might* be a little bit of a burden.  Using
> __GFP_NORETRY might help put a natural limit on that.
>
> > > > > This came up because CVE-2024-49974 was created, so I had to do something
> > > > > about the theoretical DoS vector in SLE kernels.
> > > > > I didn't like the
> > > > > patch so I backported
> > > > >
> > > > > Commit 8d915bbf3926 ("NFSD: Force all NFSv4.2 COPY requests to be synchronous")

I'm doing the same for RHEL.  But the fact that a CVE was created seems
like a flaw of the CVE creation process in this case.  It should never
have been made a CVE.

> > > > >
> > > > > instead (and wondered why it hadn't gone to stable).
> > > >
> > > > I was conservative about requesting a backport here.  However, if a
> > > > CVE has been filed, and if there is no automation behind that
> > > > process, you can explicitly request aadc3bbea163 be backported.
> > > >
> > > > The problem, to me, was less about server resource depletion and
> > > > more about client hangs.
> > >
> > > And yet the patch that dealt with the less important server resource
> > > depletion was marked for stable, and the patch that dealt with client
> > > hangs wasn't??
> > >
> > > The CVE was for that less important patch, probably because it contained
> > > the magic word "DoS".
> >
> > Quite likely.  I wasn't consulted before the CVE was opened, nor was
> > I notified that it had been created.
> >
> > Note that distributions are encouraged to evaluate whether a CVE is
> > serious enough to address, rather than simply backporting the fixes
> > automatically.  But I know some customers want every CVE handled, so
> > that is sometimes difficult.
>
> Yes, it needs to be handled.  Declaring it invalid is certainly an
> option for handling it.  I didn't quite feel I could justify that in
> this case.
>
> > > I think 8d915bbf3926 should go to stable, but I would like to understand
> > > why you felt the need to be conservative.
> >
> > First, I'm told that LTS generally avoids taking backports that
> > overtly change user-visible behavior, like disabling server-to-server
> > copy (which requires async COPY to work).  That was the main reason
> > for my hesitance.
>
> Why does server-to-server copy require async COPY?
> RFC 7862 section 4.5.
> Inter-Server Copy says
>     The destination server may perform the copy synchronously or
>     asynchronously.
> but I see that nfsd4_copy() returns nfs4err_notsupp if the inter-server
> copy_is_sync(), and it isn't immediately obvious to me why.  The patch
> which landed this functionality doesn't explain the restriction.

The choice (my choice) not to implement a synchronous inter-server copy
came from my understanding that such an operation would take too much
time and tie up a knfsd thread.

> I guess that with multiple 4MB sync COPY requests the server would need
> to repeatedly mount and unmount the source server, which could be
> unnecessary work - or would need to cache the mount and know when to
> unmount it....

The implementation caches the (source server) mount for a period of
time.  But needing to issue multiple COPY requests can eventually impact
performance.

> On the other hand, removing the user-visible behaviour of the client
> unexpectedly hanging while waiting for a server-side copy completion
> notification that it will never get seems like a user-visible change
> that would be desirable.
> I'm starting to see why this is awkward.
>
> >
> > But more importantly, the problem with the automatic backport
> > mechanism is that marked patches are taken /immediately/ into
> > stable.  They don't get the kind of soak time that a normally-merged
> > unmarked patch gets.  The only way to ensure they get any real-world
> > test experience at all is to not mark them, and then come back to
> > them later and explicitly request a backport.
> >
> > And, generally, we want to know that a patch destined for LTS
> > kernels has actually been applied to and tested on LTS first.
> > Automatically backported patches don't get that verification at all.
>
> I thought it was possible to mark patches to tell the stable team
> exactly what you want.  Greg certainly seems eager to give maintainers as
> much control as they ask for - without requiring them to do anything
> they don't want to do.
> If you have a clear idea of what you want, it
> might be good to spell that out and ask how to achieve it.
>
> >
> > My overall preference is that Fixes: patches should be ignored by
> > the automation, and that we have a designated NFSD LTS maintainer
> > who will test patches on each LTS kernel and request their backport.
> > I haven't found anyone to do that work, so we are limping along with
> > the current situation.  I recognize, however, that this needs to
> > improve somehow with only the maintainer resources we have.
>
> :-)
>
> Thanks,
> NeilBrown
>
> > > > > > > Signed-off-by: NeilBrown <neilb@xxxxxxx>
> > > > > > > ---
> > > > > > >  fs/nfsd/nfs4proc.c | 4 ++--
> > > > > > >  1 file changed, 2 insertions(+), 2 deletions(-)
> > > > > > >
> > > > > > > diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
> > > > > > > index fea171ffed62..06e0d9153ca9 100644
> > > > > > > --- a/fs/nfsd/nfs4proc.c
> > > > > > > +++ b/fs/nfsd/nfs4proc.c
> > > > > > > @@ -1972,6 +1972,7 @@ nfsd4_copy(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
> > > > > > >  		wake_up_process(async_copy->copy_task);
> > > > > > >  		status = nfs_ok;
> > > > > > >  	} else {
> > > > > > > +	retry_sync:
> > > > > > >  		status = nfsd4_do_copy(copy, copy->nf_src->nf_file,
> > > > > > >  					copy->nf_dst->nf_file, true);
> > > > > > >  	}
> > > > > > > @@ -1990,8 +1991,7 @@ nfsd4_copy(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
> > > > > > >  	}
> > > > > > >  	if (async_copy)
> > > > > > >  		cleanup_async_copy(async_copy);
> > > > > > > -	status = nfserr_jukebox;
> > > > > > > -	goto out;
> > > > > > > +	goto retry_sync;
> > > > > > >  }
> > > > > > >
> > > > > > >  static struct nfsd4_copy *
> > > > > > >
> > > > > > > base-commit: 26e6e693936986309c01e8bb80e318d63fda4a44
> > > > > > > --
> > > > > > > 2.47.0
> > > > > >
> > > > > > --
> > > > > > Chuck Lever
> > > >
> > > > --
> > > > Chuck Lever
> >
> > --
> > Chuck Lever