On Mon, May 06, 2024 at 04:37:15PM -0700, Dai Ngo wrote: > > On 5/6/24 2:04 PM, cel@xxxxxxxxxx wrote: > > From: Chuck Lever <chuck.lever@xxxxxxxxxx> > > > > We've discovered that delivering a CB_OFFLOAD operation can be > > unreliable in some pretty unremarkable situations, > > Since the fore and back channel use the same connection so I assume > this is not a connection related problem. This is totally a connection-related problem. The underlying issue is that NFSD does not retransmit backchannel requests when the connection is lost, and the Linux NFS client does not implement OFFLOAD_STATUS. Neither side right now recovers from connection loss while a COPY operation is pending. > Sounds like this is a bug that we should find and fix if possible > instead of work around it. I've been looking for a fix for the past several months. The last fix I put in, you asked me to revert. So, I would prefer to fix the root cause of this issue, but right now the best we can do is create a surgical patch that can be backported to LTS kernels, and keep working on a longer term fix. It's either temporarily force all COPY operations to become synchronous, or temporarily drop support for COPY in NFSD. Actually the latter sounds safer. > Do you know any scenarios where the CB_OFFLOAD operation is > unreliable? Any scenario where the connection is dropped (say, because the server wants the client to retransmit forechannel requests, or because of a GSS sequence number window under-run, or because of a network partition, etc) can potentially result in the loss of a backchannel operation. I can reproduce this issue 100% of the time with an NFSv4.2 mount from a 6.8.7-200.fc39.x86_64 NFS client, using the git regression suite. > > and the Linux > > NFS client does not yet support sending an OFFLOAD_STATUS > > operation to probe whether an asynchronous COPY operation has > > finished. On Linux NFS clients, COPY can hang until manually > > interrupted. > > > > I've tried a couple of remedies, but so far the side-effects are > > worse than the disease. For now, force COPY operations to be > > synchronous so that the use of CB_OFFLOAD is avoided entirely. > > > > I have some patches that add an OFFLOAD_STATUS implementation to the > > Linux NFS client, but that is not likely to fix older clients. > > > > Signed-off-by: Chuck Lever <chuck.lever@xxxxxxxxxx> > > --- > > fs/nfsd/nfs4proc.c | 7 +++++++ > > 1 file changed, 7 insertions(+) > > > > diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c > > index ea3cc3e870a7..12722c709cc6 100644 > > --- a/fs/nfsd/nfs4proc.c > > +++ b/fs/nfsd/nfs4proc.c > > @@ -1807,6 +1807,13 @@ nfsd4_copy(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate, > > __be32 status; > > struct nfsd4_copy *async_copy = NULL; > > + /* > > + * Currently, async COPY is not reliable. Force all COPY > > + * requests to be synchronous to avoid client application > > + * hangs waiting for completion. > > + */ > > + nfsd4_copy_set_sync(copy, true); > > + > > copy->cp_clp = cstate->clp; > > if (nfsd4_ssc_is_inter(copy)) { > > trace_nfsd_copy_inter(copy); > > > > base-commit: 939cb14d51a150e3c12ef7a8ce0ba04ce6131bd2 -- Chuck Lever