On Wed, Aug 11, 2021 at 1:30 PM Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote:
>
> > On Aug 11, 2021, at 12:20 PM, Timo Rothenpieler <timo@xxxxxxxxxxxxxxxx> wrote:
> >
> > resulting dmesg and trace logs of both client and server are attached.
> >
> > Test procedure:
> >
> > - start tracing on client and server
> > - mount NFS on client
> > - immediately run 'xfs_io -fc "copy_range testfile" testfile.copy' (which succeeds)
> > - wait 10~15 minutes for the backchannel to time out (still running 5.12.19 with the fix for that reverted)
> > - run xfs_io command again, getting stuck now
> > - let it sit there stuck for a minute, then cancel it
> > - run the command again
> > - while it's still stuck, finished recording the logs and traces
>
> The server tries to send CB_OFFLOAD when the offloaded copy
> completes, but finds the backchannel transport is not connected.
>
> The server can't report the problem until the client sends a
> SEQUENCE operation, but there's really no other traffic going
> on, so it just waits.
>
> The client eventually sends a singleton SEQUENCE to renew its
> lease. The server replies with the SEQ4_STATUS_BACKCHANNEL_FAULT
> flag set at that point. Client's recovery is to destroy that
> session and create a new one. That appears to be successful.
>
> But the server doesn't send another CB_OFFLOAD to let the client
> know the copy is complete, so the client hangs.
>
> This seems to be peculiar to COPY_OFFLOAD, but I wonder if the
> other CB operations suffer from the same "failed to retransmit
> after the CB path is restored" issue. It might not matter for
> some of them, but for others like CB_RECALL, that could be
> important.

Thank you for the analysis Chuck (btw I haven't seen any attachments
with Timo's posts so I'm assuming some offline communication must have
happened).

I'm looking at the code and wouldn't the mentioned flags be set on the
CB_SEQUENCE operation?
nfsd4_cb_done() has code to mark the channel and retry. Put another
way, this code should generically handle retrying whatever operation
it is, be it CB_OFFLOAD or CB_RECALL. Is that not working? (I'm not
sure whether this is a question or a statement.) I would think that
would be the place to handle this problem.

>
> --
> Chuck Lever
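
To make the sequence under discussion concrete, here is a toy model in plain Python. It is only a sketch of the behaviour described in this thread, not kernel code: `ToyServer`, `requeue_on_failure` and `run` are made-up names that do not correspond to anything in nfsd, and the boolean `backchannel_fault` merely stands in for the SEQ4_STATUS_BACKCHANNEL_FAULT flag. It contrasts a server that drops a callback it could not send with one that keeps it queued and resends once the client has created a new session.

```python
from collections import deque


class ToyServer:
    """Toy model of a server-side callback queue; NOT the nfsd implementation."""

    def __init__(self, requeue_on_failure):
        self.backchannel_up = True
        # Stands in for SEQ4_STATUS_BACKCHANNEL_FAULT in SEQUENCE replies.
        self.backchannel_fault = False
        # Hypothetical switch: keep callbacks that could not be sent?
        self.requeue_on_failure = requeue_on_failure
        self.pending = deque()   # callbacks waiting for a working backchannel
        self.delivered = []      # callbacks the client actually received

    def send_callback(self, op):
        if self.backchannel_up:
            self.delivered.append(op)
            return
        # Transport is down: remember the fault so the next SEQUENCE reports it.
        self.backchannel_fault = True
        if self.requeue_on_failure:
            self.pending.append(op)
        # Otherwise the callback is dropped, which models the reported hang.

    def sequence(self):
        """Client SEQUENCE; returns whether the reply carries the fault flag."""
        return self.backchannel_fault

    def new_session(self):
        """Client destroyed the session and created a new one."""
        self.backchannel_up = True
        self.backchannel_fault = False
        # Resend anything that was held back while the backchannel was down.
        while self.pending:
            self.send_callback(self.pending.popleft())


def run(requeue_on_failure):
    server = ToyServer(requeue_on_failure)
    server.backchannel_up = False        # backchannel timed out while idle
    server.send_callback("CB_OFFLOAD")   # copy finished, callback cannot be sent
    if server.sequence():                # lease-renewal SEQUENCE sees the fault
        server.new_session()             # client recovers with a new session
    return server.delivered
```

In this model, `run(False)` leaves the client without its CB_OFFLOAD even though session recovery succeeded, while `run(True)` delivers it after the new session is created. Whether the real retry path behaves like the second case is exactly the open question above.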