> On Aug 16, 2021, at 9:26 AM, Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote:
>
>> On Aug 12, 2021, at 2:13 PM, Timo Rothenpieler <timo@xxxxxxxxxxxxxxxx> wrote:
>>
>> On 11.08.2021 19:30, Chuck Lever III wrote:
>>>> On Aug 11, 2021, at 12:20 PM, Timo Rothenpieler <timo@xxxxxxxxxxxxxxxx> wrote:
>>>>
>>>> resulting dmesg and trace logs of both client and server are attached.
>>>>
>>>> Test procedure:
>>>>
>>>> - start tracing on client and server
>>>> - mount NFS on client
>>>> - immediately run 'xfs_io -fc "copy_range testfile" testfile.copy' (which succeeds)
>>>> - wait 10~15 minutes for the backchannel to time out (still running 5.12.19 with the fix for that reverted)
>>>> - run the xfs_io command again, which now gets stuck
>>>> - let it sit there stuck for a minute, then cancel it
>>>> - run the command again
>>>> - while it's still stuck, finish recording the logs and traces
>>>
>>> The server tries to send CB_OFFLOAD when the offloaded copy
>>> completes, but finds the backchannel transport is not connected.
>>> The server can't report the problem until the client sends a
>>> SEQUENCE operation, but there's really no other traffic going
>>> on, so it just waits.
>>>
>>> The client eventually sends a singleton SEQUENCE to renew its
>>> lease. The server replies with the SEQ4_STATUS_BACKCHANNEL_FAULT
>>> flag set at that point. The client's recovery is to destroy that
>>> session and create a new one. That appears to be successful.
>>
>> If it re-created the session and the backchannel, shouldn't that mean
>> that after I cancel the first stuck xfs_io command and try it again
>> immediately (before the backchannel has had a chance to time out
>> again), it should work?
>
> I would guess that yes, subsequent COPY_OFFLOAD requests
> should work unless the backchannel has already timed out
> again.
>
> I was about to use your reproducer myself, but a storm
> came through on Thursday and knocked out my power and
> internet. I'm still waiting for restoration.
>
> Once power is restored I can chase this a little more
> efficiently in my lab.

OK, I think the issue with this reproducer was resolved completely with
6820bf77864d.

I went back and reviewed the traces from when the client got stuck after
a long uptime. This looks very different from what we're seeing with
6820bf77864d: it involves CB_PATH_DOWN and BIND_CONN_TO_SESSION, which
is a different scenario.

Long story short, I don't think we're getting any more value out of
leaving 6820bf77864d reverted. Can you re-apply that commit on your
server? Then, when the client hangs again, please capture with:

  # trace-cmd record -e nfsd -e sunrpc -e rpcrdma

I'd like to see why the client's BIND_CONN_TO_SESSION fails to repair
the backchannel session.

--
Chuck Lever
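
If it's useful, a complete capture cycle on the server could look
roughly like the sketch below. The mount options, export path, and
output file name are only examples, not taken from the setup described
above; trace-cmd writes its raw output to trace.dat in the current
directory by default.

  # trace-cmd record -e nfsd -e sunrpc -e rpcrdma
    (on the client, reproduce the hang, e.g.:
       mount -t nfs -o vers=4.2,proto=rdma server:/export /mnt
       cd /mnt && xfs_io -fc "copy_range testfile" testfile.copy
     then stop the recording on the server with Ctrl-C)
  # trace-cmd report > server-trace.txt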