Re: Spurious instability with NFSoRDMA under moderate load

On 11.08.2021 19:30, Chuck Lever III wrote:


On Aug 11, 2021, at 12:20 PM, Timo Rothenpieler <timo@xxxxxxxxxxxxxxxx> wrote:

resulting dmesg and trace logs of both client and server are attached.

Test procedure:

- start tracing on client and server
- mount NFS on client
- immediately run 'xfs_io -fc "copy_range testfile" testfile.copy' (which succeeds)
- wait 10~15 minutes for the backchannel to time out (still running 5.12.19 with the fix for that reverted)
- run xfs_io command again, getting stuck now
- let it sit there stuck for a minute, then cancel it
- run the command again
- while it's still stuck, finish recording the logs and traces

The server tries to send CB_OFFLOAD when the offloaded copy
completes, but finds the backchannel transport is not connected.

The server can't report the problem until the client sends a
SEQUENCE operation, but there's really no other traffic going
on, so it just waits.

The client eventually sends a singleton SEQUENCE to renew its
lease. The server replies with the SEQ4_STATUS_BACKCHANNEL_FAULT
flag set at that point. Client's recovery is to destroy that
session and create a new one. That appears to be successful.
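
To make the flag check above concrete, here is a minimal Python sketch. The flag values are the sr_status_flags constants from RFC 5661. The function name and the returned strings are hypothetical, made up for illustration, and are not taken from the Linux client code.

```python
# sr_status_flags bits as defined for the SEQUENCE reply in RFC 5661.
SEQ4_STATUS_CB_PATH_DOWN = 0x00000001
SEQ4_STATUS_CB_PATH_DOWN_SESSION = 0x00000200
SEQ4_STATUS_BACKCHANNEL_FAULT = 0x00000400


def recovery_action(sr_status_flags):
    """Hypothetical helper: map a SEQUENCE reply's status flags to the
    recovery step described in this thread."""
    if sr_status_flags & SEQ4_STATUS_BACKCHANNEL_FAULT:
        # Per the analysis above: tear down the session and build a new one.
        return "destroy session, create new session"
    # Other flags are ignored in this sketch.
    return "none"
```

Note that the check only runs when a SEQUENCE reply arrives, which is why nothing happens until the client's lease renewal.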

If it re-created the session and the backchannel, shouldn't the command work when I cancel the first stuck xfs_io and immediately try it again, before the backchannel has had a chance to time out again? Because that's explicitly not the case. Once the backchannel has initially timed out, all subsequent commands get stuck, even if other work is being done on the NFS mount in parallel, and no matter how often I retry or how long I wait, either between attempts or with the command stuck.

But the server doesn't send another CB_OFFLOAD to let the client
know the copy is complete, so the client hangs.

This seems to be peculiar to COPY_OFFLOAD, but I wonder if the
other CB operations suffer from the same "failed to retransmit
after the CB path is restored" issue. It might not matter for
some of them, but for others like CB_RECALL, that could be
important.
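
The suspected "failed to retransmit" behaviour can be pictured with a toy model. This is only a sketch under assumptions: the ToyServer class and its requeue_on_restore switch are invented for illustration and do not reflect how nfsd is actually structured.

```python
from collections import deque


class ToyServer:
    """Toy model of a server that sends callbacks over a backchannel."""

    def __init__(self, requeue_on_restore):
        self.backchannel_up = True
        self.requeue_on_restore = requeue_on_restore
        self.undelivered = deque()  # callbacks that hit a dead backchannel
        self.delivered = []         # callbacks the client actually received

    def send_callback(self, op):
        if self.backchannel_up:
            self.delivered.append(op)
        else:
            # The callback cannot be sent; remember that it is outstanding.
            self.undelivered.append(op)

    def restore_backchannel(self):
        # Models the client creating a new session with a working backchannel.
        self.backchannel_up = True
        if self.requeue_on_restore:
            # Retransmit everything that failed while the path was down.
            while self.undelivered:
                self.delivered.append(self.undelivered.popleft())
```

With requeue_on_restore=False, a CB_OFFLOAD issued while the backchannel is down never reaches the client even after the session is re-created, which matches the hang described above. With requeue_on_restore=True, the callback is delivered once the path comes back.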


--
Chuck Lever





