Re: Spurious instability with NFSoRDMA under moderate load

> On Aug 12, 2021, at 2:13 PM, Timo Rothenpieler <timo@xxxxxxxxxxxxxxxx> wrote:
> 
> On 11.08.2021 19:30, Chuck Lever III wrote:
>>> On Aug 11, 2021, at 12:20 PM, Timo Rothenpieler <timo@xxxxxxxxxxxxxxxx> wrote:
>>> 
>>> resulting dmesg and trace logs of both client and server are attached.
>>> 
>>> Test procedure:
>>> 
>>> - start tracing on client and server
>>> - mount NFS on client
>>> - immediately run 'xfs_io -fc "copy_range testfile" testfile.copy' (which succeeds)
>>> - wait 10~15 minutes for the backchannel to time out (still running 5.12.19 with the fix for that reverted)
>>> - run xfs_io command again, getting stuck now
>>> - let it sit there stuck for a minute, then cancel it
>>> - run the command again
>>> - while it's still stuck, finish recording the logs and traces
>> The server tries to send CB_OFFLOAD when the offloaded copy
>> completes, but finds the backchannel transport is not connected.
>> The server can't report the problem until the client sends a
>> SEQUENCE operation, but there's really no other traffic going
>> on, so it just waits.
>> The client eventually sends a singleton SEQUENCE to renew its
>> lease. The server replies with the SEQ4_STATUS_BACKCHANNEL_FAULT
>> flag set at that point. Client's recovery is to destroy that
>> session and create a new one. That appears to be successful.
> 
> If it re-created the session and the backchannel, shouldn't that mean that after I cancel the first stuck xfs_io command and try it again immediately (before the backchannel has had a chance to time out again), it should work?

I would guess that yes, subsequent COPY_OFFLOAD requests
should work unless the backchannel has already timed out
again.
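
To make that expectation concrete, here is a rough toy model of the
sequence described above. It is only a sketch, not the nfsd code:
ToyServer, run_scenario and the retransmit_on_recovery knob are
hypothetical names made up for illustration. The flag value is the one
RFC 5661 assigns to SEQ4_STATUS_BACKCHANNEL_FAULT.

```python
from dataclasses import dataclass, field

# Flag bit as defined in RFC 5661 (NFSv4.1 SEQUENCE status flags).
SEQ4_STATUS_BACKCHANNEL_FAULT = 0x00000400


@dataclass
class ToyServer:
    """Hypothetical stand-in for the server's callback handling."""
    backchannel_up: bool = True
    retransmit_on_recovery: bool = False  # hypothetical knob, not a real option
    pending: list = field(default_factory=list)    # callbacks that could not be sent
    delivered: list = field(default_factory=list)  # callbacks the client received

    def copy_done(self, copy_id):
        # The offloaded copy finished: try to notify the client.
        cb = f"CB_OFFLOAD({copy_id})"
        if self.backchannel_up:
            self.delivered.append(cb)
        else:
            self.pending.append(cb)

    def sequence(self):
        # Status flags returned in the reply to a client SEQUENCE.
        return 0 if self.backchannel_up else SEQ4_STATUS_BACKCHANNEL_FAULT

    def new_session(self):
        # The client destroyed the old session and created a new one.
        self.backchannel_up = True
        if self.retransmit_on_recovery:
            self.delivered.extend(self.pending)
            self.pending.clear()


def run_scenario(retransmit_on_recovery):
    """Return True if the client ever sees the CB_OFFLOAD."""
    server = ToyServer(retransmit_on_recovery=retransmit_on_recovery)
    server.backchannel_up = False      # backchannel has timed out
    server.copy_done("copy-1")         # CB_OFFLOAD cannot be sent
    flags = server.sequence()          # client's lease-renewal SEQUENCE
    if flags & SEQ4_STATUS_BACKCHANNEL_FAULT:
        server.new_session()           # client recovers the session
    return "CB_OFFLOAD(copy-1)" in server.delivered


if __name__ == "__main__":
    print("without retransmit:", run_scenario(False))
    print("with retransmit:   ", run_scenario(True))
```

Without a retransmit step, the callback for the first copy is never
delivered even though the session is healthy again, which would match
the first hang. It does not explain why later commands also get stuck.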

I was about to use your reproducer myself, but a storm
came through on Thursday and knocked out my power and
internet. I'm still waiting for restoration.

Once power is restored I can chase this a little more
efficiently in my lab.


> Because that's explicitly not the case: once the backchannel initially times out, all subsequent commands get stuck, even if other work is being done on the NFS mount in parallel, and no matter how often I retry, how long I wait in between, or how long I leave it stuck.
> 
>> But the server doesn't send another CB_OFFLOAD to let the client
>> know the copy is complete, so the client hangs.
>> This seems to be peculiar to COPY_OFFLOAD, but I wonder if the
>> other CB operations suffer from the same "failed to retransmit
>> after the CB path is restored" issue. It might not matter for
>> some of them, but for others like CB_RECALL, that could be
>> important.
>> --
>> Chuck Lever

--
Chuck Lever






