Re: Spurious instability with NFSoRDMA under moderate load

Chuck Lever III <chuck.lever@xxxxxxxxxx> · Fri, 20 Aug 2021 15:12:04 +0000

> On Aug 16, 2021, at 9:26 AM, Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote:
> 
>> On Aug 12, 2021, at 2:13 PM, Timo Rothenpieler <timo@xxxxxxxxxxxxxxxx> wrote:
>> 
>> On 11.08.2021 19:30, Chuck Lever III wrote:
>>>> On Aug 11, 2021, at 12:20 PM, Timo Rothenpieler <timo@xxxxxxxxxxxxxxxx> wrote:
>>>> 
>>>> resulting dmesg and trace logs of both client and server are attached.
>>>> 
>>>> Test procedure:
>>>> 
>>>> - start tracing on client and server
>>>> - mount NFS on client
>>>> - immediately run 'xfs_io -fc "copy_range testfile" testfile.copy' (which succeeds)
>>>> - wait 10~15 minutes for the backchannel to time out (still running 5.12.19 with the fix for that reverted)
>>>> - run xfs_io command again, getting stuck now
>>>> - let it sit there stuck for a minute, then cancel it
>>>> - run the command again
>>>> - while it's still stuck, finish recording the logs and traces
>>> The server tries to send CB_OFFLOAD when the offloaded copy
>>> completes, but finds the backchannel transport is not connected.
>>> The server can't report the problem until the client sends a
>>> SEQUENCE operation, but there's really no other traffic going
>>> on, so it just waits.
>>> The client eventually sends a singleton SEQUENCE to renew its
>>> lease. The server replies with the SEQ4_STATUS_BACKCHANNEL_FAULT
>>> flag set at that point. Client's recovery is to destroy that
>>> session and create a new one. That appears to be successful.
>> 
>> If it re-created the session and the backchannel, shouldn't that mean that after I cancel the first stuck xfs_io command and immediately try it again (before the backchannel has had a chance to time out again), it should work?
> 
> I would guess that yes, subsequent COPY_OFFLOAD requests
> should work unless the backchannel has already timed out
> again.
> 
> I was about to use your reproducer myself, but a storm
> came through on Thursday and knocked out my power and
> internet. I'm still waiting for restoration.
> 
> Once power is restored I can chase this a little more
> efficiently in my lab.

OK, I think the issue with this reproducer was resolved
completely with 6820bf77864d.

I went back and reviewed the traces from when the client got
stuck after a long uptime. This looks very different from
what we're seeing with 6820bf77864d. It involves CB_PATH_DOWN
and BIND_CONN_TO_SESSION, which is a different scenario. Long
story short, I don't think we're getting any more value by
leaving 6820bf77864d reverted.
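
For anyone following along, here is a rough sketch (not kernel code) of
how the backchannel-related bits in a SEQUENCE reply's sr_status_flags
can be decoded. The bit values are the ones defined in RFC 5661. The
helper name decode_sr_status_flags is made up for illustration, and I'm
assuming the CB_PATH_DOWN above corresponds to the
SEQ4_STATUS_CB_PATH_DOWN flag.

```python
# Illustrative sketch only: decode the backchannel-related bits of the
# sr_status_flags word returned in an NFSv4.1 SEQUENCE reply.
# Bit values are from RFC 5661; the function name is hypothetical.
BACKCHANNEL_FLAGS = {
    0x00000001: "SEQ4_STATUS_CB_PATH_DOWN",
    0x00000200: "SEQ4_STATUS_CB_PATH_DOWN_SESSION",
    0x00000400: "SEQ4_STATUS_BACKCHANNEL_FAULT",
}


def decode_sr_status_flags(flags):
    """Return the names of the backchannel-related bits set in flags.

    Any other bits that are set are reported together as one
    "other:0x..." entry, since this sketch does not name them.
    """
    names = [name for bit, name in sorted(BACKCHANNEL_FLAGS.items())
             if flags & bit]
    known = 0
    for bit in BACKCHANNEL_FLAGS:
        known |= bit
    other = flags & ~known
    if other:
        names.append(f"other:0x{other:08x}")
    return names


if __name__ == "__main__":
    # A reply carrying only the backchannel fault bit.
    print(decode_sr_status_flags(0x00000400))
```

Feeding it the flags value seen in a SEQUENCE reply gives a quick way to
tell a backchannel fault apart from a callback path that is down.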

Can you re-apply that commit on your server, and then when
the client hangs again, please capture a trace with:

# trace-cmd record -e nfsd -e sunrpc -e rpcrdma

I'd like to see why the client's BIND_CONN_TO_SESSION fails
to repair the backchannel session.

--
Chuck Lever