Re: Spurious instability with NFSoRDMA under moderate load

Chuck Lever III <chuck.lever@xxxxxxxxxx> · Wed, 11 Aug 2021 20:01:30 +0000

> On Aug 11, 2021, at 3:46 PM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
> 
> On Wed, Aug 11, 2021 at 2:52 PM Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote:
>> 
>> 
>> 
>>> On Aug 11, 2021, at 2:38 PM, Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
>>> 
>>> On Wed, Aug 11, 2021 at 1:30 PM Chuck Lever III <chuck.lever@xxxxxxxxxx> wrote:
>>>> 
>>>> 
>>>> 
>>>>> On Aug 11, 2021, at 12:20 PM, Timo Rothenpieler <timo@xxxxxxxxxxxxxxxx> wrote:
>>>>> 
>>>>> resulting dmesg and trace logs of both client and server are attached.
>>>>> 
>>>>> Test procedure:
>>>>> 
>>>>> - start tracing on client and server
>>>>> - mount NFS on client
>>>>> - immediately run 'xfs_io -fc "copy_range testfile" testfile.copy' (which succeeds)
>>>>> - wait 10~15 minutes for the backchannel to time out (still running 5.12.19 with the fix for that reverted)
>>>>> - run xfs_io command again, getting stuck now
>>>>> - let it sit there stuck for a minute, then cancel it
>>>>> - run the command again
>>>>> - while it's still stuck, finished recording the logs and traces
>>>> 
>>>> The server tries to send CB_OFFLOAD when the offloaded copy
>>>> completes, but finds the backchannel transport is not connected.
>>>> 
>>>> The server can't report the problem until the client sends a
>>>> SEQUENCE operation, but there's really no other traffic going
>>>> on, so it just waits.
>>>> 
>>>> The client eventually sends a singleton SEQUENCE to renew its
>>>> lease. The server replies with the SEQ4_STATUS_BACKCHANNEL_FAULT
>>>> flag set at that point. Client's recovery is to destroy that
>>>> session and create a new one. That appears to be successful.
>>>> 
>>>> But the server doesn't send another CB_OFFLOAD to let the client
>>>> know the copy is complete, so the client hangs.
>>>> 
>>>> This seems to be peculiar to COPY_OFFLOAD, but I wonder if the
>>>> other CB operations suffer from the same "failed to retransmit
>>>> after the CB path is restored" issue. It might not matter for
>>>> some of them, but for others like CB_RECALL, that could be
>>>> important.
>>> 
>>> Thank you for the analysis Chuck (btw I haven't seen any attachments
>>> with Timo's posts so I'm assuming some offline communication must have
>>> happened).
>>> ?
>>> I'm looking at the code and wouldn't the mentioned flags be set on the
>>> CB_SEQUENCE operation?
>> 
>> CB_SEQUENCE is sent from server to client, and that can't work if
>> the callback channel is down.
>> 
>> So the server waits for the client to send a SEQUENCE and it sets
>> the SEQ4_STATUS_BACKCHANNEL_FAULT in its reply.
> 
> yes scratch that, this is for when CB_SEQUENCE has it in its reply.
> 
>>> nfsd4_cb_done() has code to mark the channel
>>> and retry (or another way of saying this, this code should generically
>>> handle retrying whatever operation it is be it CB_OFFLOAD or
>>> CB_RECALL)?
>> 
>> cb_done() marks the callback fault, but as far as I can tell the
>> RPC is terminated at that point and there is no subsequent retry.
>> The RPC_TASK flags on the CB_OFFLOAD operation cause that RPC to
>> fail immediately if there's no connection.
>> 
>> And in the BACKCHANNEL_FAULT case, the bc_xprt is destroyed as
>> part of recovery. I think that would kill all pending RPC tasks.
>> 
>> 
>>> Is that not working (not sure if this is  a question or a
>>> statement).... I would think that would be the place to handle this
>>> problem.
>> 
>> IMHO the OFFLOAD code needs to note that the CB_OFFLOAD RPC
>> failed and then try the call again once the new backchannel is
>> available.
> 
> I still argue that this needs to be done generically not per operation
> as CB_RECALL has the same problem.

Probably not just CB_RECALL, but agreed, there doesn't seem to
be any mechanism that can re-drive callback operations when the
backchannel is replaced.

> If CB_RECALL call is never
> answered, rpc would fail with ETIMEOUT.

Well we have a mechanism already to deal with that case, which is
delegation revocation. That's not optimal, but at least it won't
leave the client hanging forever.

> nfsd4_cb_recall_done() returns
> 1 back to the nfsd4_cb_done() which then based on the rpc task status
> would set the callback path down but will do nothing else.
> nfsd4_cb_offload_one() just always returns 1.
> 
> Now given that you say during the backchannel fault case, we'll have
> to destroy that backchannel and create a new one.

BACKCHANNEL_FAULT is the particular case that Timo's reproducer
exercises. I haven't looked closely at the CB_PATH_DOWN and
CB_PATH_DOWN_SESSION cases, which use BIND_CONN_TO_SESSION,
to see if those also destroy the backchannel transport.

However I suspect they are just as broken. There isn't much
regression testing around these scenarios.

> Are we losing all those RPCs that flagged it as faulty?

I believe the RPC tasks are already gone at that point. I'm
just saying that /hypothetically/ if the RPCs were not set up
to exit immediately, that wouldn't matter because the transport
and client are destroyed as part of restoring the backchannel.

Perhaps instead of destroying the rpc_clnt, the server should
retain it and simply recycle the underlying transport data
structures.

> At this point, nothing obvious
> comes to mind how to engineer the recovery but it should be done.

Well I think we need to:

1. Review the specs to see if there's any indication of how to
recover from this situation

2. Perhaps construct a way to move CB RPCs to a workqueue so
they can be retried indefinitely (or figure out when they should
stop being retried: COPY_OFFLOAD_CANCEL comes to mind).

3. Keep an eye peeled for cases where a malicious or broken
client could cause CB operations to tie up all the server's
nfsd threads or other resources.

--
Chuck Lever