> On Jan 3, 2024, at 3:16 PM, Benjamin Coddington <bcodding@xxxxxxxxxx> wrote:
> 
> On 3 Jan 2024, at 14:12, Chuck Lever III wrote:
> 
>>> On Jan 3, 2024, at 1:47 PM, Benjamin Coddington <bcodding@xxxxxxxxxx> wrote:
>>> 
>>> This looks like it started out as the problem I've been sending patches to
>>> fix on 6.7, latest here:
>>> https://lore.kernel.org/linux-nfs/e28038fba1243f00b0dd66b7c5296a1e181645ea.1702496910.git.bcodding@xxxxxxxxxx/
>>> 
>>> .. however whenever I encounter the issue, the client reconnects the
>>> transport again - so I think there might be an additional problem here.
>> 
>> I'm looking at the same problem as you, Ben. It doesn't seem to be
>> similar to what Jeff reports.
>> 
>> But I'm wondering if gerry-rigging the timeouts is the right answer
>> for backchannel replies. The problem, fundamentally, is that when a
>> forechannel RPC task holds the transport lock, the backchannel's reply
>> transmit path thinks that means the transport connection is down and
>> triggers a transport disconnect.
> 
> Why shouldn't backchannel replies have normal timeout values?

RPC Replies are "send and forget". The server forechannel sends its
Replies without a timeout. There is no such thing as a retransmitted
RPC Reply (though a reliable transport might retransmit portions of it,
the RPC server itself is not aware of that).

And I don't see anything in the client's backchannel path that makes me
think there's a different protocol-level requirement in the
backchannel.

>> The use of ETIMEDOUT in call_bc_transmit_status() is... not especially
>> clear.
> 
> Seems like it should mean that the reply couldn't be sent within (what
> should be) the timeout values for the client's state management transport.

It looks to me like backchannel replies are HARD, not SOFT. That
suggests call_bc_transmit_status() is not expecting ETIMEDOUT to mean
"request timed out", which is the usual meaning of ETIMEDOUT -- and
that's only for SOFT RPCs.

See also 762e4e67b356 ("SUNRPC: Refactor RPC call encoding"), which
added RPC_TASK_NO_RETRANS_TIMEOUT to rpc_run_bc_task().

So my "read" of the code is that there currently is no timeout set, and
ETIMEDOUT isn't expected to mean the send timed out. This code does not
rely on a timeout to send replies at all. It seems to be reusing the
ETIMEDOUT error that was set as a side-effect of something else.

> I'm glad you're seeing this problem too. I was worried that something was
> seriously different about my test setup.

For the public record...

Testing with NFSv4.2, I see the client doing async COPY operations, but
then not replying to the server's CB_OFFLOAD notifications. The client
is not actually dropping the CB_OFFLOADs, though; it processes them but
then never sends the reply, because xprt_prepare_transmit() fails to
lock the transport when sending the reply. The sleep_on_timeout in
xprt_reserve_xprt() is falling through because, as you observed, the
rqst timeout settings are all zero.

So the client has processed the CB_OFFLOAD and advanced the slot
sequence number, but because the server doesn't see a reply, it leaves
its copy of the slot sequence number alone. The server sends the next
CB_OFFLOAD, and the client replies with NFS4ERR_RETRY_UNCACHED_REP
because the slot sequence numbers no longer match. The client tosses
that CB_OFFLOAD notification. The server doesn't retransmit it after
the new session is created, because DESTROY_SESSION kills the
backchannel rpc_clnt and all its tasks.
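To make that slot bookkeeping concrete, here's a toy userspace model
(not kernel code; the struct, function, and main() scenario are
invented for illustration) of the replier-side check that, as I read
the RFC 8881 session rules, produces NFS4ERR_RETRY_UNCACHED_REP when a
request is retried on a slot whose reply was never cached:

/*
 * Toy userspace model (not kernel code) of the replier-side slot
 * check described above.  All names are invented for illustration.
 */
#include <stdbool.h>
#include <stdio.h>

#define NFS4ERR_SEQ_MISORDERED		10063
#define NFS4ERR_RETRY_UNCACHED_REP	10068

struct cb_slot {
	unsigned int	seq_nr;		/* sequence ID of last request executed */
	bool		reply_cached;	/* was the reply to that request cached? */
};

/* Decide how the replier (here, the NFS client's backchannel) handles
 * a request arriving on @slot with sequence ID @seq_nr. */
static int cb_sequence_check(struct cb_slot *slot, unsigned int seq_nr)
{
	if (seq_nr == slot->seq_nr + 1) {
		/* A new request: execute it and advance the slot. */
		slot->seq_nr = seq_nr;
		slot->reply_cached = false;	/* reply never sent, never cached */
		return 0;
	}
	if (seq_nr == slot->seq_nr) {
		/* A retry of a request we already executed. */
		if (slot->reply_cached)
			return 0;		/* replay the cached reply */
		return NFS4ERR_RETRY_UNCACHED_REP;
	}
	return NFS4ERR_SEQ_MISORDERED;
}

int main(void)
{
	struct cb_slot slot = { .seq_nr = 41, .reply_cached = false };

	/* CB_OFFLOAD with sequence ID 42 is executed, but the reply is
	 * lost because the transport could not be locked. */
	printf("first CB_OFFLOAD:  %d\n", cb_sequence_check(&slot, 42));

	/* The server never saw a reply, so it sends on that slot again
	 * with sequence ID 42; the client sees a retry with no cached
	 * reply. */
	printf("second CB_OFFLOAD: %d\n", cb_sequence_check(&slot, 42));
	return 0;
}

The second call in main() is the retried CB_OFFLOAD that the client
ends up tossing in the scenario above.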
Or maybe there's some other bug in there, but I think the server should
retransmit, and it doesn't. So the client copy operation is stuck
waiting for a CB_OFFLOAD notification that never arrives. The client
could send a COPY_NOTIFY to check on the pending COPY, but it doesn't.

Certainly the client expects a retransmit. See 3832591e6fa5 ("SUNRPC:
Handle connection issues correctly on the back channel"). But I'm not
sure NFSD ever did this -- I will look into addressing that.

But also, we don't want the client spuriously disconnecting and
recycling its sessions just because the write lock is held by another
task right at that moment. That's an enormous bubble in the throughput
pipeline, and the problem seems to be worse when the forechannel is
under load. I think transmitting backchannel replies needs to be more
reliable.

--
Chuck Lever