On 3 Jan 2024, at 16:46, Chuck Lever III wrote:

>> On Jan 3, 2024, at 3:16 PM, Benjamin Coddington <bcodding@xxxxxxxxxx> wrote:
>>
>> On 3 Jan 2024, at 14:12, Chuck Lever III wrote:
>>
>>>> On Jan 3, 2024, at 1:47 PM, Benjamin Coddington <bcodding@xxxxxxxxxx> wrote:
>>>>
>>>> This looks like it started out as the problem I've been sending patches to
>>>> fix on 6.7, latest here:
>>>> https://lore.kernel.org/linux-nfs/e28038fba1243f00b0dd66b7c5296a1e181645ea.1702496910.git.bcodding@xxxxxxxxxx/
>>>>
>>>> .. however whenever I encounter the issue, the client reconnects the
>>>> transport again - so I think there might be an additional problem here.
>>>
>>> I'm looking at the same problem as you, Ben. It doesn't seem to be
>>> similar to what Jeff reports.
>>>
>>> But I'm wondering if gerry-rigging the timeouts is the right answer
>>> for backchannel replies. The problem, fundamentally, is that when a
>>> forechannel RPC task holds the transport lock, the backchannel's reply
>>> transmit path thinks that means the transport connection is down and
>>> triggers a transport disconnect.
>>
>> Why shouldn't backchannel replies have normal timeout values?
>
> RPC Replies are "send and forget". The server forechannel sends
> its Replies without a timeout. There is no such thing as a
> retransmitted RPC Reply (though a reliable transport might
> retransmit portions of it, the RPC server itself is not aware of
> that).
>
> And I don't see anything in the client's backchannel path that
> makes me think there's a different protocol-level requirement
> in the backchannel.

It's not strictly a protocol thing: the timeouts are used to decide what to
do with a req, or to flag the transport state, even when the request doesn't
make it to the wire. That's why the zero timeout values for this req
improperly reset the transport.

Ben