Re: hangs during fstests testing with TLS

Chuck Lever <chuck.lever@xxxxxxxxxx> · Thu, 4 Jan 2024 09:16:31 -0500

On Thu, Jan 04, 2024 at 07:22:18AM -0500, Benjamin Coddington wrote:
> On 3 Jan 2024, at 16:46, Chuck Lever III wrote:
> 
> >> On Jan 3, 2024, at 3:16 PM, Benjamin Coddington <bcodding@xxxxxxxxxx> wrote:
> >>
> >> On 3 Jan 2024, at 14:12, Chuck Lever III wrote:
> >>
> >>>> On Jan 3, 2024, at 1:47 PM, Benjamin Coddington <bcodding@xxxxxxxxxx> wrote:
> >>>>
> >>>> This looks like it started out as the problem I've been sending patches to
> >>>> fix on 6.7, latest here:
> >>>> https://lore.kernel.org/linux-nfs/e28038fba1243f00b0dd66b7c5296a1e181645ea.1702496910.git.bcodding@xxxxxxxxxx/
> >>>>
> >>>> .. however whenever I encounter the issue, the client reconnects the
> >>>> transport again - so I think there might be an additional problem here.
> >>>
> >>> I'm looking at the same problem as you, Ben. It doesn't seem to be
> >>> similar to what Jeff reports.
> >>>
> >>> But I'm wondering if gerry-rigging the timeouts is the right answer
> >>> for backchannel replies. The problem, fundamentally, is that when a
> >>> forechannel RPC task holds the transport lock, the backchannel's reply
> >>> transmit path thinks that means the transport connection is down and
> >>> triggers a transport disconnect.
> >>
> >> Why shouldn't backchannel replies have normal timeout values?
> >
> > RPC Replies are "send and forget". The server forechannel sends
> > its Replies without a timeout. There is no such thing as a
> > retransmitted RPC Reply (though a reliable transport might
> > retransmit portions of it, the RPC server itself is not aware of
> > that).
> >
> > And I don't see anything in the client's backchannel path that
> > makes me think there's a different protocol-level requirement
> > in the backchannel.
> 
> Its not strictly a protocol thing, the timeouts are used to decide what to
> do with a req or flag the transport state even if the request doesn't make
> it to the wire.  That's why the zero timeout values for this req improperly
> resets the transport.

I guess I'm harping on this a bit because forechannel v. backchannel
is already confusing enough. The use of timeouts for RPC Replies is
just heaping on to that confusion.

If we're going to keep an explicit timeout when sending the Reply,
it should have a little documentation. I suggest adding this to
xprt_init_bc_request() before the new call to
xprt_init_majortimeo():

	/*
	 * Backchannel Replies are sent with !RPC_TASK_SOFT and
	 * RPC_TASK_NO_RETRANS_TIMEOUT. The major timeout setting
	 * affects only how long each Reply waits to be sent when
	 * a transport connection cannot be established.
	 */
	xprt_init_majortimeo(task, req, ...

HTH

-- 
Chuck Lever