Re: hangs during fstests testing with TLS

Jeff Layton <jlayton@xxxxxxxxxx> · Thu, 04 Jan 2024 08:42:25 -0500

On Thu, 2024-01-04 at 07:22 -0500, Benjamin Coddington wrote:
> On 3 Jan 2024, at 16:46, Chuck Lever III wrote:
> 
> > > On Jan 3, 2024, at 3:16 PM, Benjamin Coddington <bcodding@xxxxxxxxxx> wrote:
> > > 
> > > On 3 Jan 2024, at 14:12, Chuck Lever III wrote:
> > > 
> > > > > On Jan 3, 2024, at 1:47 PM, Benjamin Coddington <bcodding@xxxxxxxxxx> wrote:
> > > > > 
> > > > > This looks like it started out as the problem I've been sending patches to
> > > > > fix on 6.7, latest here:
> > > > > https://lore.kernel.org/linux-nfs/e28038fba1243f00b0dd66b7c5296a1e181645ea.1702496910.git.bcodding@xxxxxxxxxx/
> > > > > 
> > > > > However, whenever I encounter the issue, the client reconnects the
> > > > > transport again - so I think there might be an additional problem here.
> > > > 
> > > > I'm looking at the same problem as you, Ben. It doesn't seem to be
> > > > similar to what Jeff reports.
> > > > 
> > > > But I'm wondering if jury-rigging the timeouts is the right answer
> > > > for backchannel replies. The problem, fundamentally, is that when a
> > > > forechannel RPC task holds the transport lock, the backchannel's reply
> > > > transmit path thinks that means the transport connection is down and
> > > > triggers a transport disconnect.
> > > 
> > > Why shouldn't backchannel replies have normal timeout values?
> > 
> > RPC Replies are "send and forget". The server forechannel sends
> > its Replies without a timeout. There is no such thing as a
> > retransmitted RPC Reply (though a reliable transport might
> > retransmit portions of it, the RPC server itself is not aware of
> > that).
> > 
> > And I don't see anything in the client's backchannel path that
> > makes me think there's a different protocol-level requirement
> > in the backchannel.
> 
> It's not strictly a protocol thing; the timeouts are used to decide what to
> do with a req or flag the transport state even if the request doesn't make
> it to the wire.  That's why the zero timeout values for this req improperly
> reset the transport.
> 
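To illustrate the failure mode described above, here is a toy model
in Python. It is only a sketch under stated assumptions, not the kernel's
SUNRPC code: the Transport class, try_send_reply(), and the tick-based
timer are all hypothetical names invented for this example. It shows how
a reply with a zero timeout that finds the send lock held by a forechannel
task can be treated as "timed out" immediately and tear down a healthy
connection, while the same reply with a nonzero timeout simply waits.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Transport:
    """Toy stand-in for an RPC transport (hypothetical, not struct rpc_xprt)."""
    lock_holder: Optional[str] = None  # task currently holding the send lock
    connected: bool = True
    disconnects: int = 0

    def force_disconnect(self) -> None:
        # Model of a transport reset triggered by an expired request timer.
        self.connected = False
        self.disconnects += 1


def try_send_reply(xprt: Transport, task: str,
                   timeout_ticks: int, lock_free_after: int) -> str:
    """Try to send a reply, waiting up to timeout_ticks for the send lock.

    lock_free_after is an assumed number of ticks until the current
    lock holder (e.g. a forechannel task) releases the lock.
    """
    waited = 0
    while xprt.lock_holder is not None and xprt.lock_holder != task:
        if waited >= timeout_ticks:
            # The request never reached the wire, yet the expired timer
            # is taken as evidence that the connection is down.
            xprt.force_disconnect()
            return "disconnected"
        waited += 1
        if waited >= lock_free_after:
            xprt.lock_holder = None  # holder finishes and drops the lock
    xprt.lock_holder = task
    # ... the reply would be transmitted here ...
    xprt.lock_holder = None
    return "sent"


# Zero timeout: the held lock is misread as a dead connection.
busy = Transport(lock_holder="forechannel")
zero_result = try_send_reply(busy, "backchannel", timeout_ticks=0,
                             lock_free_after=2)

# Nonzero timeout: the reply waits for the lock and goes out normally.
busy2 = Transport(lock_holder="forechannel")
normal_result = try_send_reply(busy2, "backchannel", timeout_ticks=5,
                               lock_free_after=2)
```

In this model the only difference between the two calls is the timeout
value; the connection itself is equally healthy in both cases.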

FWIW, Ben's v3 patchset seems to fix the problem for me. I was able to
run 3 loops of fstests with it in place, whereas before I couldn't even
make it through one full run without a hang.

I'm happy to test v4 when you're ready.
-- 
Jeff Layton <jlayton@xxxxxxxxxx>