Re: hangs during fstests testing with TLS

Chuck Lever III <chuck.lever@xxxxxxxxxx> · Wed, 3 Jan 2024 19:12:42 +0000

> On Jan 3, 2024, at 1:47 PM, Benjamin Coddington <bcodding@xxxxxxxxxx> wrote:
> 
> This looks like it started out as the problem I've been sending patches to
> fix on 6.7, latest here:
> https://lore.kernel.org/linux-nfs/e28038fba1243f00b0dd66b7c5296a1e181645ea.1702496910.git.bcodding@xxxxxxxxxx/
> 
> .. however whenever I encounter the issue, the client reconnects the
> transport again - so I think there might be an additional problem here.

I'm looking at the same problem as you, Ben. It doesn't seem to be
similar to what Jeff reports.

But I'm wondering if jury-rigging the timeouts is the right answer
for backchannel replies. The problem, fundamentally, is that when a
forechannel RPC task holds the transport lock, the backchannel's reply
transmit path thinks that means the transport connection is down and
triggers a transport disconnect.

The use of ETIMEDOUT in call_bc_transmit_status() is... not especially
clear.

NFSD's backchannel client has its own set of quirks that make this
situation worse. For example, a reply transmit failure on the client
will screw up the one backchannel session slot, because the client will
have advanced the slot sequence number, but the server will never see
the reply to tell it to do the same.
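
To make the slot problem concrete, here is a rough user-space sketch of
the two ends of a single backchannel slot falling out of step. This is
not kernel code: struct bc_slot, replier_check(), do_callback() and
demo_lost_reply() are names made up for illustration, and the model
assumes the usual session-slot rule (last seqid + 1 is a new request,
the same seqid is a retransmission). How NFSD actually reacts afterwards
is not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one end's view of a backchannel session slot. */
struct bc_slot {
	uint32_t seqid;		/* last sequence number this end believes completed */
};

enum slot_result { SLOT_NEW, SLOT_REPLAY, SLOT_MISORDERED };

/* Replier side (the NFS client, which serves callbacks). */
static enum slot_result replier_check(struct bc_slot *slot, uint32_t seqid)
{
	if (seqid == slot->seqid + 1) {
		slot->seqid = seqid;	/* replier advances when it executes the call */
		return SLOT_NEW;
	}
	if (seqid == slot->seqid)
		return SLOT_REPLAY;	/* looks like a retransmission */
	return SLOT_MISORDERED;
}

/*
 * Requester side (the NFS server, which sends callbacks). It only
 * advances its own copy of the slot when a reply to a new request
 * actually arrives.
 */
static enum slot_result do_callback(struct bc_slot *requester,
				    struct bc_slot *replier,
				    bool reply_delivered)
{
	uint32_t seqid = requester->seqid + 1;
	enum slot_result res = replier_check(replier, seqid);

	if (reply_delivered && res == SLOT_NEW)
		requester->seqid = seqid;
	return res;
}

/* Returns how the replier classifies the callback that follows a lost reply. */
static enum slot_result demo_lost_reply(void)
{
	struct bc_slot server = { .seqid = 0 };
	struct bc_slot client = { .seqid = 0 };

	/* First callback is executed, but the reply transmit fails. */
	enum slot_result first = do_callback(&server, &client, false);
	assert(first == SLOT_NEW);

	/* Server never advanced, so its next callback reuses the same seqid. */
	return do_callback(&server, &client, true);
}
```

In this model the second, unrelated callback is classified as a replay
rather than a new request, because only the client's copy of the slot
moved forward. With a single backchannel slot there is nothing else to
fall back on.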

Jeff says:
> I did turn up all of the sunrpc and NFS client tracepoints, but saw no
> output whatsoever. I also tried turning up all of the dprintk's but that
> also showed nothing.

Try boosting tlshd debugging for the whole test.

Also, the server might be doing something weird, so turn up debugging
there too.

--
Chuck Lever