> On Jan 3, 2024, at 1:47 PM, Benjamin Coddington <bcodding@xxxxxxxxxx> wrote:
>
> This looks like it started out as the problem I've been sending patches to
> fix on 6.7, latest here:
> https://lore.kernel.org/linux-nfs/e28038fba1243f00b0dd66b7c5296a1e181645ea.1702496910.git.bcodding@xxxxxxxxxx/
>
> .. however whenever I encounter the issue, the client reconnects the
> transport again - so I think there might be an additional problem here.

I'm looking at the same problem as you, Ben. It doesn't seem to be
similar to what Jeff reports.

But I'm wondering if jury-rigging the timeouts is the right answer
for backchannel replies. The problem, fundamentally, is that when a
forechannel RPC task holds the transport lock, the backchannel's reply
transmit path thinks that means the transport connection is down and
triggers a transport disconnect.

The use of ETIMEDOUT in call_bc_transmit_status() is... not especially
clear.

NFSD's backchannel client has its own set of quirks that make this
situation worse. For example, a reply transmit failure on the client
will screw up the one backchannel session slot, because the client
will have advanced the slot sequence number, but the server will never
see the reply to tell it to do the same.

Jeff says:

> I did turn up all of the sunrpc and NFS client tracepoints, but saw no
> output whatsoever. I also tried turning up all of the dprintk's but that
> also showed nothing.

Try boosting tlshd debugging for the whole test. Also, the server might
be doing something weird, so turn up debugging there too.

--
Chuck Lever