On Thu, 2024-01-04 at 07:22 -0500, Benjamin Coddington wrote:
> On 3 Jan 2024, at 16:46, Chuck Lever III wrote:
>
> > > On Jan 3, 2024, at 3:16 PM, Benjamin Coddington <bcodding@xxxxxxxxxx> wrote:
> > >
> > > On 3 Jan 2024, at 14:12, Chuck Lever III wrote:
> > >
> > > > > On Jan 3, 2024, at 1:47 PM, Benjamin Coddington <bcodding@xxxxxxxxxx> wrote:
> > > > >
> > > > > This looks like it started out as the problem I've been sending patches to
> > > > > fix on 6.7, latest here:
> > > > > https://lore.kernel.org/linux-nfs/e28038fba1243f00b0dd66b7c5296a1e181645ea.1702496910.git.bcodding@xxxxxxxxxx/
> > > > >
> > > > > .. however whenever I encounter the issue, the client reconnects the
> > > > > transport again - so I think there might be an additional problem here.
> > > >
> > > > I'm looking at the same problem as you, Ben. It doesn't seem to be
> > > > similar to what Jeff reports.
> > > >
> > > > But I'm wondering if jury-rigging the timeouts is the right answer
> > > > for backchannel replies. The problem, fundamentally, is that when a
> > > > forechannel RPC task holds the transport lock, the backchannel's reply
> > > > transmit path thinks that means the transport connection is down and
> > > > triggers a transport disconnect.
> > >
> > > Why shouldn't backchannel replies have normal timeout values?
> >
> > RPC Replies are "send and forget". The server forechannel sends
> > its Replies without a timeout. There is no such thing as a
> > retransmitted RPC Reply (though a reliable transport might
> > retransmit portions of it, the RPC server itself is not aware of
> > that).
> >
> > And I don't see anything in the client's backchannel path that
> > makes me think there's a different protocol-level requirement
> > in the backchannel.
>
> It's not strictly a protocol thing, the timeouts are used to decide what to
> do with a req or flag the transport state even if the request doesn't make
> it to the wire.  That's why the zero timeout values for this req improperly
> reset the transport.
>

FWIW, Ben's v3 patchset seems to fix the problem for me, and I was able
to run 3 loops of fstests with it in place, whereas before I couldn't
even make it through one full run without it hanging.

I'm happy to test v4 when you're ready.
-- 
Jeff Layton <jlayton@xxxxxxxxxx>