On 3 Jan 2024, at 16:46, Chuck Lever III wrote:

>> On Jan 3, 2024, at 3:16 PM, Benjamin Coddington <bcodding@xxxxxxxxxx> wrote:
>>
>> On 3 Jan 2024, at 14:12, Chuck Lever III wrote:
>>
>>>> On Jan 3, 2024, at 1:47 PM, Benjamin Coddington <bcodding@xxxxxxxxxx> wrote:
>>>>
>>>> This looks like it started out as the problem I've been sending patches to
>>>> fix on 6.7, latest here:
>>>> https://lore.kernel.org/linux-nfs/e28038fba1243f00b0dd66b7c5296a1e181645ea.1702496910.git.bcodding@xxxxxxxxxx/
>>>>
>>>> .. however whenever I encounter the issue, the client reconnects the
>>>> transport again - so I think there might be an additional problem here.
>>>
>>> I'm looking at the same problem as you, Ben. It doesn't seem to be
>>> similar to what Jeff reports.
>>>
>>> But I'm wondering if gerry-rigging the timeouts is the right answer
>>> for backchannel replies. The problem, fundamentally, is that when a
>>> forechannel RPC task holds the transport lock, the backchannel's reply
>>> transmit path thinks that means the transport connection is down and
>>> triggers a transport disconnect.
>>
>> Why shouldn't backchannel replies have normal timeout values?
>
> RPC Replies are "send and forget". The server forechannel sends
> its Replies without a timeout. There is no such thing as a
> retransmitted RPC Reply (though a reliable transport might
> retransmit portions of it, the RPC server itself is not aware of
> that).
>
> And I don't see anything in the client's backchannel path that
> makes me think there's a different protocol-level requirement
> in the backchannel.

It's not strictly a protocol thing: the timeouts are used to decide what to
do with a req, or to flag the transport state, even when the request doesn't
make it to the wire. That's why the zero timeout values for this req
improperly reset the transport.

Ben