On 4 Jan 2024, at 9:16, Chuck Lever wrote: > On Thu, Jan 04, 2024 at 07:22:18AM -0500, Benjamin Coddington wrote: >> On 3 Jan 2024, at 16:46, Chuck Lever III wrote: >> >>>> On Jan 3, 2024, at 3:16 PM, Benjamin Coddington <bcodding@xxxxxxxxxx> wrote: >>>> >>>> On 3 Jan 2024, at 14:12, Chuck Lever III wrote: >>>> >>>>>> On Jan 3, 2024, at 1:47 PM, Benjamin Coddington <bcodding@xxxxxxxxxx> wrote: >>>>>> >>>>>> This looks like it started out as the problem I've been sending patches to >>>>>> fix on 6.7, latest here: >>>>>> https://lore.kernel.org/linux-nfs/e28038fba1243f00b0dd66b7c5296a1e181645ea.1702496910.git.bcodding@xxxxxxxxxx/ >>>>>> >>>>>> .. however whenever I encounter the issue, the client reconnects the >>>>>> transport again - so I think there might be an additional problem here. >>>>> >>>>> I'm looking at the same problem as you, Ben. It doesn't seem to be >>>>> similar to what Jeff reports. >>>>> >>>>> But I'm wondering if gerry-rigging the timeouts is the right answer >>>>> for backchannel replies. The problem, fundamentally, is that when a >>>>> forechannel RPC task holds the transport lock, the backchannel's reply >>>>> transmit path thinks that means the transport connection is down and >>>>> triggers a transport disconnect. >>>> >>>> Why shouldn't backchannel replies have normal timeout values? >>> >>> RPC Replies are "send and forget". The server forechannel sends >>> its Replies without a timeout. There is no such thing as a >>> retransmitted RPC Reply (though a reliable transport might >>> retransmit portions of it, the RPC server itself is not aware of >>> that). >>> >>> And I don't see anything in the client's backchannel path that >>> makes me think there's a different protocol-level requirement >>> in the backchannel. >> >> Its not strictly a protocol thing, the timeouts are used to decide what to >> do with a req or flag the transport state even if the request doesn't make >> it to the wire. That's why the zero timeout values for this req improperly >> resets the transport. > > I guess I'm harping on this a bit because forechannel v. backchannel > is already confusing enough. The use of timeouts for RPC Replies is > just heaping on to that confusion. Yeah, its super confusing for me too. Add to this the fact that we do it completely different for 4.0 and 4.1, but all the variable names are similar and/or reversed. > If we're going to keep an explicit timeout when sending the Reply, > it should have a little documentation. I suggest adding this to > xprt_init_bc_request() before the new call to > xprt_init_majortimeo(): > > /* > * Backchannel Replies are sent with !RPC_TASK_SOFT and > * RPC_TASK_NO_RETRANS_TIMEOUT. The major timeout setting > * affects only how long each Reply waits to be sent when > * a transport connection cannot be established. > */ > xprt_init_majortimeo(task, req, ... Ok, I will add that on 2/2 v4 which is getting some basic testing ATM. Ben