Hi Trond, On Fri, Apr 17, 2020 at 11:31 AM Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote: > > Hi Olga, > > On Fri, 2020-04-17 at 11:15 -0400, Olga Kornievskaia wrote: > > When nconnect is used, SEQUENCE operation currently isn't bound to > > a particular transport. The problem is created on an idle mount, > > where SEQUENCE is the only operation being sent and opened TPC > > connections are slowly being close from the lack of use. If SEQUENCE > > is not assigned to the main connection, the main connection can > > be closed and with that so is the back channel bound to that > > connection. > > > > Since the only way client handles callback_path down is by sending > > BIND_CONN_TO_SESSION requesting to bind both backchannel and fore > > channel on the connection that was left going, but that connection > > was already bound to only forechannel. According to the spec, it's > > not allowed to change channel binding after they are done. > > > > The fix is to make sure that a lone SEQUENCE always goes on the > > main connection, keeping backchannel alive. > > > > Fixes: 5a0c257f8 ("NFS: send state management on a single > > connection") > > Signed-off-by: Olga Kornievskaia <kolga@xxxxxxxxxx> > > --- > > fs/nfs/nfs4proc.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c > > index 99e9f2e..461f85d 100644 > > --- a/fs/nfs/nfs4proc.c > > +++ b/fs/nfs/nfs4proc.c > > @@ -8857,7 +8857,7 @@ static struct rpc_task > > *_nfs41_proc_sequence(struct nfs_client *clp, > > .rpc_client = clp->cl_rpcclient, > > .rpc_message = &msg, > > .callback_ops = &nfs41_sequence_ops, > > - .flags = RPC_TASK_ASYNC | RPC_TASK_TIMEOUT, > > + .flags = RPC_TASK_ASYNC | RPC_TASK_TIMEOUT | > > RPC_TASK_NO_ROUND_ROBIN, > > }; > > struct rpc_task *ret; > > > > This works only in the case where the client is only sending SEQUENCE > instructions. There are other cases where it could be sending out other > operations that also renew the lease, but is doing it very > infrequently. Won't that also run into the same problem? Hm... I see so main channel can still go idle and close, when infrequent operations are happening on the other connections before round-robin-ing to the main connection.... > Is the fundamental problem here that we're not handling the > SEQ4_STATUS_CB_PATH_DOWN / SEQ4_STATUS_CB_PATH_DOWN_SESSION flags > correctly or is there something else going on? Yes the client doesn't recover properly. But the fix wasn't trivial to me (so I thought my patch was enough but I see it's not). Say client shuts down the main connection because it was idle. Now whatever operations goes on a different connection is going to get callback down. The only way the client can create a new backchannel (according to the spec) is if it creates a brand new connection and sends BIND_CONN_TO_SESSION there (all existing connections are already bound to fore channel and according to the spec you can't modify the existing binding). But then we'd need to make sure that it's the first one in the list of connections we iterate thru (as i think 1st marks the main connection?) as the other operations that supposed to only go on main connection need to know which connection to pick. The reason it's not seen against linux is because it doesn't follow the spec is doesn't reject attempts to bind a backchannel to an already existing connection that was only bound for fore channel. > > -- > Trond Myklebust > Linux NFS client maintainer, Hammerspace > trond.myklebust@xxxxxxxxxxxxxxx > >