On Thu, 2024-04-11 at 19:36 +0300, Dan Aloni wrote: > On 2024-04-11 11:20:14, Benjamin Coddington wrote: > > > So looks like the request is not being retransmitted. Just to be > > > sure, > > > if I cause the nfsd to drop the regular NFS3 prog I/Os like > > > ACCESS and > > > LOOKUP, I only get the expected 5 seconds delay following a > > > successful > > > retry. > > > > > > Seems we only have an issue with the NFS3ACL prog. > > > > It looks like the client_acl program gets created with > > rpc_bind_new_program() which doesn't setup the timeouts/retry > > strategy, and > > there's nothing after the setup to do it either. > > > > I think the problem has existed since 331702337f2b2.. I think this > > should > > fix it up, would you like to test it? > > Please allow me to propose a different change, which I already > tested. > > Looks to me that the 2012 change that I mentioned below is actually > when > the problem started happening, when the `kmemdup` call was removed, > so > since then we are simply just missing a field copy, and we are taking > the timeout from transport-based default timeout globals instead. > > Also, I have a hunch that we need to do something different, because > of > the following code in `rpc_new_client`: > > clnt->cl_rtt = &clnt->cl_rtt_default; > rpc_init_rtt(&clnt->cl_rtt_default, clnt->cl_timeout- > >to_initval); > > Only setting `clnt->cl_timeout` _after_ client clone does not seem to > be > right if there are `cl_rtt_default` calculations that depend on it. > So > may as well need to call `rpc_init_rtt` too. > > -- > > From 55737f82a9bb3e490836d10491995c8082ebcf11 Mon Sep 17 00:00:00 > 2001 > From: Dan Aloni <dan.aloni@xxxxxxxxxxxx> > Date: Thu, 11 Apr 2024 18:30:56 +0300 > Subject: [PATCH] sunrpc: fix NFSACL RPC retry on soft mount > > It used to be quite awhile ago since 1b63a75180c6 ('SUNRPC: Refactor > rpc_clone_client()'), in 2012, that `cl_timeout` was copied in so > that > all mount parameters propagate to NFSACL clients. However since that > change, if mount options as follows are given: > > soft,timeo=50,retrans=16,vers=3 > > The resultant NFSACL client receives: > > cl_softrtry: 1 > cl_timeout: to_initval=60000, to_maxval=60000, to_increment=0, > to_retries=2, to_exponential=0 > > These values lead to NFSACL operations not being retried under the > condition of transient network outages with soft mount. Instead, > getacl > call fails after 60 seconds with EIO. > > The simple fix is to copy `cl_timeout` and make sure `cl_rtt_default` > is initialized from it. > > Cc: Chuck Lever <chuck.lever@xxxxxxxxxx> > Cc: Benjamin Coddington <bcodding@xxxxxxxxxx> > Link: > https://lore.kernel.org/all/20231105154857.ryakhmgaptq3hb6b@xxxxxxxxx/T/ > Fixes: 1b63a75180c6 ('SUNRPC: Refactor rpc_clone_client()') > Signed-off-by: Dan Aloni <dan.aloni@xxxxxxxxxxxx> > --- > net/sunrpc/clnt.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c > index cda0935a68c9..75faf1f05a14 100644 > --- a/net/sunrpc/clnt.c > +++ b/net/sunrpc/clnt.c > @@ -669,6 +669,9 @@ static struct rpc_clnt *__rpc_clone_client(struct > rpc_create_args *args, > new->cl_chatty = clnt->cl_chatty; > new->cl_principal = clnt->cl_principal; > new->cl_max_connect = clnt->cl_max_connect; > + new->cl_timeout = clnt->cl_timeout; > + rpc_init_rtt(&clnt->cl_rtt_default, clnt->cl_timeout- > >to_initval); > + > return new; > That ends up clobbering any timeout value that gets set in the rpc_create_args, and open codes something that is already being done in the call to rpc_new_client(). IOW: I'd much rather see us default the value of args->timeout to that of clnt->cl_timeout. -- Trond Myklebust Linux NFS client maintainer, Hammerspace trond.myklebust@xxxxxxxxxxxxxxx