On 11 Apr 2024, at 12:36, Dan Aloni wrote: > On 2024-04-11 11:20:14, Benjamin Coddington wrote: >>> So looks like the request is not being retransmitted. Just to be sure, >>> if I cause the nfsd to drop the regular NFS3 prog I/Os like ACCESS and >>> LOOKUP, I only get the expected 5 seconds delay following a successful >>> retry. >>> >>> Seems we only have an issue with the NFS3ACL prog. >> >> It looks like the client_acl program gets created with >> rpc_bind_new_program() which doesn't setup the timeouts/retry strategy, and >> there's nothing after the setup to do it either. >> >> I think the problem has existed since 331702337f2b2.. I think this should >> fix it up, would you like to test it? > > Please allow me to propose a different change, which I already tested. > > Looks to me that the 2012 change that I mentioned below is actually when > the problem started happening, when the `kmemdup` call was removed, so > since then we are simply just missing a field copy, and we are taking > the timeout from transport-based default timeout globals instead. > > Also, I have a hunch that we need to do something different, because of > the following code in `rpc_new_client`: > > clnt->cl_rtt = &clnt->cl_rtt_default; > rpc_init_rtt(&clnt->cl_rtt_default, clnt->cl_timeout->to_initval); > > Only setting `clnt->cl_timeout` _after_ client clone does not seem to be > right if there are `cl_rtt_default` calculations that depend on it. So > may as well need to call `rpc_init_rtt` too. > > -- > > From 55737f82a9bb3e490836d10491995c8082ebcf11 Mon Sep 17 00:00:00 2001 > From: Dan Aloni <dan.aloni@xxxxxxxxxxxx> > Date: Thu, 11 Apr 2024 18:30:56 +0300 > Subject: [PATCH] sunrpc: fix NFSACL RPC retry on soft mount > > It used to be quite awhile ago since 1b63a75180c6 ('SUNRPC: Refactor > rpc_clone_client()'), in 2012, that `cl_timeout` was copied in so that > all mount parameters propagate to NFSACL clients. However since that > change, if mount options as follows are given: > > soft,timeo=50,retrans=16,vers=3 > > The resultant NFSACL client receives: > > cl_softrtry: 1 > cl_timeout: to_initval=60000, to_maxval=60000, to_increment=0, to_retries=2, to_exponential=0 > > These values lead to NFSACL operations not being retried under the > condition of transient network outages with soft mount. Instead, getacl > call fails after 60 seconds with EIO. > > The simple fix is to copy `cl_timeout` and make sure `cl_rtt_default` > is initialized from it. Ah, good catch, and I think this means that the normal IO client (server->client) hasn't been properly setting up its RTT estimator as well? > Cc: Chuck Lever <chuck.lever@xxxxxxxxxx> > Cc: Benjamin Coddington <bcodding@xxxxxxxxxx> > Link: https://lore.kernel.org/all/20231105154857.ryakhmgaptq3hb6b@xxxxxxxxx/T/ > Fixes: 1b63a75180c6 ('SUNRPC: Refactor rpc_clone_client()') > Signed-off-by: Dan Aloni <dan.aloni@xxxxxxxxxxxx> > --- > net/sunrpc/clnt.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c > index cda0935a68c9..75faf1f05a14 100644 > --- a/net/sunrpc/clnt.c > +++ b/net/sunrpc/clnt.c > @@ -669,6 +669,9 @@ static struct rpc_clnt *__rpc_clone_client(struct rpc_create_args *args, > new->cl_chatty = clnt->cl_chatty; > new->cl_principal = clnt->cl_principal; > new->cl_max_connect = clnt->cl_max_connect; > + new->cl_timeout = clnt->cl_timeout; > + rpc_init_rtt(&clnt->cl_rtt_default, clnt->cl_timeout->to_initval); > + So we'll end up doing rpc_init_rtt twice, first in rpc_new_client then here. Maybe we should be passing cl_timeout on .timeout from the parent in the rpc_clone_client()'s rpc_create_args? .. anyway - I'll stop "helping", you've clearly got this in hand. :) Ben