On 2024-04-11 11:20:14, Benjamin Coddington wrote: > > So looks like the request is not being retransmitted. Just to be sure, > > if I cause the nfsd to drop the regular NFS3 prog I/Os like ACCESS and > > LOOKUP, I only get the expected 5 seconds delay following a successful > > retry. > > > > Seems we only have an issue with the NFS3ACL prog. > > It looks like the client_acl program gets created with > rpc_bind_new_program() which doesn't setup the timeouts/retry strategy, and > there's nothing after the setup to do it either. > > I think the problem has existed since 331702337f2b2.. I think this should > fix it up, would you like to test it? Please allow me to propose a different change, which I already tested. Looks to me that the 2012 change that I mentioned below is actually when the problem started happening, when the `kmemdup` call was removed, so since then we are simply just missing a field copy, and we are taking the timeout from transport-based default timeout globals instead. Also, I have a hunch that we need to do something different, because of the following code in `rpc_new_client`: clnt->cl_rtt = &clnt->cl_rtt_default; rpc_init_rtt(&clnt->cl_rtt_default, clnt->cl_timeout->to_initval); Only setting `clnt->cl_timeout` _after_ client clone does not seem to be right if there are `cl_rtt_default` calculations that depend on it. So may as well need to call `rpc_init_rtt` too. -- >From 55737f82a9bb3e490836d10491995c8082ebcf11 Mon Sep 17 00:00:00 2001 From: Dan Aloni <dan.aloni@xxxxxxxxxxxx> Date: Thu, 11 Apr 2024 18:30:56 +0300 Subject: [PATCH] sunrpc: fix NFSACL RPC retry on soft mount It used to be quite awhile ago since 1b63a75180c6 ('SUNRPC: Refactor rpc_clone_client()'), in 2012, that `cl_timeout` was copied in so that all mount parameters propagate to NFSACL clients. However since that change, if mount options as follows are given: soft,timeo=50,retrans=16,vers=3 The resultant NFSACL client receives: cl_softrtry: 1 cl_timeout: to_initval=60000, to_maxval=60000, to_increment=0, to_retries=2, to_exponential=0 These values lead to NFSACL operations not being retried under the condition of transient network outages with soft mount. Instead, getacl call fails after 60 seconds with EIO. The simple fix is to copy `cl_timeout` and make sure `cl_rtt_default` is initialized from it. Cc: Chuck Lever <chuck.lever@xxxxxxxxxx> Cc: Benjamin Coddington <bcodding@xxxxxxxxxx> Link: https://lore.kernel.org/all/20231105154857.ryakhmgaptq3hb6b@xxxxxxxxx/T/ Fixes: 1b63a75180c6 ('SUNRPC: Refactor rpc_clone_client()') Signed-off-by: Dan Aloni <dan.aloni@xxxxxxxxxxxx> --- net/sunrpc/clnt.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c index cda0935a68c9..75faf1f05a14 100644 --- a/net/sunrpc/clnt.c +++ b/net/sunrpc/clnt.c @@ -669,6 +669,9 @@ static struct rpc_clnt *__rpc_clone_client(struct rpc_create_args *args, new->cl_chatty = clnt->cl_chatty; new->cl_principal = clnt->cl_principal; new->cl_max_connect = clnt->cl_max_connect; + new->cl_timeout = clnt->cl_timeout; + rpc_init_rtt(&clnt->cl_rtt_default, clnt->cl_timeout->to_initval); + return new; out_err: -- 2.39.3 -- Dan Aloni