Re: mount options not propagating to NFSACL and NSM RPC clients

Benjamin Coddington <bcodding@xxxxxxxxxx> · Thu, 11 Apr 2024 14:20:43 -0400

On 11 Apr 2024, at 12:36, Dan Aloni wrote:

> On 2024-04-11 11:20:14, Benjamin Coddington wrote:
>>> So looks like the request is not being retransmitted. Just to be sure,
>>> if I cause the nfsd to drop the regular NFS3 prog I/Os like ACCESS and
>>> LOOKUP, I only get the expected 5 seconds delay following a successful
>>> retry.
>>>
>>> Seems we only have an issue with the NFS3ACL prog.
>>
>> It looks like the client_acl program gets created with
>> rpc_bind_new_program() which doesn't setup the timeouts/retry strategy, and
>> there's nothing after the setup to do it either.
>>
>> I think the problem has existed since 331702337f2b2..  I think this should
>> fix it up, would you like to test it?
>
> Please allow me to propose a different change, which I already tested.
>
> Looks to me that the 2012 change that I mentioned below is actually when
> the problem started happening, when the `kmemdup` call was removed, so
> since then we are simply just missing a field copy, and we are taking
> the timeout from transport-based default timeout globals instead.
>
> Also, I have a hunch that we need to do something different, because of
> the following code in `rpc_new_client`:
>
>     clnt->cl_rtt = &clnt->cl_rtt_default;
>     rpc_init_rtt(&clnt->cl_rtt_default, clnt->cl_timeout->to_initval);
>
> Only setting `clnt->cl_timeout` _after_ client clone does not seem to be
> right if there are `cl_rtt_default` calculations that depend on it. So
> may as well need to call `rpc_init_rtt` too.
>
> --
>
> From 55737f82a9bb3e490836d10491995c8082ebcf11 Mon Sep 17 00:00:00 2001
> From: Dan Aloni <dan.aloni@xxxxxxxxxxxx>
> Date: Thu, 11 Apr 2024 18:30:56 +0300
> Subject: [PATCH] sunrpc: fix NFSACL RPC retry on soft mount
>
> It used to be quite awhile ago since 1b63a75180c6 ('SUNRPC: Refactor
> rpc_clone_client()'), in 2012, that `cl_timeout` was copied in so that
> all mount parameters propagate to NFSACL clients. However since that
> change, if mount options as follows are given:
>
>     soft,timeo=50,retrans=16,vers=3
>
> The resultant NFSACL client receives:
>
>     cl_softrtry: 1
>     cl_timeout: to_initval=60000, to_maxval=60000, to_increment=0, to_retries=2, to_exponential=0
>
> These values lead to NFSACL operations not being retried under the
> condition of transient network outages with soft mount. Instead, getacl
> call fails after 60 seconds with EIO.
>
> The simple fix is to copy `cl_timeout` and make sure `cl_rtt_default`
> is initialized from it.

Ah, good catch, and I think this means that the normal IO client
(server->client) hasn't been properly setting up its RTT estimator as well?

> Cc: Chuck Lever <chuck.lever@xxxxxxxxxx>
> Cc: Benjamin Coddington <bcodding@xxxxxxxxxx>
> Link: https://lore.kernel.org/all/20231105154857.ryakhmgaptq3hb6b@xxxxxxxxx/T/
> Fixes: 1b63a75180c6 ('SUNRPC: Refactor rpc_clone_client()')
> Signed-off-by: Dan Aloni <dan.aloni@xxxxxxxxxxxx>
> ---
>  net/sunrpc/clnt.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
> index cda0935a68c9..75faf1f05a14 100644
> --- a/net/sunrpc/clnt.c
> +++ b/net/sunrpc/clnt.c
> @@ -669,6 +669,9 @@ static struct rpc_clnt *__rpc_clone_client(struct rpc_create_args *args,
>  	new->cl_chatty = clnt->cl_chatty;
>  	new->cl_principal = clnt->cl_principal;
>  	new->cl_max_connect = clnt->cl_max_connect;
> +	new->cl_timeout = clnt->cl_timeout;
> +	rpc_init_rtt(&clnt->cl_rtt_default, clnt->cl_timeout->to_initval);
> +

So we'll end up doing rpc_init_rtt twice, first in rpc_new_client then here.
Maybe we should be passing cl_timeout on .timeout from the parent in the
rpc_clone_client()'s rpc_create_args?

.. anyway - I'll stop "helping", you've clearly got this in hand.  :)

Ben