Re: mount options not propagating to NFSACL and NSM RPC clients

Benjamin Coddington <bcodding@xxxxxxxxxx> · Thu, 11 Apr 2024 11:20:14 -0400

On 10 Apr 2024, at 10:39, Dan Aloni wrote:

> On 2023-11-30 09:30:52, Benjamin Coddington wrote:
>>> Actually my concern is the NFSACL prog. With `cl_softrtry == 1` and
>>> `to_initval == to_maxval`, does it mean retries will not happen
>>> regardless of `to_retries` and `to_increment`?
>>
>> Possibly?  I'm not exactly certain of what should happen in that case.
>>
>>> I encountered a situation where the NFSACL program did not retry although
>>> it could have, whereas NFS3 retried successfully. Not sure regarding NSM,
>>> but it seems to me that it would make sense at least for NFSACL to
>>> behave the same as NFS3.
>>
>> I agree, but I could be missing something -- maybe it's a bug.  There's the
>> sunrpc:rpc_timeout_status tracepoint that might be helpful.  If you turn
>> that up can you see rpc_check_timeout() getting called from
>> call_transmit_status()?
>
> Sorry, it took a while to get a test working while busy on other stuff.
>
> So it really looks like a bug; here are the details.
>
> Server: nfsd with extra fault injection code that calls `svc_drop()` only once
> on a single NFS GETACL request.
> Client: Linux v6.8, NFS mount with `soft,timeo=50,retrans=16,vers=3`.
>
> I trace client execution with the following:
>
>     sudo perf trace -e sunrpc:rpc_task_timeout -e sunrpc:xprt_retransmit
>
> A simple `ls -l` gets stuck and shows an IO failure:
>
>     [root@client export]# ls -l
>     ls: file: Input/output error
>     total 0
>     -rw-r--r-- 1 root root 0 Apr 10 10:02 file
>
> I get a single event out of the tracing above:
>
> ```
> kthreadd/7926 sunrpc:rpc_task_timeout(task_id: 203, client_id: 6, xprt_id: 3, action: 0xffffffffc0accc60, runstate: 22, flags: 35456)
> ```
>
> So it looks like the request is not being retransmitted. Just to be sure,
> if I cause the nfsd to drop the regular NFS3 prog I/Os like ACCESS and
> LOOKUP, I only get the expected 5-second delay followed by a successful
> retry.
>
> Seems we only have an issue with the NFS3ACL prog.

It looks like the client_acl program gets created with
rpc_bind_new_program(), which doesn't set up the timeouts/retry strategy, and
there's nothing after the setup to do it either.
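
To make the suspected gap concrete, here is a minimal userspace sketch, not
kernel code. The `model_*` types and functions are hypothetical stand-ins for
`struct rpc_timeout`, `struct rpc_clnt` and `rpc_bind_new_program()`, and the
transport default values are placeholders chosen only for illustration. It
assumes the cloned client ends up pointing at some transport-wide timeout
rather than at the mount's own values:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for struct rpc_timeout. */
struct model_timeout {
	unsigned long to_initval;	/* initial timeout (deciseconds here) */
	unsigned int  to_retries;	/* retransmissions before giving up */
};

/* Hypothetical stand-in for struct rpc_clnt: only the timeout fields. */
struct model_clnt {
	const struct model_timeout *cl_timeout;	/* timeout in effect */
	struct model_timeout cl_timeout_default;	/* per-mount values */
};

/* Placeholder transport-wide default; NOT the kernel's real numbers. */
static const struct model_timeout model_xprt_default = { 600, 2 };

/* Models the NFS client being set up from timeo= and retrans=. */
static void model_init_mount_client(struct model_clnt *clnt,
				    unsigned long timeo, unsigned int retrans)
{
	clnt->cl_timeout_default.to_initval = timeo;
	clnt->cl_timeout_default.to_retries = retrans;
	clnt->cl_timeout = &clnt->cl_timeout_default;
}

/* Assumption: the clone receives no timeout from its parent and falls
 * back to the transport default. */
static void model_bind_new_program(struct model_clnt *clone)
{
	clone->cl_timeout_default = model_xprt_default;
	clone->cl_timeout = &model_xprt_default;
}

/* Mirrors the one-line change in the patch: reuse the mount's values. */
static void model_apply_fix(struct model_clnt *acl,
			    const struct model_clnt *nfs)
{
	assert(acl != NULL && nfs != NULL);
	acl->cl_timeout = &nfs->cl_timeout_default;
}

/* Nonzero when the ACL client uses the mount's timeout values. */
static int model_uses_mount_timeout(const struct model_clnt *acl,
				    const struct model_clnt *nfs)
{
	return acl->cl_timeout == &nfs->cl_timeout_default;
}
```

Initialising a client with timeo=50,retrans=16, cloning it, and then calling
model_apply_fix() should leave the clone's cl_timeout pointing at the parent's
values, so model_uses_mount_timeout() goes from 0 to 1.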

I think the problem has existed since 331702337f2b2.  I think this should
fix it up; would you like to test it?

Ben

---

From 54a35f530d78a8042dc6fb8ff522f5a4a33fa50b Mon Sep 17 00:00:00 2001
Message-ID: <54a35f530d78a8042dc6fb8ff522f5a4a33fa50b.1712848680.git.bcodding@xxxxxxxxxx>
From: Benjamin Coddington <bcodding@xxxxxxxxxx>
Date: Thu, 11 Apr 2024 11:03:06 -0400
Subject: [PATCH] NFS: Set v3 ACL client default timeout values

It appears the client_acl rpc_clnt doesn't take any timeout values, so
retries for the ACL procedures are never attempted.  Set them based on the
mount's default timeout values.

Reported-by: Dan Aloni <dan.aloni@xxxxxxxxxxxx>
Fixes: 331702337f2b ("NFS: Support per-mountpoint timeout parameters.")
Signed-off-by: Benjamin Coddington <bcodding@xxxxxxxxxx>
---
 fs/nfs/nfs3client.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/nfs/nfs3client.c b/fs/nfs/nfs3client.c
index b0c8a39c2bbd..66bb1f56c5d9 100644
--- a/fs/nfs/nfs3client.c
+++ b/fs/nfs/nfs3client.c
@@ -33,6 +33,7 @@ static void nfs_init_server_aclclient(struct nfs_server *server)
        if (IS_ERR(server->client_acl))
                goto out_noacl;

+       server->client_acl->cl_timeout = &server->client->cl_timeout_default;
        nfs_sysfs_link_rpc_client(server, server->client_acl, NULL);

        /* No errors! Assume that Sun nfsacls are supported */
-- 
2.44.0