On 10 Apr 2024, at 10:39, Dan Aloni wrote:

> On 2023-11-30 09:30:52, Benjamin Coddington wrote:
>>> Actually my concern is the NFSACL prog. With `cl_softrtrt == 1` and
>>> `to_initval == to_maxval`, does it mean retries will not happen
>>> regardless of `to_retries` and `to_increment`?
>>
>> Possibly? I'm not exactly certain of what should happen in that case.
>>
>>> I encountered a situation where the NFSACL program did not retry but
>>> could have, whereas NFS3 did successfully. Not sure regarding NSM,
>>> but it seems to me that it would make sense at least for NFSACL to
>>> behave the same as NFS3.
>>
>> I agree, but I could be missing something -- maybe it's a bug. There's the
>> sunrpc:rpc_timeout_status tracepoint that might be helpful. If you turn
>> that up can you see rpc_check_timeout() getting called from
>> call_transmit_status()?
>
> Sorry, took a while to get a test working while busy on other stuff.
>
> So it looks really like a bug, here are the details.
>
> Server: nfsd with extra fault injection code that calls `svc_drop()` only once
> on a single NFS GETACL request.
> Client: Linux v6.8, NFS mount with `soft,timeo=50,retrans=16,vers=3`.
>
> I trace client execution with the following:
>
>     sudo perf trace -e sunrpc:rpc_task_timeout -e sunrpc:xprt_retransmit
>
> A simple `ls -l` gets stuck and shows an IO failure:
>
>     [root@client export]# ls -l
>     ls: file: Input/output error
>     total 0
>     -rw-r--r-- 1 root root 0 Apr 10 10:02 file
>
> I get a single event out of the tracing above:
>
> ```
> kthreadd/7926 sunrpc:rpc_task_timeout(task_id: 203, client_id: 6, xprt_id: 3, action: 0xffffffffc0accc60, runstate: 22, flags: 35456)
> ```
>
> So looks like the request is not being retransmitted. Just to be sure,
> if I cause the nfsd to drop the regular NFS3 prog I/Os like ACCESS and
> LOOKUP, I only get the expected 5 seconds delay following a successful
> retry.
>
> Seems we only have an issue with the NFS3ACL prog.
It looks like the client_acl program gets created with
rpc_bind_new_program(), which doesn't set up the timeouts/retry strategy, and
there's nothing after the setup to do it either.

I think the problem has existed since 331702337f2b2.

I think this should fix it up, would you like to test it?

Ben

---
From 54a35f530d78a8042dc6fb8ff522f5a4a33fa50b Mon Sep 17 00:00:00 2001
Message-ID: <54a35f530d78a8042dc6fb8ff522f5a4a33fa50b.1712848680.git.bcodding@xxxxxxxxxx>
From: Benjamin Coddington <bcodding@xxxxxxxxxx>
Date: Thu, 11 Apr 2024 11:03:06 -0400
Subject: [PATCH] NFS: Set v3 ACL client default timeout values

It appears the client_acl rpc_clnt doesn't take any timeout values, so
retries for the ACL procedures are never attempted. Set them based on the
mount's default timeout values.

Reported-by: Dan Aloni <dan.aloni@xxxxxxxxxxxx>
Fixes: 331702337f2b ("NFS: Support per-mountpoint timeout parameters.")
Signed-off-by: Benjamin Coddington <bcodding@xxxxxxxxxx>
---
 fs/nfs/nfs3client.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/nfs/nfs3client.c b/fs/nfs/nfs3client.c
index b0c8a39c2bbd..66bb1f56c5d9 100644
--- a/fs/nfs/nfs3client.c
+++ b/fs/nfs/nfs3client.c
@@ -33,6 +33,7 @@ static void nfs_init_server_aclclient(struct nfs_server *server)
 	if (IS_ERR(server->client_acl))
 		goto out_noacl;
 
+	server->client_acl->cl_timeout = &server->client->cl_timeout_default;
 	nfs_sysfs_link_rpc_client(server, server->client_acl, NULL);
 
 	/* No errors! Assume that Sun nfsacls are supported */
-- 
2.44.0