We need to hear from the NFS client maintainers about the below question and the patch. > Begin forwarded message: > > From: dai.ngo@xxxxxxxxxx > Subject: Re: [PATCH] SUNRPC: remove the maximum number of retries in call_bind_status > Date: March 14, 2023 at 12:19:30 PM EDT > To: Chuck Lever III <chuck.lever@xxxxxxxxxx> > Cc: Linux NFS Mailing List <linux-nfs@xxxxxxxxxxxxxxx>, Helen Chao <helen.chao@xxxxxxxxxx> > > > On 3/8/23 11:03 AM, dai.ngo@xxxxxxxxxx wrote: >> On 3/8/23 10:50 AM, Chuck Lever III wrote: >>> >>>> On Mar 8, 2023, at 1:45 PM, Dai Ngo <dai.ngo@xxxxxxxxxx> wrote: >>>> >>>> Currently call_bind_status places a hard limit of 3 to the number of >>>> retries on EACCES error. This limit was done to accommodate the behavior >>>> of a buggy server that keeps returning garbage when the NLM daemon is >>>> killed on the NFS server. However this change causes problem for other >>>> servers that take a little longer than 9 seconds for the port mapper to >>>> become ready when the NFS server is restarted. >>>> >>>> This patch removes this hard coded limit and let the RPC handles >>>> the retry according to whether the export is soft or hard mounted. >>>> >>>> To avoid the hang with buggy server, the client can use soft mount for >>>> the export. >>>> >>>> Fixes: 0b760113a3a1 ("NLM: Don't hang forever on NLM unlock requests") >>>> Reported-by: Helen Chao <helen.chao@xxxxxxxxxx> >>>> Tested-by: Helen Chao <helen.chao@xxxxxxxxxx> >>>> Signed-off-by: Dai Ngo <dai.ngo@xxxxxxxxxx> >>> Helen is the royal queen of ^C ;-) >>> >>> Did you try ^C on a mount while it waits for a rebind? >> >> She uses a test script that restarts the NFS server while NLM lock test >> is running. The failure is random, sometimes it fails and sometimes it >> passes depending on when the LOCK/UNLOCK requests come in so I think >> it's hard to time it to do the ^C, but I will ask. > > We did the test with ^C and here is what we found. > > For synchronous RPC task the signal was delivered to the RPC task and > the task exit with -ERESTARTSYS from __rpc_execute as expected. > > For asynchronous RPC task the process that invokes the RPC task to send > the request detected the signal in rpc_wait_for_completion_task and exits > with -ERESTARTSYS. However the async RPC was allowed to continue to run > to completion. So if the async RPC task was retrying an operation and > the NFS server was down, it will retry forever if this is a hard mount > or until the NFS server comes back up. > > The question for the list is should we propagate the signal to the async > task via rpc_signal_task to stop its execution or just leave it alone as is. > > -Dai > >> >> Thanks, >> -Dai >> >>> >>> >>>> --- >>>> include/linux/sunrpc/sched.h | 3 +-- >>>> net/sunrpc/clnt.c | 3 --- >>>> net/sunrpc/sched.c | 1 - >>>> 3 files changed, 1 insertion(+), 6 deletions(-) >>>> >>>> diff --git a/include/linux/sunrpc/sched.h b/include/linux/sunrpc/sched.h >>>> index b8ca3ecaf8d7..8ada7dc802d3 100644 >>>> --- a/include/linux/sunrpc/sched.h >>>> +++ b/include/linux/sunrpc/sched.h >>>> @@ -90,8 +90,7 @@ struct rpc_task { >>>> #endif >>>> unsigned char tk_priority : 2,/* Task priority */ >>>> tk_garb_retry : 2, >>>> - tk_cred_retry : 2, >>>> - tk_rebind_retry : 2; >>>> + tk_cred_retry : 2; >>>> }; >>>> >>>> typedef void (*rpc_action)(struct rpc_task *); >>>> diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c >>>> index 0b0b9f1eed46..63b438d8564b 100644 >>>> --- a/net/sunrpc/clnt.c >>>> +++ b/net/sunrpc/clnt.c >>>> @@ -2050,9 +2050,6 @@ call_bind_status(struct rpc_task *task) >>>> status = -EOPNOTSUPP; >>>> break; >>>> } >>>> - if (task->tk_rebind_retry == 0) >>>> - break; >>>> - task->tk_rebind_retry--; >>>> rpc_delay(task, 3*HZ); >>>> goto retry_timeout; >>>> case -ENOBUFS: >>>> diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c >>>> index be587a308e05..c8321de341ee 100644 >>>> --- a/net/sunrpc/sched.c >>>> +++ b/net/sunrpc/sched.c >>>> @@ -817,7 +817,6 @@ rpc_init_task_statistics(struct rpc_task *task) >>>> /* Initialize retry counters */ >>>> task->tk_garb_retry = 2; >>>> task->tk_cred_retry = 2; >>>> - task->tk_rebind_retry = 2; >>>> >>>> /* starting timestamp */ >>>> task->tk_start = ktime_get(); >>>> -- >>>> 2.9.5 >>>> >>> -- >>> Chuck Lever -- Chuck Lever