On Tue, 2023-03-14 at 09:19 -0700, dai.ngo@xxxxxxxxxx wrote:
> On 3/8/23 11:03 AM, dai.ngo@xxxxxxxxxx wrote:
> > On 3/8/23 10:50 AM, Chuck Lever III wrote:
> > >
> > > > On Mar 8, 2023, at 1:45 PM, Dai Ngo <dai.ngo@xxxxxxxxxx> wrote:
> > > >
> > > > Currently call_bind_status places a hard limit of 3 to the number of
> > > > retries on EACCES error. This limit was done to accommodate the
> > > > behavior
> > > > of a buggy server that keeps returning garbage when the NLM daemon is
> > > > killed on the NFS server. However this change causes problem for other
> > > > servers that take a little longer than 9 seconds for the port mapper to
> > > > become ready when the NFS server is restarted.
> > > >
> > > > This patch removes this hard coded limit and let the RPC handles
> > > > the retry according to whether the export is soft or hard mounted.
> > > >
> > > > To avoid the hang with buggy server, the client can use soft mount for
> > > > the export.
> > > >
> > > > Fixes: 0b760113a3a1 ("NLM: Don't hang forever on NLM unlock requests")
> > > > Reported-by: Helen Chao <helen.chao@xxxxxxxxxx>
> > > > Tested-by: Helen Chao <helen.chao@xxxxxxxxxx>
> > > > Signed-off-by: Dai Ngo <dai.ngo@xxxxxxxxxx>
> > > Helen is the royal queen of ^C ;-)
> > >
> > > Did you try ^C on a mount while it waits for a rebind?
> >
> > She uses a test script that restarts the NFS server while NLM lock test
> > is running. The failure is random, sometimes it fails and sometimes it
> > passes depending on when the LOCK/UNLOCK requests come in so I think
> > it's hard to time it to do the ^C, but I will ask.
>
> We did the test with ^C and here is what we found.
>
> For synchronous RPC task the signal was delivered to the RPC task and
> the task exit with -ERESTARTSYS from __rpc_execute as expected.
>
> For asynchronous RPC task the process that invokes the RPC task to send
> the request detected the signal in rpc_wait_for_completion_task and exits
> with -ERESTARTSYS. However the async RPC was allowed to continue to run
> to completion. So if the async RPC task was retrying an operation and
> the NFS server was down, it will retry forever if this is a hard mount
> or until the NFS server comes back up.
>
> The question for the list is should we propagate the signal to the async
> task via rpc_signal_task to stop its execution or just leave it alone as is.
>

That is a good question.

I like the patch overall, as it gets rid of a special one-off retry
counter, but I too share some concerns about retrying indefinitely when
a server goes missing.

Propagating a signal seems like the right thing to do. Looks like
rpcb_getport_done would also need to grow a check for RPC_SIGNALLED?

It sounds pretty straightforward otherwise.

-- 
Jeff Layton <jlayton@xxxxxxxxxx>
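
For anyone following along, the retry behaviour discussed in this thread can
be modelled with a small userspace sketch in C. This is not kernel code and
not the proposed patch: model_task, model_bind and server_ready are
hypothetical names invented for illustration, the "attempt" counters stand in
for real time, and the soft-mount retry budget is an assumed stand-in for the
RPC layer's timeout handling. It only shows the shape of the idea, namely that
a hard-mounted bind keeps retrying past the old limit of 3, and that a check
for a propagated signal before each retry is what lets it stop when the server
never comes back.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
enum bind_result { BIND_OK, BIND_TIMED_OUT, BIND_INTERRUPTED };

struct model_task {
    bool soft;             /* soft mount: give up once the budget is spent */
    bool signalled;        /* set when a signal has been propagated */
    int  max_soft_retries; /* assumed retry budget, used for soft mounts only */
    int  attempts;         /* bind attempts made so far */
};

/* Pretend port mapper: ready from attempt `ready_at` onward (0 = never). */
static bool server_ready(int attempt, int ready_at)
{
    return ready_at > 0 && attempt >= ready_at;
}

/*
 * Retry the bind until it succeeds, the soft budget runs out, or the task
 * has been signalled. `signal_at` is the attempt after which a signal is
 * delivered (0 = never). A hard mount with ready_at == 0 and signal_at == 0
 * loops forever, which is exactly the concern raised above.
 */
static enum bind_result model_bind(struct model_task *t, int ready_at,
                                   int signal_at)
{
    assert(t != NULL);

    for (;;) {
        /* The check a signal-aware retry path would need before retrying. */
        if (t->signalled)
            return BIND_INTERRUPTED;

        t->attempts++;
        if (server_ready(t->attempts, ready_at))
            return BIND_OK;

        /* Only soft mounts have a bounded number of retries. */
        if (t->soft && t->attempts >= t->max_soft_retries)
            return BIND_TIMED_OUT;

        /* Simulate the signal being propagated to the async task. */
        if (signal_at > 0 && t->attempts >= signal_at)
            t->signalled = true;
    }
}
```

In this model a hard mount whose server becomes ready on the fifth attempt
succeeds where a fixed limit of 3 would have failed, a soft mount with a
budget of 3 gives up after three attempts, and a hard mount with a server
that never returns stops only once the signal is observed.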