Re: [PATCH] SUNRPC: remove the maximum number of retries in call_bind_status

Jeff Layton <jlayton@xxxxxxxxxx> · Thu, 06 Apr 2023 13:33:30 -0400

On Tue, 2023-03-14 at 09:19 -0700, dai.ngo@xxxxxxxxxx wrote:
> On 3/8/23 11:03 AM, dai.ngo@xxxxxxxxxx wrote:
> > On 3/8/23 10:50 AM, Chuck Lever III wrote:
> > > 
> > > > On Mar 8, 2023, at 1:45 PM, Dai Ngo <dai.ngo@xxxxxxxxxx> wrote:
> > > > 
> > > > Currently call_bind_status places a hard limit of 3 on the number of
> > > > retries on an EACCES error. This limit was added to accommodate the
> > > > behavior of a buggy server that keeps returning garbage when the NLM
> > > > daemon is killed on the NFS server. However, this change causes
> > > > problems for other servers that take a little longer than 9 seconds
> > > > for the port mapper to become ready when the NFS server is restarted.
> > > > 
> > > > This patch removes this hard-coded limit and lets the RPC layer handle
> > > > the retry according to whether the export is soft or hard mounted.
> > > > 
> > > > To avoid the hang with a buggy server, the client can use a soft mount
> > > > for the export.
> > > > 
> > > > Fixes: 0b760113a3a1 ("NLM: Don't hang forever on NLM unlock requests")
> > > > Reported-by: Helen Chao <helen.chao@xxxxxxxxxx>
> > > > Tested-by: Helen Chao <helen.chao@xxxxxxxxxx>
> > > > Signed-off-by: Dai Ngo <dai.ngo@xxxxxxxxxx>
> > > Helen is the royal queen of ^C  ;-)
> > > 
> > > Did you try ^C on a mount while it waits for a rebind?
> > 
> > She uses a test script that restarts the NFS server while an NLM lock
> > test is running. The failure is random: sometimes it fails and sometimes
> > it passes, depending on when the LOCK/UNLOCK requests come in. So I think
> > it's hard to time the ^C, but I will ask.
> 
> We did the test with ^C and here is what we found.
> 
> For a synchronous RPC task, the signal was delivered to the RPC task and
> the task exited with -ERESTARTSYS from __rpc_execute, as expected.
> 
> For an asynchronous RPC task, the process that invokes the RPC task to send
> the request detected the signal in rpc_wait_for_completion_task and exited
> with -ERESTARTSYS. However, the async RPC task was allowed to continue to
> run to completion. So if the async RPC task was retrying an operation and
> the NFS server was down, on a hard mount it will keep retrying until the
> NFS server comes back up, which may be forever.
> 
> The question for the list is: should we propagate the signal to the async
> task via rpc_signal_task to stop its execution, or just leave it as is?
> 
> 

That is a good question.

I like the patch overall, as it gets rid of a special one-off retry
counter, but I too share some concerns about retrying indefinitely when
a server goes missing.

Propagating a signal seems like the right thing to do. It looks like
rpcb_getport_done would also need to grow a check for RPC_SIGNALLED?

It sounds pretty straightforward otherwise.
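
To make the behavior under discussion concrete, here is a rough user-space
model of the rebind retry loop. It is only a sketch, not kernel code: the
names model_task, model_bind, soft_budget and MODEL_ERESTARTSYS are
hypothetical, the "signalled" flag merely stands in for an RPC_SIGNALLED
check, and a soft mount is reduced to a simple attempt budget, whereas the
real client is driven by timeouts.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* ERESTARTSYS is kernel-internal (value 512); defined here for the model. */
#define MODEL_ERESTARTSYS 512

/* Hypothetical stand-in for an RPC task; not the kernel's struct rpc_task. */
struct model_task {
	bool signalled;   /* models a propagated signal (RPC_SIGNALLED) */
	bool soft;        /* models a soft mount */
	int  soft_budget; /* assumption: soft mounts give up after N attempts */
	int  attempts;    /* number of bind attempts made so far */
};

/*
 * Model of the rebind loop with no hard-coded retry limit.
 * ready_after: attempt number at which the port mapper starts answering.
 * signal_at:   attempt number at which a signal arrives (0 = never).
 * Returns 0 on success or a negative errno-style value.
 */
int model_bind(struct model_task *t, int ready_after, int signal_at)
{
	assert(t != NULL);

	for (int attempt = 1; ; attempt++) {
		t->attempts = attempt;

		/* A propagated signal stops the task even on a hard mount. */
		if (signal_at > 0 && attempt == signal_at)
			t->signalled = true;
		if (t->signalled)
			return -MODEL_ERESTARTSYS;

		/* The port mapper is ready: binding succeeds. */
		if (attempt >= ready_after)
			return 0;

		/* Not ready yet (the -EACCES case): soft mounts give up. */
		if (t->soft && attempt >= t->soft_budget)
			return -ETIMEDOUT;

		/* Hard mount: keep retrying. */
	}
}
```

In this model a hard mount with a slow port mapper eventually succeeds, a
soft mount gives up with -ETIMEDOUT, and a hard mount against a server that
never returns only stops once the signal is seen by the task itself.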
-- 
Jeff Layton <jlayton@xxxxxxxxxx>