Re: [PATCH] SUNRPC: increase max timeout for rebind to handle NFS server restart

Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> · Tue, 18 Apr 2023 16:02:31 +0000

On Mon, 2023-04-17 at 18:04 -0700, dai.ngo@xxxxxxxxxx wrote:
> 
> On 4/17/23 5:23 PM, Trond Myklebust wrote:
> > task->tk_rebind_retry is _only_ changed if the rpcbind server is up
> > and running, and returns an empty reply because the service we're
> > looking up isn't registered.
> > task->tk_rebind_retry isn't changed on any request timeout. It isn't
> > changed on any connection failure. It isn't changed by any other
> > code path in the RPC client.
> > 
> > So none of this applies to the case of a dead server.
> 
> Sorry if I wasn't clear. What I meant by a dead server is a dead NFS
> server, not a dead rpcbind service. So in this case we get EACCES from
> rpcbind and we retry.
> 
> > 
> > It applies to the case of a live server, where rpcbind is running and
> > accessible to the client and where, for some reason or another, it is
> > taking an exceptionally long time to register the service we are
> > looking up the port for (either NLM or NFSv3).
> 
> Yes, this is the problem that I'm facing.
> 
> > 
> > So where are you seeing this process take 90 seconds? Why do we need
> > to wait for that long before we can finally conclude that the
> > particular service in question is not going to come back up?
> 
> The 90-second wait is for the case where the NFS server never comes up
> and we keep getting EACCES from rpcbind for this whole time.
> 

OK, so the 90s is completely arbitrary then, and was only chosen
because it fits your particular server?

To me, that appears to invalidate the entire premise of commit
0b760113a3a1, namely that we can rely on rpcbind to tell us whether the
service is present or not.
In that case, I'd rather rip out the task->tk_rebind_retry counter, and
just rely on standard hard/soft task semantics.
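
For illustration only, here is a rough userspace model of the behaviour
discussed above; it is a sketch, not the kernel code. The names
(task_model, handle_bind_result, worst_case_wait_secs) and the fixed
per-retry delay are assumptions made for the example. It shows that the
counter is consumed only when rpcbind answers that the service is not
registered, and that the worst-case wait is simply retries times delay:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcomes of one rpcbind port lookup (model only). */
enum bind_result {
	BIND_OK,           /* rpcbind returned a port for the service */
	BIND_UNREGISTERED, /* rpcbind is up, service not registered (EACCES) */
	BIND_TIMEOUT,      /* request timed out or the connection failed */
};

/* Stand-in for the relevant per-task state; not the kernel's rpc_task. */
struct task_model {
	int rebind_retry; /* models task->tk_rebind_retry */
	int waited_secs;  /* total time spent in rebind delays */
};

/*
 * Returns true if the bind should be attempted again.
 * Only the "service not registered" case touches the retry counter;
 * timeouts and connection failures leave it alone (in this model they
 * always retry, since hard/soft semantics are not modelled here).
 */
static bool handle_bind_result(struct task_model *t, enum bind_result r,
			       int delay_secs)
{
	switch (r) {
	case BIND_OK:
		return false;
	case BIND_UNREGISTERED:
		if (t->rebind_retry == 0)
			return false; /* budget exhausted, give up */
		t->rebind_retry--;
		t->waited_secs += delay_secs;
		return true;
	case BIND_TIMEOUT:
		return true; /* counter and wait budget untouched */
	}
	return false;
}

/* Total wait if the service never registers with rpcbind. */
static int worst_case_wait_secs(int retries, int delay_secs)
{
	struct task_model t = { .rebind_retry = retries, .waited_secs = 0 };

	assert(retries >= 0 && delay_secs >= 0);
	while (handle_bind_result(&t, BIND_UNREGISTERED, delay_secs))
		;
	return t.waited_secs;
}
```

In this model, worst_case_wait_secs(30, 3) gives 90 and
worst_case_wait_secs(2, 3) gives 6; both pairs are illustrative values,
which is the point: whatever bound is picked is a guess about how long
one particular server takes to register its services.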

-- 
Trond Myklebust
Linux NFS client maintainer, Hammerspace
trond.myklebust@xxxxxxxxxxxxxxx