> On Apr 6, 2023, at 3:36 PM, Dai Ngo <dai.ngo@xxxxxxxxxx> wrote:
>
> Hi Jeff,
>
> Thank you for taking a look at the patch.
>
> On 4/6/23 11:10 AM, Jeff Layton wrote:
>> On Thu, 2023-04-06 at 13:33 -0400, Jeff Layton wrote:
>>> On Tue, 2023-03-14 at 09:19 -0700, dai.ngo@xxxxxxxxxx wrote:
>>>> On 3/8/23 11:03 AM, dai.ngo@xxxxxxxxxx wrote:
>>>>> On 3/8/23 10:50 AM, Chuck Lever III wrote:
>>>>>>> On Mar 8, 2023, at 1:45 PM, Dai Ngo <dai.ngo@xxxxxxxxxx> wrote:
>>>>>>>
>>>>>>> Currently call_bind_status places a hard limit of 3 on the number of
>>>>>>> retries on an EACCES error. This limit was added to accommodate the
>>>>>>> behavior of a buggy server that keeps returning garbage when the NLM
>>>>>>> daemon is killed on the NFS server. However, this change causes
>>>>>>> problems for other servers that take a little longer than 9 seconds
>>>>>>> for the port mapper to
>>>>>>>
>>
>> Actually, the EACCES error means that the host doesn't have the port
>> registered.
>
> Yes, our SQA team runs a stress lock test and restarts the NFS server.
> Sometimes the NFS server starts up and registers with the port mapper in
> time and there is no problem, but occasionally it is late coming up,
> causing this EACCES error.
>
>> That could happen if (e.g.) the host had an NFSv3 mount up
>> with an NLM connection and then crashed and rebooted and didn't remount
>> it.
>
> Can you please explain this scenario? I don't quite follow it. If the v3
> client crashed and did not remount the export, then how can the client try
> to access/lock anything on the server? I must be missing something here.
>
>>
>>>>>>> become ready when the NFS server is restarted.
>>>>>>>
>>>>>>> This patch removes this hard-coded limit and lets the RPC layer handle
>>>>>>> the retry according to whether the export is soft or hard mounted.
>>>>>>>
>>>>>>> To avoid the hang with a buggy server, the client can use a soft mount
>>>>>>> for the export.
>>>>>>>
>>>>>>> Fixes: 0b760113a3a1 ("NLM: Don't hang forever on NLM unlock requests")
>>>>>>> Reported-by: Helen Chao <helen.chao@xxxxxxxxxx>
>>>>>>> Tested-by: Helen Chao <helen.chao@xxxxxxxxxx>
>>>>>>> Signed-off-by: Dai Ngo <dai.ngo@xxxxxxxxxx>
>>>>>> Helen is the royal queen of ^C ;-)
>>>>>>
>>>>>> Did you try ^C on a mount while it waits for a rebind?
>>>>> She uses a test script that restarts the NFS server while the NLM lock
>>>>> test is running. The failure is random: sometimes it fails and sometimes
>>>>> it passes, depending on when the LOCK/UNLOCK requests come in, so I
>>>>> think it's hard to time the ^C, but I will ask.
>>>> We did the test with ^C and here is what we found.
>>>>
>>>> For a synchronous RPC task the signal was delivered to the RPC task and
>>>> the task exited with -ERESTARTSYS from __rpc_execute as expected.
>>>>
>>>> For an asynchronous RPC task, the process that invoked the RPC task to
>>>> send the request detected the signal in rpc_wait_for_completion_task and
>>>> exited with -ERESTARTSYS. However, the async RPC task was allowed to
>>>> continue running to completion. So if the async RPC task was retrying an
>>>> operation and the NFS server was down, it would retry forever on a hard
>>>> mount, or until the NFS server came back up.
>>>>
>>>> The question for the list is: should we propagate the signal to the async
>>>> task via rpc_signal_task to stop its execution, or just leave it alone
>>>> as is?
>>>>
>>> That is a good question.
>
> Maybe we should defer the propagation of the signal until later. Chuck has
> some concerns since this can change the existing behavior of async tasks
> for other v4 operations.
>
>>>
>>> I like the patch overall, as it gets rid of a special one-off retry
>>> counter, but I too share some concerns about retrying indefinitely when
>>> a server goes missing.
>>>
>>> Propagating a signal seems like the right thing to do. Looks like
>>> rpcb_getport_done would also need to grow a check for RPC_SIGNALLED?
>>>
>>> It sounds pretty straightforward otherwise.
>> Erm, except that some of these xprts are in the context of the server.
>>
>> For instance: server-side lockd sometimes has to send messages to the
>> clients (e.g. GRANTED callbacks). Suppose we're trying to send a message
>> to the client, but it has crashed and rebooted... or maybe the client's
>> address got handed to another host that isn't doing NFS at all, so the
>> NLM service will never come back.
>>
>> This will mean that those RPCs will retry forever now in this situation.
>> I'm not sure that's what we want. Maybe we need some way to distinguish
>> between "user-driven" RPCs and those that aren't?
>>
>> As a simpler workaround, would it work to just increase the number of
>> retries here? There's nothing magical about 3 tries. ISTM that having it
>> retry enough times to cover at least a minute or two would be
>> acceptable.
>
> I'm happy with the simple workaround of just increasing the number of
> retries. This would fix the immediate problem that we're facing without
> the potential of breaking things in other areas. If you agree, then I will
> prepare the simpler patch for this.

A longer retry period is a short-term solution. I can get behind that as
long as the patch description makes clear some of the concerns that have
been brought up in this email thread.

--
Chuck Lever
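
For reference, the hard limit under discussion is the EACCES arm of
call_bind_status() in net/sunrpc/clnt.c. The fragment below is a from-memory
sketch of that logic as introduced by commit 0b760113a3a1; the exact counter
name, cap value, and delay may not match the current tree. The "simpler
workaround" would amount to raising the initial value of the per-task counter
(set at task-init time) so the rebind loop covers a minute or two instead of
roughly nine seconds.

	case -EACCES:
		/* fail immediately if this is an RPC ping */
		if (task->tk_msg.rpc_proc->p_proc == 0) {
			status = -EOPNOTSUPP;
			break;
		}
		/* hard cap: give up once the per-task counter runs out */
		if (task->tk_rebind_retry == 0)
			break;
		task->tk_rebind_retry--;
		/* wait a few seconds, then ask rpcbind again */
		rpc_delay(task, 3*HZ);
		goto retry_timeout;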
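
On the signal-propagation idea, here is a minimal caller-side sketch, assuming
a caller that starts an async RPC task and then waits on it with
rpc_wait_for_completion_task(); the surrounding error handling is illustrative
only, not an actual patch. As Jeff notes, rpcb_getport_done() would also need
an RPC_SIGNALLED() check so the rebind loop itself stops once the task has
been signalled.

	/*
	 * Illustrative only: forward an interrupt to the async task so it
	 * stops retrying in the background instead of running to completion.
	 */
	status = rpc_wait_for_completion_task(task);
	if (status < 0) {
		/* -ERESTARTSYS: the waiting process caught a signal (^C) */
		rpc_signal_task(task);
		rpc_put_task(task);
		return status;
	}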