> On Apr 6, 2023, at 3:36 PM, Dai Ngo <dai.ngo@xxxxxxxxxx> wrote:
>
> Hi Jeff,
>
> Thank you for taking a look at the patch.
>
> On 4/6/23 11:10 AM, Jeff Layton wrote:
>> On Thu, 2023-04-06 at 13:33 -0400, Jeff Layton wrote:
>>> On Tue, 2023-03-14 at 09:19 -0700, dai.ngo@xxxxxxxxxx wrote:
>>>> On 3/8/23 11:03 AM, dai.ngo@xxxxxxxxxx wrote:
>>>>> On 3/8/23 10:50 AM, Chuck Lever III wrote:
>>>>>>> On Mar 8, 2023, at 1:45 PM, Dai Ngo <dai.ngo@xxxxxxxxxx> wrote:
>>>>>>>
>>>>>>> Currently call_bind_status places a hard limit of 3 on the number of
>>>>>>> retries on an EACCES error. This limit was added to accommodate the
>>>>>>> behavior of a buggy server that keeps returning garbage when the NLM
>>>>>>> daemon is killed on the NFS server. However, this change causes
>>>>>>> problems for other servers that take a little longer than 9 seconds
>>>>>>> for the port mapper to
>>>>>>>
>>
>> Actually, the EACCES error means that the host doesn't have the port
>> registered.
>
> Yes, our SQA team runs a stress lock test and restarts the NFS server.
> Sometimes the NFS server starts up and registers with the port mapper in
> time and there is no problem, but occasionally it is late coming up,
> causing this EACCES error.
>
>> That could happen if (e.g.) the host had an NFSv3 mount up
>> with an NLM connection and then crashed and rebooted and didn't remount
>> it.
>
> Can you please explain this scenario? I don't quite follow it. If the v3
> client crashed and did not remount the export, then how can the client try
> to access/lock anything on the server? I must be missing something here.
>
>>
>>>>>>> become ready when the NFS server is restarted.
>>>>>>>
>>>>>>> This patch removes this hard-coded limit and lets the RPC layer handle
>>>>>>> the retry according to whether the export is soft or hard mounted.
>>>>>>>
>>>>>>> To avoid the hang with a buggy server, the client can use a soft mount
>>>>>>> for the export.
>>>>>>>
>>>>>>> Fixes: 0b760113a3a1 ("NLM: Don't hang forever on NLM unlock requests")
>>>>>>> Reported-by: Helen Chao <helen.chao@xxxxxxxxxx>
>>>>>>> Tested-by: Helen Chao <helen.chao@xxxxxxxxxx>
>>>>>>> Signed-off-by: Dai Ngo <dai.ngo@xxxxxxxxxx>
>>>>>> Helen is the royal queen of ^C ;-)
>>>>>>
>>>>>> Did you try ^C on a mount while it waits for a rebind?
>>>>> She uses a test script that restarts the NFS server while the NLM lock
>>>>> test is running. The failure is random: sometimes it fails and sometimes
>>>>> it passes, depending on when the LOCK/UNLOCK requests come in, so I
>>>>> think it's hard to time the ^C, but I will ask.
>>>> We did the test with ^C and here is what we found.
>>>>
>>>> For a synchronous RPC task the signal was delivered to the RPC task and
>>>> the task exited with -ERESTARTSYS from __rpc_execute as expected.
>>>>
>>>> For an asynchronous RPC task, the process that invoked the RPC task to
>>>> send the request detected the signal in rpc_wait_for_completion_task and
>>>> exited with -ERESTARTSYS. However, the async RPC task was allowed to
>>>> continue running to completion. So if the async RPC task was retrying an
>>>> operation and the NFS server was down, it would retry forever on a hard
>>>> mount, or until the NFS server came back up.
>>>>
>>>> The question for the list is: should we propagate the signal to the async
>>>> task via rpc_signal_task to stop its execution, or just leave it alone
>>>> as is?
>>>>
>>> That is a good question.
>
> Maybe we should defer the propagation of the signal until later. Chuck has
> some concerns since this can change the existing behavior of async tasks
> for other v4 operations.
>
>>>
>>> I like the patch overall, as it gets rid of a special one-off retry
>>> counter, but I too share some concerns about retrying indefinitely when
>>> a server goes missing.
>>>
>>> Propagating a signal seems like the right thing to do. Looks like
>>> rpcb_getport_done would also need to grow a check for RPC_SIGNALLED?
>>>
>>> It sounds pretty straightforward otherwise.
>> Erm, except that some of these xprts are in the context of the server.
>>
>> For instance: server-side lockd sometimes has to send messages to the
>> clients (e.g. GRANTED callbacks). Suppose we're trying to send a message
>> to the client, but it has crashed and rebooted... or maybe the client's
>> address got handed to another host that isn't doing NFS at all, so the
>> NLM service will never come back.
>>
>> This will mean that those RPCs will retry forever now in this situation.
>> I'm not sure that's what we want. Maybe we need some way to distinguish
>> between "user-driven" RPCs and those that aren't?
>>
>> As a simpler workaround, would it work to just increase the number of
>> retries here? There's nothing magical about 3 tries. ISTM that having it
>> retry enough times to cover at least a minute or two would be
>> acceptable.
>
> I'm happy with the simple workaround of just increasing the number of
> retries. This would fix the immediate problem that we're facing without
> the potential of breaking things in other areas. If you agree, then I will
> prepare the simpler patch for this.

A longer retry period is a short-term solution. I can get behind that as
long as the patch description makes clear some of the concerns that have
been brought up in this email thread.

--
Chuck Lever
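
For reference, the hard limit under discussion is the EACCES arm of
call_bind_status() in net/sunrpc/clnt.c. The fragment below is a from-memory
sketch of that logic as introduced by commit 0b760113a3a1; the exact counter
name, cap value, and delay may not match the current tree. The "simpler
workaround" would amount to raising the initial value of the per-task counter
(set at task-init time) so the rebind loop covers a minute or two instead of
roughly nine seconds.

	case -EACCES:
		/* fail immediately if this is an RPC ping */
		if (task->tk_msg.rpc_proc->p_proc == 0) {
			status = -EOPNOTSUPP;
			break;
		}
		/* hard cap: give up once the per-task counter runs out */
		if (task->tk_rebind_retry == 0)
			break;
		task->tk_rebind_retry--;
		/* wait a few seconds, then ask rpcbind again */
		rpc_delay(task, 3*HZ);
		goto retry_timeout;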
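
On the signal-propagation idea, here is a minimal caller-side sketch, assuming
a caller that starts an async RPC task and then waits on it with
rpc_wait_for_completion_task(); the surrounding error handling is illustrative
only, not an actual patch. As Jeff notes, rpcb_getport_done() would also need
an RPC_SIGNALLED() check so the rebind loop itself stops once the task has
been signalled.

	/*
	 * Illustrative only: forward an interrupt to the async task so it
	 * stops retrying in the background instead of running to completion.
	 */
	status = rpc_wait_for_completion_task(task);
	if (status < 0) {
		/* -ERESTARTSYS: the waiting process caught a signal (^C) */
		rpc_signal_task(task);
		rpc_put_task(task);
		return status;
	}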