Re: Lockd error message is unclear.

"J. Bruce Fields" <bfields@xxxxxxxxxxxx> · Tue, 27 Apr 2021 15:34:52 -0400

On Tue, Apr 27, 2021 at 09:03:11PM +0200, Rogier Wolff wrote:
> 
> Hi, 
> 
> Two things..... 
> 
> I got: 
> 
>    lockd: cannot monitor <client> 
> 
> in the logfile and the client was terribly slow/not working at all.
> 
> everything pointed to a lockd problem... 
> 
> In the end... it turns out that my rpc.statd stopped working.  I had
> to go and download the sources to figure this out... I would firstly
> suggest to improve the error message to give others running into this
> more hints as to where to look.
> 
> The error message on line 169 of lockd.c could read: 
> 
> 	lockd: Error in the rpc to rpc.statd to monitor %s\n
> 
> Would it be an idea to print the res.status error code? 

I'm not sure about the wording, but including the error code sounds like
a good idea.  (Would that have made a difference in your case?)
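
To make the suggestion concrete, here is a rough userspace sketch of what a
message carrying both codes might look like. This is not the kernel code:
format_monitor_error() and its parameters are hypothetical names made up for
illustration, and the wording of the message is only one possibility.

```c
#include <stdio.h>

/*
 * Hypothetical helper, NOT taken from lockd: builds a log line that
 * includes the client name, the RPC-level status, and the status
 * returned in the statd reply (res.status in the mail above).
 */
static int format_monitor_error(char *buf, size_t len, const char *client,
				int rpc_status, unsigned int res_status)
{
	return snprintf(buf, len,
			"lockd: cannot monitor %s (rpc status %d, statd result %u)\n",
			client, rpc_status, res_status);
}
```

With rpc_status set to -111 (ECONNREFUSED on Linux), a line like this would
point directly at rpc.statd being unreachable.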

> That said... 
> 
> When this situation is going on, the client grinds to a halt, and
> lockd seems "stuck" in D state. I tried killing or stracing it, to try
> to clear the error, before I found out it is a kernel daemon...
> 
> When this failure happens, I get the impression that lockd keeps on
> trying to be "of service", retrying operations that are bound to
> fail. So maybe the error should be cached, and then immediately
> handled instead of making the client grind to a halt. (it is the (one
> second?) timeout in nsm_mon_unmon and the big backlog of requests that
> result in the same call and timeout that frustrate the client... )

The -ECONNREFUSED case?

I'm not sure why it retries there.  Maybe just to allow stopping and
starting rpc.statd (e.g. for upgrades) without failing operations?
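
One way to picture the "cache the error" idea while still tolerating a statd
restart is a failure cache with a short expiry. The following is a userspace
sketch only, under the assumption that a few seconds of fail-fast behaviour
would be acceptable; struct statd_fail_cache, cache_failure() and
cached_failure() are hypothetical names and do not exist in lockd.

```c
#include <time.h>

/* Hypothetical fail-fast cache; none of these names exist in lockd. */
struct statd_fail_cache {
	int    last_error;	/* 0 means no cached failure */
	time_t expires;		/* cached failure is ignored from this time on */
};

/* Remember a failed upcall for ttl seconds. */
static void cache_failure(struct statd_fail_cache *c, int err,
			  time_t now, time_t ttl)
{
	c->last_error = err;
	c->expires = now + ttl;
}

/*
 * Return the cached error while it is still fresh, so callers can fail
 * immediately. Return 0 once it has expired, meaning the caller should
 * try the upcall to statd again.
 */
static int cached_failure(struct statd_fail_cache *c, time_t now)
{
	if (c->last_error != 0 && now < c->expires)
		return c->last_error;
	c->last_error = 0;
	return 0;
}
```

Requests arriving inside the expiry window would get the error at once rather
than each waiting out its own timeout, and the first request after the window
would retry, so a restarted rpc.statd would be picked up again.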

--b.