On Tue, Apr 27, 2021 at 09:03:11PM +0200, Rogier Wolff wrote:
>
> Hi,
>
> Two things.....
>
> I got:
>
> lockd: cannot monitor <client>
>
> in the logfile and the client was terribly slow/not working at all.
>
> everything pointed to a lockd problem...
>
> In the end... it turns out that my rpc.statd stopped working. I had
> to go and download the sources to figure this out... I would firstly
> suggest improving the error message to give others running into this
> more hints as to where to look.
>
> The error message on line 169 of lockd.c could read:
>
> lockd: Error in the rpc to rpc.statd to monitor %s\n
>
> Would it be an idea to print the res.status error code?

I'm not sure about the wording, but including the error code sounds like
a good idea.  (Would that have made a difference in your case?)

> That said...
>
> When this situation is going on, the client grinds to a halt, and
> lockd seems "stuck" in D state. I tried killing or stracing it, to try
> to clear the error, before I found out it is a kernel daemon...
>
> When this failure happens, I get the impression that lockd keeps on
> trying to be "of service", retrying operations that are bound to
> fail. So maybe the error should be cached, and then immediately
> handled instead of making the client grind to a halt. (it is the (one
> second?) timeout in nsm_mon_unmon and the big backlog of requests that
> result in the same call and timeout that frustrate the client... )

The -ECONNREFUSED case?  I'm not sure why it retries there.  Maybe just
to allow stopping and starting rpc.statd (e.g. for upgrades) without
failing operations?

--b.
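
To make the "cache the error" suggestion concrete, here is a rough
user-space sketch in C. It is not kernel code and not how lockd works
today: struct fail_cache, FAIL_CACHE_SECS, slow_monitor_upcall() and
monitor_client() are all hypothetical names made up for illustration,
and the stand-in upcall always returns -ECONNREFUSED. The idea is only
that after one failed call, later requests inside a short window fail
immediately instead of each waiting out the timeout, and that the log
line carries the error code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical: how long a cached failure stays valid, in seconds. */
#define FAIL_CACHE_SECS 5

struct fail_cache {
	int err;        /* 0 = nothing cached, otherwise a negative errno */
	time_t expires; /* the cached error is ignored from this time on */
};

/* Counts how often the (simulated) slow upcall was really made. */
static int upcall_count;

/*
 * Stand-in for the upcall to rpc.statd. In this sketch it always
 * behaves as if rpc.statd were not running.
 */
static int slow_monitor_upcall(const char *client)
{
	(void)client;
	upcall_count++;
	return -ECONNREFUSED;
}

/*
 * Ask for a client to be monitored. A recent failure is returned
 * straight from the cache; otherwise the upcall is made and a failure
 * is remembered for FAIL_CACHE_SECS seconds.
 */
static int monitor_client(struct fail_cache *fc, const char *client,
			  time_t now)
{
	int err;

	assert(fc != NULL);

	if (fc->err && now < fc->expires)
		return fc->err; /* fail fast, no upcall */

	err = slow_monitor_upcall(client);
	if (err) {
		fc->err = err;
		fc->expires = now + FAIL_CACHE_SECS;
		/* Include the error code, as suggested above. */
		fprintf(stderr, "lockd: cannot monitor %s (error %d)\n",
			client, err);
	} else {
		fc->err = 0;
	}
	return err;
}
```

With a zero-initialised struct fail_cache, a first call at time t makes
the upcall and logs the error; a second call at t + 1 returns the cached
-ECONNREFUSED without another upcall; a call at t + FAIL_CACHE_SECS
tries the upcall again. The trade-off is the one raised above: while a
failure is cached, restarting rpc.statd (e.g. for an upgrade) would
still fail operations until the window expires, so the window would
have to be kept short.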