Thanks once again, Scott,

The patch seems like a neat solution. It will declare rpc.statd dead only
after checking multiple times. Super.
I will patch the nfsserver resource file.
(It should probably be added to the original source code.)

We have a proper entry for 127.0.0.1 in the hosts file, and the
nsswitch.conf file says "files dns".
So if it checks /etc/hosts first, why would Pacemaker's check time out?
Shouldn't Pacemaker get a quick response?

Localhost entries in the /etc/hosts file -
----------------------------------------------------------------------
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
----------------------------------------------------------------------

Regards,

Indivar Nair

On Fri, Jul 12, 2019 at 7:47 PM Scott Mayhew <smayhew@xxxxxxxxxx> wrote:
>
> On Fri, 12 Jul 2019, Indivar Nair wrote:
>
> > Hi Scott,
> >
> > Thanks a lot.
> > Yes, it is a 10+ year old AD setup, which was migrated to Samba4AD
> > (samba+named) a few years ago.
> > It has a lot of stale entries, and fwd - rev lookup mismatches.
> >
> > Will start cleaning up DNS right away.
> >
> > In the meantime, is there any way to increase the rpc ping timeout?
>
> You could build the rpcinfo.c program from source (it's in the rpcbind
> git tree) with a longer timeout.
>
> > OR
> > Is there any way to temporarily disable DNS lookups by lockd?
>
> rpc.statd is doing the DNS lookups, and no there's not a way to disable it.
> Doing so would probably make the reboot notifications less reliable, and
> clients wouldn't know to reclaim locks, and the risk of data corruption
> goes up.
>
> You could prevent the DNS lookups from occurring by adding entries (properly
> formatted... see the hosts(5) man page) for all your clients into the
> /etc/hosts file on your NFS server nodes. That's assuming that your nsswitch
> configuration has "files" before "dns" for host lookups. Depending on how
> many clients you have, that might be the easiest option.
>
> Another option might be to try adding a simple retry mechanism to the part
> of the nfsserver resource agent that checks rpc.statd, something like:
>
> diff --git a/heartbeat/nfsserver b/heartbeat/nfsserver
> index bf59da98..c9dcc74e 100755
> --- a/heartbeat/nfsserver
> +++ b/heartbeat/nfsserver
> @@ -334,8 +334,13 @@ nfsserver_systemd_monitor()
>          fi
>
>          ocf_log debug "Status: rpc-statd"
> -        rpcinfo -t localhost 100024 > /dev/null 2>&1
> -        rc=$?
> +        for i in `seq 1 3`; do
> +                rpcinfo -t localhost 100024 >/dev/null 2>&1
> +                rc=$?
> +                if [ $rc -eq 0 ]; then
> +                        break
> +                fi
> +        done
>          if [ "$rc" -ne "0" ]; then
>                  ocf_exit_reason "rpc-statd is not running"
>                  return $OCF_NOT_RUNNING
>
> > Regards,
> >
> >
> > Indivar Nair
> >
> > On Thu, Jul 11, 2019 at 10:19 PM Scott Mayhew <smayhew@xxxxxxxxxx> wrote:
> > >
> > > On Thu, 11 Jul 2019, Indivar Nair wrote:
> > >
> > > > Hi ...,
> > > >
> > > > I have a 2 node Pacemaker cluster built using CentOS 7.6.1810
> > > > It serves files using NFS and Samba.
> > > >
> > > > Every 15 - 20 minutes, the rpc.statd service fails, and the whole NFS
> > > > service is restarted.
> > > > After investigation, it was found that the service fails after a few
> > > > rounds of monitoring by Pacemaker.
> > > > Pacemaker's script runs the following commands to check whether all
> > > > the services are running -
> > > > ----------------------------------------------------------------------
> > > > rpcinfo > /dev/null 2>&1
> > > > rpcinfo -t localhost 100005 > /dev/null 2>&1
> > > > nfs_exec status nfs-idmapd > $fn 2>&1
> > > > rpcinfo -t localhost 100024 > /dev/null 2>&1
> > >
> > > I would check to make sure your DNS setup is working properly.
> > > rpc.statd uses the canonical hostnames for comparison purposes whenever
> > > it gets an SM_MON or SM_UNMON request from lockd and when it gets an
> > > SM_NOTIFY from a rebooted NFS client. That involves calls to
> > > getaddrinfo() and getnameinfo() which in turn could result in requests
> > > to a DNS server. rpc.statd is single-threaded, so if it's blocked
> > > waiting for one of those requests, then it's unable to respond to the
> > > RPC ping (which has a timeout of 10 seconds) generated by the rpcinfo
> > > program.
> > >
> > > I ran into a similar scenario in the past where a client was launching
> > > multiple instances of rpc.statd. When the client does a v3 mount it
> > > does a similar RPC ping (with a more aggressive timeout) to see if
> > > rpc.statd is running... if not then it calls out to
> > > /usr/sbin/start-statd (which in the past simply called 'exec rpc.statd
> > > --no-notify' but now has additional checks). Likewise rpc.statd does
> > > its own RPC ping to make sure there's not one already running. It
> > > wound up that the user had a flakey DNS server and requests were taking
> > > over 30 seconds to time out, thus thwarting all those additional checks,
> > > and they wound up with multiple copies of rpc.statd running.
> > >
> > > You could be running into a similar scenario here and pacemaker could be
> > > deciding that rpc.statd's not running when it's actually fine.
> > >
> > > -Scott
> > >
> > > > ----------------------------------------------------------------------
> > > > The script is scheduled to check every 20 seconds.
> > > >
> > > > This is the message we get in the logs -
> > > > ----------------------------------------------------------------------
> > > > Jul 09 07:33:56 virat-nd01 rpc.mountd[51641]: check_default: access by 127.0.0.1 ALLOWED
> > > > Jul 09 07:33:56 virat-nd01 rpc.mountd[51641]: Received NULL request from 127.0.0.1
> > > > Jul 09 07:33:56 virat-nd01 rpc.mountd[51641]: check_default: access by 127.0.0.1 ALLOWED (cached)
> > > > Jul 09 07:33:56 virat-nd01 rpc.mountd[51641]: Received NULL request from 127.0.0.1
> > > > Jul 09 07:33:56 virat-nd01 rpc.mountd[51641]: check_default: access by 127.0.0.1 ALLOWED (cached)
> > > > Jul 09 07:33:56 virat-nd01 rpc.mountd[51641]: Received NULL request from 127.0.0.1
> > > > ----------------------------------------------------------------------
> > > >
> > > > After 10 seconds, we get this message -
> > > > ----------------------------------------------------------------------
> > > > Jul 09 07:34:09 virat-nd01 nfsserver(virat-nfs-daemon)[54087]: ERROR: rpc-statd is not running
> > > > ----------------------------------------------------------------------
> > > > Once we get this error, the NFS service is automatically restarted.
> > > >
> > > > The "ERROR: rpc-statd is not running" message is from Pacemaker's
> > > > monitoring script.
> > > > I have pasted that part of the script below.
> > > >
> > > > I disabled monitoring, and everything has been working fine since then.
> > > >
> > > > I can't keep the cluster monitoring disabled forever.
> > > >
> > > > Kindly help.
> > > >
> > > > Regards,
> > > >
> > > >
> > > > Indivar Nair
> > > >
> > > > Part of the pacemaker script that does the monitoring
> > > > (/usr/lib/ocf/resource.d/heartbeat/nfsserver)
> > > > =======================================================================
> > > > nfsserver_systemd_monitor()
> > > > {
> > > >     local threads_num
> > > >     local rc
> > > >     local fn
> > > >
> > > >     ocf_log debug "Status: rpcbind"
> > > >     rpcinfo > /dev/null 2>&1
> > > >     rc=$?
> > > >     if [ "$rc" -ne "0" ]; then
> > > >         ocf_exit_reason "rpcbind is not running"
> > > >         return $OCF_NOT_RUNNING
> > > >     fi
> > > >
> > > >     ocf_log debug "Status: nfs-mountd"
> > > >     rpcinfo -t localhost 100005 > /dev/null 2>&1
> > > >     rc=$?
> > > >     if [ "$rc" -ne "0" ]; then
> > > >         ocf_exit_reason "nfs-mountd is not running"
> > > >         return $OCF_NOT_RUNNING
> > > >     fi
> > > >
> > > >     ocf_log debug "Status: nfs-idmapd"
> > > >     fn=`mktemp`
> > > >     nfs_exec status nfs-idmapd > $fn 2>&1
> > > >     rc=$?
> > > >     ocf_log debug "$(cat $fn)"
> > > >     rm -f $fn
> > > >     if [ "$rc" -ne "0" ]; then
> > > >         ocf_exit_reason "nfs-idmapd is not running"
> > > >         return $OCF_NOT_RUNNING
> > > >     fi
> > > >
> > > >     ocf_log debug "Status: rpc-statd"
> > > >     rpcinfo -t localhost 100024 > /dev/null 2>&1
> > > >     rc=$?
> > > >     if [ "$rc" -ne "0" ]; then
> > > >         ocf_exit_reason "rpc-statd is not running"
> > > >         return $OCF_NOT_RUNNING
> > > >     fi
> > > >
> > > >     nfs_exec is-active nfs-server
> > > >     rc=$?
> > > >
> > > >     # Now systemctl is-active can't detect the failure of kernel
> > > >     # process like nfsd.
> > > >     # So, if the return value of systemctl is-active is 0, check the
> > > >     # threads number
> > > >     # to make sure the process is running really.
> > > >     # /proc/fs/nfsd/threads has the numbers of the nfsd threads.
> > > >     if [ $rc -eq 0 ]; then
> > > >         threads_num=`cat /proc/fs/nfsd/threads 2>/dev/null`
> > > >         if [ $? -eq 0 ]; then
> > > >             if [ $threads_num -gt 0 ]; then
> > > >                 return $OCF_SUCCESS
> > > >             else
> > > >                 return 3
> > > >             fi
> > > >         else
> > > >             return $OCF_ERR_GENERIC
> > > >         fi
> > > >     fi
> > > >
> > > >     return $rc
> > > > }
> > > > =======================================================================
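
For reference, the retry idea from the patch quoted above can also be written as a small standalone helper. This is only a minimal sketch, assuming a POSIX shell; the function name "retry" is hypothetical and is not part of the nfsserver resource agent or of ocf-shellfuncs.

```shell
#!/bin/sh
# retry ATTEMPTS CMD [ARGS...]
# Hypothetical helper (not part of the nfsserver resource agent).
# Runs CMD up to ATTEMPTS times and stops at the first success.
# Returns the exit status of the last run (1 if ATTEMPTS is 0).
retry() {
    attempts=$1
    shift
    i=1
    rc=1
    while [ "$i" -le "$attempts" ]; do
        "$@"
        rc=$?
        if [ "$rc" -eq 0 ]; then
            break
        fi
        i=$((i + 1))
    done
    return "$rc"
}

# Possible use inside a monitor function (assumes rpcinfo is installed):
#   retry 3 rpcinfo -t localhost 100024 >/dev/null 2>&1
#   rc=$?
```

Since each rpcinfo call can take up to its own timeout to fail, three attempts can take roughly three times as long in the worst case, so the monitor operation's timeout in the cluster configuration may need to allow for that.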