Re: service stuck in "recovering", no attempt to restart

On 10/04/2011 10:23 PM, Ofer Inbar wrote:
> On a 3-node cluster running:
>    cman-2.0.115-34.el5_5.3
>    rgmanager-2.0.52-6.el5.centos.8
>    openais-0.80.6-16.el5_5.9
>
> We have a custom resource, "dn", for which I wrote the resource agent.
> The service has three resources: a virtual IP (using ip.sh) and two dn children.
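
For anyone unfamiliar with them: an rgmanager resource agent is a script under /usr/share/cluster/ that handles start/stop/status actions and reads its parameters as OCF_RESKEY_* environment variables. A rough sketch of the general shape, purely illustrative and not the actual dn agent (the pidfile layout is guessed from the log lines below):

   #!/bin/bash
   # Illustrative sketch only -- not the real dn agent.  Parameters
   # (user, dninstall, name, monitoringport) arrive as OCF_RESKEY_* vars.
   PIDFILE="/dn/${OCF_RESKEY_name}/${OCF_RESKEY_name}.pid"   # layout assumed from the logs

   check_pid_file() {
       # succeed only if the pidfile exists and the process it names is alive
       [ -f "$1" ] || return 1
       kill -0 "$(cat "$1")" 2>/dev/null
   }

   case "$1" in
       start)          exit 0 ;;                              # start the dn instance, write $PIDFILE
       stop)           exit 0 ;;                              # stop it; rgmanager re-checks the pidfile afterwards
       status|monitor) check_pid_file "$PIDFILE" || exit 1 ;; # a nonzero return here triggers recovery
       meta-data)      exit 0 ;;                              # print the agent's metadata XML
       *)              exit 1 ;;
   esac
   exit 0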

You should be able to disable then re-enable - that is, you shouldn't need to restart rgmanager to break the recovering state.
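
With the service named "dn" as in the cluster.conf below, that would be something along the lines of:

   clusvcadm -d dn     # disable the service (clears the stuck state)
   clusvcadm -e dn     # re-enable it; add "-m <node>" to start it on a specific member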

There's this related bug, but it should have been fixed in 2.0.52-6:

  https://bugzilla.redhat.com/show_bug.cgi?id=530409

> Normally, when one of the dn instances fails its status check,
> rgmanager stops the service (stops dn_a and dn_b, then stops the IP),
> then relocates to another node and starts the service there.

That's what I'd expect to happen.

> Several hours ago, one of the dn instances failed its status check;
> rgmanager stopped it and marked the service "recovering", but then did
> not seem to try to start it on any node.  It just stayed down for
> hours until I logged in to look at it.
>
> Until 17:22 today, the service was running on node1.  Here's what it logged:
>
> Oct  4 17:22:12 clustnode1 clurgmgrd: [517]: <err> Monitoring Service dn:dn_b > Service Is Not Running
> Oct  4 17:22:12 clustnode1 clurgmgrd[517]: <notice> status on dn "dn_b" returned 1 (generic error)
> Oct  4 17:22:12 clustnode1 clurgmgrd[517]: <notice> Stopping service service:dn
> Oct  4 17:22:12 clustnode1 clurgmgrd: [517]: <info> Stopping Service dn:dn_b
> Oct  4 17:22:12 clustnode1 clurgmgrd: [517]: <notice> Checking if stopped: check_pid_file /dn/dn_b/dn_b.pid
> Oct  4 17:22:14 clustnode1 clurgmgrd: [517]: <info> Stopping Service dn:dn_b > Succeed
> Oct  4 17:22:14 clustnode1 clurgmgrd: [517]: <info> Stopping Service dn:dn_a
> Oct  4 17:22:15 clustnode1 clurgmgrd: [517]: <notice> Checking if stopped: check_pid_file /dn/dn_a/dn_a.pid
> Oct  4 17:22:17 clustnode1 clurgmgrd: [517]: <info> Stopping Service dn:dn_a > Succeed
> Oct  4 17:22:17 clustnode1 clurgmgrd: [517]: <info> Removing IPv4 address 10.6.9.136/23 from eth0
> Oct  4 17:22:27 clustnode1 clurgmgrd[517]: <notice> Service service:dn is recovering
>
> At around that time, node2 also logged this:
>
> Oct  4 17:21:19 clustnode2 ccsd[5584]: Unable to read complete comm_header_t.
> Oct  4 17:21:29 clustnode2 ccsd[5584]: Unable to read complete comm_header_t.

It may be related, but I doubt it.


> Again, this looks the same on all three nodes.

> Here's the resource section of cluster.conf (with the values of some
> of the arguments to my custom resource modified so as not to expose
> the actual username, path, or port number):
>
> <rm log_level="6">
>    <service autostart="1" name="dn" recovery="relocate">
>      <ip address="10.6.9.136" monitor_link="1">
>        <dn user="username" dninstall="/dn/path" name="dn_a" monitoringport="portnum"/>
>        <dn user="username" dninstall="/dn/path" name="dn_b" monitoringport="portnum"/>
>      </ip>
>    </service>
> </rm>
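
As an aside, the dn agent and this configuration can also be exercised outside of a running rgmanager with rg_test (per the RHEL 5 cluster documentation; adjust paths as needed):

   rg_test rules                                              # check that the dn resource rules load
   rg_test test /etc/cluster/cluster.conf start service dn    # run the service's start path by hand
   rg_test test /etc/cluster/cluster.conf stop service dn     # ... and the stop path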

> Any ideas why it might be in this state, where everything is
> apparently fine except that the service is stuck in "recovering",
> rgmanager isn't trying to do anything about it, and it isn't logging
> any complaints?

The only cause for this is if we send a message and it either doesn't make it or we get back a weird return code -- I think rgmanager logs that case, though, so this could be a new issue.

> Attached: strace -fp output of the clurgmgrd processes on node1 and node2

The strace data is not likely to be useful, but a dump from rgmanager would be. If you get into this state again, do this:

   kill -USR1 `pidof -s clurgmgrd`

Then look at /tmp/rgmanager-dump* (2.0.x) or /var/lib/cluster/rgmanager-dump (3.x.y)
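
For example, to trigger it and keep a copy to post here (assuming the 2.0.x path above):

   kill -USR1 `pidof -s clurgmgrd`
   sleep 2                                   # give clurgmgrd a moment to write the dump
   tar czf /var/tmp/rgmanager-dump-`hostname`.tar.gz /tmp/rgmanager-dump*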

-- Lon

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster

