On Wed, Oct 5, 2011 at 4:39 PM, Lon Hohberger <lhh@xxxxxxxxxx> wrote:
> On 10/04/2011 10:23 PM, Ofer Inbar wrote:
>>
>> On a 3 node cluster running:
>> cman-2.0.115-34.el5_5.3
>> rgmanager-2.0.52-6.el5.centos.8
>> openais-0.80.6-16.el5_5.9
>>
>> We have a custom resource, "dn", for which I wrote the resource agent.
>> The service has three resources: a virtual IP (using ip.sh) and two dn
>> children.
>
> You should be able to disable then re-enable - that is, you shouldn't need
> to restart rgmanager to break the recovering state.
>
> There's this related bug, but it should have been fixed in 2.0.52-6:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=530409
>

I have the same problem with version 2.0.52-6 on RHEL 5; I'll try to get a
dump when it happens again (I didn't know about the USR1 signal trick).

# rpm -aq | grep -e rgmanager -e openais -e cman
cman-2.0.115-34.el5_5.4
rgmanager-2.0.52-6.el5_5.8
openais-0.80.6-16.el5_5.9

Thanks,
Juanra

>> Normally, when one of the dn instances fails its status check,
>> rgmanager stops the service (stops dn_a and dn_b, then stops the IP),
>> then relocates to another node and starts the service there.
>
> That's what I'd expect to happen.
>
>> Several hours ago, one of the dn instances failed its status check.
>> rgmanager stopped it and marked the service "recovering", but then did
>> not seem to try to start it on any node. It just stayed down for
>> hours until I logged in to look at it.
>>
>> Until 17:22 today, the service was running on node1. Here's what it logged:
>>
>> Oct 4 17:22:12 clustnode1 clurgmgrd: [517]:<err> Monitoring Service dn:dn_b> Service Is Not Running
>> Oct 4 17:22:12 clustnode1 clurgmgrd[517]:<notice> status on dn "dn_b" returned 1 (generic error)
>> Oct 4 17:22:12 clustnode1 clurgmgrd[517]:<notice> Stopping service service:dn
>> Oct 4 17:22:12 clustnode1 clurgmgrd: [517]:<info> Stopping Service dn:dn_b
>> Oct 4 17:22:12 clustnode1 clurgmgrd: [517]:<notice> Checking if stopped: check_pid_file /dn/dn_b/dn_b.pid
>> Oct 4 17:22:14 clustnode1 clurgmgrd: [517]:<info> Stopping Service dn:dn_b> Succeed
>> Oct 4 17:22:14 clustnode1 clurgmgrd: [517]:<info> Stopping Service dn:dn_a
>> Oct 4 17:22:15 clustnode1 clurgmgrd: [517]:<notice> Checking if stopped: check_pid_file /dn/dn_a/dn_a.pid
>> Oct 4 17:22:17 clustnode1 clurgmgrd: [517]:<info> Stopping Service dn:dn_a> Succeed
>> Oct 4 17:22:17 clustnode1 clurgmgrd: [517]:<info> Removing IPv4 address 10.6.9.136/23 from eth0
>> Oct 4 17:22:27 clustnode1 clurgmgrd[517]:<notice> Service service:dn is recovering
>>
>> At around that time, node2 also logged this:
>>
>> Oct 4 17:21:19 clustnode2 ccsd[5584]: Unable to read complete comm_header_t.
>> Oct 4 17:21:29 clustnode2 ccsd[5584]: Unable to read complete comm_header_t.
>
> It may be related; I doubt it.
>
>> Again, this looks the same on all three nodes.
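
For anyone else who ends up in this state: the disable/re-enable cycle Lon
suggests above should just be two clusvcadm calls. A rough sketch, assuming
the service name "dn" from the cluster.conf excerpt below and the node names
from the logs:

  clusvcadm -d service:dn                  # disable the stuck service; this breaks the "recovering" state
  clusvcadm -e service:dn                  # re-enable it and let rgmanager pick a node
  clusvcadm -e service:dn -m clustnode2    # or ask for a specific member instead
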
>> Here's the resource section of cluster.conf (with the values of some
>> of the arguments to my custom resource modified so as not to expose
>> the actual username, path, or port number):
>>
>> <rm log_level="6">
>>   <service autostart="1" name="dn" recovery="relocate">
>>     <ip address="10.6.9.136" monitor_link="1">
>>       <dn user="username" dninstall="/dn/path" name="dn_a" monitoringport="portnum"/>
>>       <dn user="username" dninstall="/dn/path" name="dn_b" monitoringport="portnum"/>
>>     </ip>
>>   </service>
>> </rm>
>>
>> Any ideas why it might be in this state, where everything is
>> apparently fine except that the service is "recovering" and rgmanager
>> isn't trying to do anything about it and isn't logging any complaints?
>
> The only cause for this is if we send a message but it either doesn't make
> it or we get a weird return code -- I think rgmanager logs it, though, so
> this could be a new issue.
>
>> Attached: strace -fp output of the clurgmgrd processes on node1 and node2
>
> The strace data is not likely to be useful, but a dump from rgmanager would
> be. If you get into this state again, do this:
>
>   kill -USR1 `pidof -s clurgmgrd`
>
> Then look at /tmp/rgmanager-dump* (2.0.x) or /var/lib/cluster/rgmanager-dump
> (3.x.y).
>
> -- Lon
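
In case it saves someone a lookup, collecting the dump Lon describes on a
2.0.x cluster comes down to something like the lines below. The signal and
the dump location are the ones he gives above; the short wait and the
filename glob are just illustrative:

  kill -USR1 `pidof -s clurgmgrd`    # ask rgmanager to dump its internal state
  sleep 2                            # give it a moment to write the file
  ls -lt /tmp/rgmanager-dump*        # 2.0.x writes the dump under /tmp
  less /tmp/rgmanager-dump*          # review / attach it when following up on the list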