Isn't it peculiar that this hasn't been noticed before? This problem seems to have existed since the release of RHEL 4.2 (as far as I can tell), or maybe earlier. I hope this issue gets solved as soon as possible. By the way, thanks for the workaround, Marc.

> On Mon, Feb 13, 2006 at 11:56:20AM -0800, Marc Lewis wrote:
>> On Sat, Feb 11, 2006 at 11:57:32PM +0200, Omer Faruk Sen wrote:
>> >
>> > Hi,
>> >
>> > I have a problem with the Red Hat Cluster Suite. I have a two-node
>> > test cluster. The cluster master is cluster2, which is running vsftpd
>> > and mysql as services. I manually edited vsftpd.conf so that it could
>> > not start on cluster2, then killed the vsftpd process. After that, the
>> > cluster didn't fail over to cluster1, and when I corrected vsftpd.conf
>> > a few seconds later, cluster2 didn't restart the service. What I want
>> > to ask is: does Red Hat Cluster support service status checks so that
>> > it can restart a failed service, and does it support failover when one
>> > resource doesn't work?
>> >
>> > Best regards,
>>
>> I'm seeing similar issues here. The script entry doesn't seem to do
>> anything when checking status.
>>
>> For example, we have a MySQL service defined with an IP address, a
>> shared SAN partition, and the /etc/init.d/mysqld script.
>>
>> The service starts up and shuts down fine when done manually via
>> clusvcadm, but if I kill the mysql daemon with the script or manually,
>> clurgmgrd doesn't seem to care. It just runs its status check, which
>> does report it as "stopped", without ever restarting the service.
>
> Just wanted to follow up and say that I've solved the status check
> problem, sort of. I decided to play with the exit values of various init
> scripts to see what, if any, effect they would have on clurgmgrd, and
> managed to get something cobbled together that works. It's not the best
> solution, but it should do.
> To get these scripts to work, it's important that "status" and "stop"
> return values that clurgmgrd can deal with.
>
> If "status" returns a non-zero value (i.e. an error), then clurgmgrd will
> think the service has failed and attempt to restart it. It does this by
> issuing a "stop" command, taking down the other resources associated with
> it, and then bringing them all back up.
>
> It's the "stop" command that can cause some problems. For example, in the
> service defined above, I have the service "MySQL", which has the IP
> address, shared storage and the /etc/init.d/mysqld script. I start it up
> using "clusvcadm -e MySQL" and all is well; it brings everything up in
> the correct order and MySQL runs fine. Every 30 seconds, I see it running
> a "status" check in the syslog. So far so good.
>
> Now, I have modified the mysqld script to return the value of "status"
> from /etc/init.d/functions as its exit code. So, when everything is
> running fine, it returns "0" and clurgmgrd is happy. If I do a
> "killall -9 mysqld_safe mysqld", status will now return a value of 2,
> which is an error. clurgmgrd will attempt to restart it by issuing the
> "stop" command to the script. This is where we run into problems.
>
> Since the service is already dead, the init script returns an error when
> trying to stop the service. clurgmgrd fails the service, and the service
> is now down.
>
> The only way I've found around this is to force "stop" to return 0 no
> matter what. This way clurgmgrd will believe it has succeeded in shutting
> down the service and will restart it.
>
> My reasoning is that it is better to have it fail on startup than to have
> it fail stopping a service that is already dead. I'm sure there are other
> problems with this method, but I haven't identified them yet.
>
>> Also, I've seen clurgmgrd die without logging anything anywhere. I'll
>> just check the cluster and it won't be running.
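For anyone following along, the init script change Marc describes could be sketched roughly like this. The daemon name and the pid-file based checks are illustrative stand-ins only; a real RHEL init script would use the status() and killproc() helpers from /etc/init.d/functions:

```shell
#!/bin/sh
# Sketch of the workaround: pass the real "status" exit code through to
# clurgmgrd, but force "stop" to report success even when the daemon is
# already dead. DAEMON, PIDFILE and the checks below are illustrative
# assumptions standing in for the /etc/init.d/functions helpers.

DAEMON=mysqld
PIDFILE="/var/run/$DAEMON.pid"

svc_status() {
    # Stand-in for `status $DAEMON`: 0 = running, 3 = stopped (LSB-style).
    if [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
        return 0
    fi
    return 3
}

svc_stop() {
    # Try to stop the daemon, but always return 0 so that clurgmgrd's
    # restart cycle does not fail on a service that is already dead.
    if [ -f "$PIDFILE" ]; then
        kill "$(cat "$PIDFILE")" 2>/dev/null || true
    fi
    return 0
}

case "$1" in
    status) svc_status ;;
    stop)   svc_stop ;;
esac
```

The trade-off is exactly the one stated above: a stop that lies about success can mask a genuinely stuck daemon, but it lets clurgmgrd get on with restarting a dead one.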
>> All of the services stay running, but the manager is dead. Restarting it
>> is problematic, since it will restart each of the services, causing a
>> brief interruption.
>>
>> Anyone have any ideas on how to solve either of these two problems? I've
>> been waiting to deploy the cluster we've put together until I could
>> resolve these two issues, but have run out of things to try.
>
> I'm still seeing clurgmgrd die periodically for no reason, though. I may
> have to write another script to monitor it as well and run that out of
> cron every so often. That doesn't seem like a very good solution, though,
> since it restarts all of the services that are running on that node.
>
> - Marc
>
> --
> Marc Lewis
> Blarg! Online Services, Inc.
>
> --
> Linux-cluster@xxxxxxxxxx
> https://www.redhat.com/mailman/listinfo/linux-cluster

--
Omer Faruk Sen
http://www.faruk.net
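For what it's worth, the cron-run monitor Marc mentions could look something like the sketch below. The process-table scan is Linux-specific, and the restart command is left as a comment because it carries the caveat discussed above (restarting rgmanager briefly interrupts every service on that node):

```shell
#!/bin/sh
# Sketch of a cron watchdog for clurgmgrd: if the daemon is missing from
# the process table, restart rgmanager. Linux-specific (/proc scan); the
# restart command is commented out as an assumption, and note that it
# would interrupt all services on this node.

rgmgr_alive() {
    # Look for an exact command-name match in /proc/<pid>/comm.
    for p in /proc/[0-9]*/comm; do
        if [ "$(cat "$p" 2>/dev/null)" = "clurgmgrd" ]; then
            return 0
        fi
    done
    return 1
}

if ! rgmgr_alive; then
    echo "clurgmgrd is not running; restarting rgmanager" >&2
    # A real script would do something like: service rgmanager restart
fi
```

Run out of cron every few minutes, this at least bounds how long the node sits without a resource manager, at the cost of the service interruption Marc already points out.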