Please find attached the cluster.conf file and the relevant logs from both servers.

Two scenarios were executed:

1) From 11:48:00 till 11:55 (this is the normal/expected situation):
   - app01 is active; kernel panic on app01 at 11:48:00
   - app02 resumes the service normally
   - app01 re-joins the cluster at 11:50:00
   - kernel panic on app02 at 11:50:45
   - app01 starts the service normally
   - app02 re-joins the cluster correctly

2) From 11:55:30 till the end (this is where the problem appears):
   - app01 is active; kernel panic on app01 at 11:55:30
   - app02 resumes the service normally
   - app01 re-joins the cluster at 11:57:07
   - the service is manually migrated back to app01 at 11:58:40
   - the service starts normally on app01
   - kernel panic on app01 at 12:00:35
   - the service resumes normally on app02
   - app01 re-joins the cluster at 12:02:09

After that, the clustat output on node app02 is:

Cluster Status for par_clu @ Wed Jan 29 12:30:46 2014
Member Status: Quorate

 Member Name              ID   Status
 ------ ----              ---- ------
 adr-par-app01-hb         1    Online
 adr-par-app02-hb         2    Online, Local, rgmanager

 Service Name             Owner (Last)             State
 ------- ----             ----- ------             -----
 service:sv-CPAR          adr-par-app02-hb         started

and on node app01 it is:

Cluster Status for par_clu @ Wed Jan 29 12:30:43 2014
Member Status: Quorate

 Member Name              ID   Status
 ------ ----              ---- ------
 adr-par-app01-hb         1    Online, Local
 adr-par-app02-hb         2    Online

The output of "ps -ef | grep rgmanager" on node app01 is:

root      4034     1  0 12:02 ?        00:00:00 rgmanager
root      4036  4034  0 12:02 ?        00:00:00 rgmanager
root      4175  4036  0 12:02 ?        00:00:00 rgmanager

The problem is that rgmanager is no longer active on node app01 after the second scenario: its own clustat output no longer shows the "rgmanager" flag or the service section. As a workaround, killing the last rgmanager process (pid 4175) resumes rgmanager without a restart; a short command sketch of this follows below the sign-off.

Thanks for your help.

BR,
Demetres
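For completeness, this is roughly how the workaround is applied on app01. It is only a minimal sketch: the PIDs come from the ps output above and differ on every boot, and the plain SIGTERM is simply what was used here.

    # list the rgmanager processes; the one to signal is the deepest child
    # (here 4175, whose parent 4036 is itself a child of 4034)
    ps -ef | grep rgmanager

    # terminate only that child; rgmanager on app01 then resumes without a restart
    kill 4175

    # confirm that app01 shows the "rgmanager" flag and the service section again
    clustat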
cluster.conf:

<?xml version="1.0"?>
<cluster config_version="17" name="par_clu">
  <logging debug="on"/>
  <cman expected_votes="1" transport="udpu" two_node="1"/>
  <clusternodes>
    <clusternode name="adr-par-app01-hb" nodeid="1">
      <fence>
        <method name="FncSCSI">
          <device name="FenceSCSI"/>
        </method>
      </fence>
      <unfence>
        <device action="on" name="FenceSCSI"/>
      </unfence>
    </clusternode>
    <clusternode name="adr-par-app02-hb" nodeid="2">
      <fence>
        <method name="FncSCSI">
          <device name="FenceSCSI"/>
        </method>
      </fence>
      <unfence>
        <device action="on" name="FenceSCSI"/>
      </unfence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_scsi" devices="/dev/emcpowera" logfile="/var/log/cluster/fence_scsi.log" name="FenceSCSI"/>
  </fencedevices>
  <rm>
    <failoverdomains>
      <failoverdomain name="CPAR" nofailback="1" ordered="1" restricted="0">
        <failoverdomainnode name="adr-par-app01-hb" priority="1"/>
        <failoverdomainnode name="adr-par-app02-hb" priority="2"/>
      </failoverdomain>
    </failoverdomains>
    <resources>
      <lvm lv_name="lvpar" name="lvpar" self_fence="1" vg_name="vgpar"/>
      <fs device="/dev/vgpar/lvpar" force_fsck="1" force_unmount="1" fstype="ext4" mountpoint="/shared" name="fspar" self_fence="1">
        <action depth="*" interval="10" name="status"/>
      </fs>
      <script file="/etc/init.d/arserver_ICOM" name="scCPAR"/>
      <ip address="10.120.158.7" disable_rdisc="1" monitor_link="1" sleeptime="2"/>
    </resources>
    <service domain="CPAR" name="sv-CPAR" recovery="relocate">
      <lvm ref="lvpar">
        <fs ref="fspar">
          <ip ref="10.120.158.7">
            <script ref="scCPAR"/>
          </ip>
        </fs>
      </lvm>
    </service>
  </rm>
  <fence_daemon/>
  <dlm protocol="tcp"/>
</cluster>
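In case it is useful while reviewing the config above, it can be re-checked on either node with the standard RHEL 6 cluster tools (assuming the default /etc/cluster/cluster.conf path; run as root):

    # validate cluster.conf against the cluster schema
    ccs_config_validate

    # have rgmanager parse the resource tree offline
    rg_test test /etc/cluster/cluster.conf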
Attachment:
messages_app01.txt.gz
Description: GNU Zip compressed data
Attachment:
messages_app02.txt.gz
Description: GNU Zip compressed data
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster