Good day,

I'm fairly new to the cluster world, so I apologize in advance for silly questions, and thank you for any help.

We decided to use this cluster solution in order to share GFS2 mounts across servers. We have a newly set up 7-node cluster that is acting oddly. It has 3 VMware guests and 4 physical hosts (Dells with iDRACs), all running CentOS 6.6. Fencing works: I can run fence_node <node> and it fences successfully. I do not have the GFS2 mounts in the cluster yet.

When I don't touch the servers, the cluster looks perfect, with all nodes online. But when I start testing fencing, I end up with split brain between some of the nodes, and they won't automatically fence each other once they get into that state. In corosync.log on the node that gets split out I see the totem chatter, but it seems confused and just keeps repeating the below over and over:

Dec 01 12:39:15 corosync [TOTEM ] Retransmit List: 22 24 25 26 27 28 29 2a 2b 2c
Dec 01 12:39:17 corosync [TOTEM ] Retransmit List: 22 24 25 26 27 28 29 2a 2b 2c
Dec 01 12:39:19 corosync [TOTEM ] Retransmit List: 22 24 25 26 27 28 29 2a 2b 2c
Dec 01 12:39:39 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b
Dec 01 12:39:39 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b 21 23 24 25 26 27 28 29 2a 2b 32
..
..
..
Dec 01 12:54:49 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b 1d 1f 20 21 22 23 24 25 26 27 2e 30 31 32 37 38 39 3a 3b 3c
Dec 01 12:54:50 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b 1d 1f 20 21 22 23 24 25 26 27 2e 30 31 32 37 38 39 3a 3b 3c
Dec 01 12:54:50 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b 1d 1f 20 21 22 23 24 25 26 27 2e 30 31 32 37 38 39 3a 3b 3c

I can manually fence the node, and it still comes back online with the same issue. I end up having to take the whole cluster down, sometimes forcing a reboot on some nodes, and then bring it back up. It takes a good part of the day just to bring the whole cluster online again.

I used ccs -h node --sync --activate and double-checked that all nodes are using the same version of the cluster.conf file. One issue I did notice is that when one of the VMware guests is rebooted, its clock comes up slightly skewed (about 6 seconds), but I thought I read somewhere that a skew that minor shouldn't impact the cluster.

We have multicast enabled on the interfaces:

UP BROADCAST RUNNING MASTER MULTICAST MTU:9000 Metric:1

and we have been told by our network team that IGMP snooping is disabled. With tcpdump I can see the multicast traffic chatter.
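For what it's worth, the next check I was planning, to rule out multicast problems between specific pairs of nodes, is an omping run across all seven members. My understanding is that omping has to be started on every node at roughly the same time, and that it reports unicast and multicast responses separately, so I was going to run something along these lines on each node:

# omping -c 60 archive1-uat.domain.com admin1-uat.domain.com mgmt1-uat.domain.com map1-uat.domain.com map2-uat.domain.com cache1-uat.domain.com data1-uat.domain.com

If that shows multicast loss only between the nodes that end up split from each other, I'm assuming that points at the network rather than the cluster configuration, but please correct me if that isn't a sensible test.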
Right now:

[root@data1-uat ~]# clustat
Cluster Status for projectuat @ Mon Dec 1 13:56:39 2014
Member Status: Quorate

 Member Name                      ID   Status
 ------ ----                      ---- ------
 archive1-uat.domain.com             1 Online
 admin1-uat.domain.com               2 Online
 mgmt1-uat.domain.com                3 Online
 map1-uat.domain.com                 4 Online
 map2-uat.domain.com                 5 Online
 cache1-uat.domain.com               6 Online
 data1-uat.domain.com                8 Online, Local    ** Has itself as online **

[root@map1-uat ~]# clustat
Cluster Status for projectuat @ Mon Dec 1 13:57:07 2014
Member Status: Quorate

 Member Name                      ID   Status
 ------ ----                      ---- ------
 archive1-uat.domain.com             1 Online
 admin1-uat.domain.com               2 Online
 mgmt1-uat.domain.com                3 Online
 map1-uat.domain.com                 4 Offline, Local
 map2-uat.domain.com                 5 Online
 cache1-uat.domain.com               6 Online
 data1-uat.domain.com                8 Online

[root@cache1-uat ~]# clustat
Cluster Status for projectuat @ Mon Dec 1 13:57:39 2014
Member Status: Quorate

 Member Name                      ID   Status
 ------ ----                      ---- ------
 archive1-uat.domain.com             1 Online
 admin1-uat.domain.com               2 Online
 mgmt1-uat.domain.com                3 Online
 map1-uat.domain.com                 4 Online
 map2-uat.domain.com                 5 Online
 cache1-uat.domain.com               6 Offline, Local
 data1-uat.domain.com                8 Online

[root@mgmt1-uat ~]# clustat
Cluster Status for projectuat @ Mon Dec 1 13:58:04 2014
Member Status: Inquorate

 Member Name                      ID   Status
 ------ ----                      ---- ------
 archive1-uat.domain.com             1 Offline
 admin1-uat.domain.com               2 Offline
 mgmt1-uat.domain.com                3 Online, Local
 map1-uat.domain.com                 4 Offline
 map2-uat.domain.com                 5 Offline
 cache1-uat.domain.com               6 Offline
 data1-uat.domain.com                8 Offline

cman-3.0.12.1-68.el6.x86_64

[root@data1-uat ~]# cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="66" name="projectuat">
  <clusternodes>
    <clusternode name="admin1-uat.domain.com" nodeid="2">
      <fence>
        <method name="fenceadmin1uat">
          <device name="vcappliancesoap" port="admin1-uat" ssl="on" uuid="421df3c4-a686-9222-366e-9a67b25f62b2"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="mgmt1-uat.domain.com" nodeid="3">
      <fence>
        <method name="fenceadmin1uat">
          <device name="vcappliancesoap" port="mgmt1-uat" ssl="on" uuid="421d5ff5-66fa-5703-66d3-97f845cf8239"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="map1-uat.domain.com" nodeid="4">
      <fence>
        <method name="fencemap1uat">
          <device name="idracmap1uat"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="map2-uat.domain.com" nodeid="5">
      <fence>
        <method name="fencemap2uat">
          <device name="idracmap2uat"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="cache1-uat.domain.com" nodeid="6">
      <fence>
        <method name="fencecache1uat">
          <device name="idraccache1uat"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="data1-uat.domain.com" nodeid="8">
      <fence>
        <method name="fencedata1uat">
          <device name="idracdata1uat"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="archive1-uat.domain.com" nodeid="1">
      <fence>
        <method name="fenceadmin1uat">
          <device name="vcappliancesoap" port="archive1-uat" ssl="on" uuid="421d16b2-3ed0-0b9b-d530-0b151d81d24e"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_vmware_soap" ipaddr="x.x.x.130" login="fenceuat" login_timeout="10" name="vcappliancesoap" passwd_script="/etc/cluster/forfencing.sh" power_timeout="10" power_wait="30" retry_on="3" shell_timeout="10" ssl="1"/>
    <fencedevice agent="fence_drac5" cmd_prompt="admin1->" ipaddr="x.x.x.47" login="fenceuat" name="idracdata1uat" passwd_script="/etc/cluster/forfencing.sh" power_timeout="60" power_wait="60" retry_on="10" secure="on" shell_timeout="10"/>
    <fencedevice agent="fence_drac5" cmd_prompt="admin1->" ipaddr="x.x.x.48" login="fenceuat" name="idracdata2uat" passwd_script="/etc/cluster/forfencing.sh" power_timeout="60" power_wait="60" retry_on="10" secure="on" shell_timeout="10"/>
    <fencedevice agent="fence_drac5" cmd_prompt="admin1->" ipaddr="x.x.x.82" login="fenceuat" name="idracmap1uat" passwd_script="/etc/cluster/forfencing.sh" power_timeout="60" power_wait="60" retry_on="10" secure="on" shell_timeout="10"/>
    <fencedevice agent="fence_drac5" cmd_prompt="admin1->" ipaddr="x.x.x.96" login="fenceuat" name="idracmap2uat" passwd_script="/etc/cluster/forfencing.sh" power_timeout="60" power_wait="60" retry_on="10" secure="on" shell_timeout="10"/>
    <fencedevice agent="fence_drac5" cmd_prompt="admin1->" ipaddr="x.x.x.83" login="fenceuat" name="idraccache1uat" passwd_script="/etc/cluster/forfencing.sh" power_timeout="60" power_wait="60" retry_on="10" secure="on" shell_timeout="10"/>
    <fencedevice agent="fence_drac5" cmd_prompt="admin1->" ipaddr="x.x.x.97" login="fenceuat" name="idraccache2uat" passwd_script="/etc/cluster/forfencing.sh" power_timeout="60" power_wait="60" retry_on="10" secure="on" shell_timeout="10"/>
  </fencedevices>
</cluster>