Ok, thank you. I did try this at one point and it didn't seem to have an
impact, but I will try again and also try some of the debugging commands
provided by others in this thread.

Thank you again for your help.

On Tue, Dec 2, 2014 at 3:46 AM, Christine Caulfield <ccaulfie@xxxxxxxxxx> wrote:
> On 01/12/14 14:16, Megan . wrote:
>>
>> Good Day,
>>
>> I'm fairly new to the cluster world so I apologize in advance for
>> silly questions. Thank you for any help.
>>
>> We decided to use this cluster solution in order to share GFS2 mounts
>> across servers. We have a 7 node cluster that is newly set up, but
>> acting oddly. It has 3 VMware guest hosts and 4 physical hosts (Dells
>> with iDRACs). They are all running CentOS 6.6. I have fencing
>> working (I'm able to do fence_node node and it will fence with
>> success). I do not have the GFS2 mounts in the cluster yet.
>>
>> When I don't touch the servers, my cluster looks perfect with all
>> nodes online. But when I start testing fencing, I have an odd problem
>> where I end up with split brain between some of the nodes. They won't
>> seem to automatically fence each other when it gets like this.
>>
>> In the corosync.log for the node that gets split out I see the totem
>> chatter, but it seems confused and just keeps doing the below over and
>> over:
>>
>>
>> Dec 01 12:39:15 corosync [TOTEM ] Retransmit List: 22 24 25 26 27 28 29 2a
>> 2b 2c
>>
>> Dec 01 12:39:17 corosync [TOTEM ] Retransmit List: 22 24 25 26 27 28 29 2a
>> 2b 2c
>>
>> Dec 01 12:39:19 corosync [TOTEM ] Retransmit List: 22 24 25 26 27 28 29 2a
>> 2b 2c
>>
>> Dec 01 12:39:39 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b
>>
>> Dec 01 12:39:39 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b
>> 21 23 24 25 26 27 28 29 2a 2b 32
>> ..
>> ..
>> ..
>> Dec 01 12:54:49 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b
>> 1d 1f 20 21 22 23 24 25 26 27 2e 30 31 32 37 38 39 3a 3b 3c
>>
>> Dec 01 12:54:50 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b
>> 1d 1f 20 21 22 23 24 25 26 27 2e 30 31 32 37 38 39 3a 3b 3c
>>
>> Dec 01 12:54:50 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b
>> 1d 1f 20 21 22 23 24 25 26 27 2e 30 31 32 37 38 39 3a 3b 3c
>>
>
> These messages are the key to your problem and nothing will be fixed until
> you can get rid of them. As Digimer said, they are often caused by a
> congested network, but it could also be multicast traffic not being passed
> between nodes - a mix of physical and virtual nodes could easily be
> contributing to this. The easiest way to prove this (and possibly get the
> system working) is to switch from multicast to normal UDP unicast traffic
>
> <cman transport="udpu"/>
>
> in cluster.conf. You'll need to do this on all nodes and reboot the whole
> cluster. All in all, it's probably easier than messing around checking
> routers, switches and kernel routing parameters in a mixed-mode cluster!
>
> Chrissie
>
>
> --
> Linux-cluster mailing list
> Linux-cluster@xxxxxxxxxx
> https://www.redhat.com/mailman/listinfo/linux-cluster
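
Since the Retransmit List messages are the thing to get rid of, it may help to count them before and after a change such as the switch to udpu. The following is only a rough sketch: it assumes each log entry sits on a single line in the format quoted in this thread, and the function name retransmits_per_minute is made up for illustration.

```python
import re
from collections import Counter

# Assumed line format, taken from the log excerpt quoted in this thread:
#   Dec 01 12:39:15 corosync [TOTEM ] Retransmit List: 22 24 25 26
# The named group "minute" keeps the timestamp without the seconds.
RETRANSMIT_RE = re.compile(
    r"^(?P<minute>\w{3} \d{2} \d{2}:\d{2}):\d{2} corosync \[TOTEM\s*\] "
    r"Retransmit List:(?: [0-9a-f]+)*\s*$"
)


def retransmits_per_minute(lines):
    """Count 'Retransmit List' log lines, grouped by minute (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        match = RETRANSMIT_RE.match(line.strip())
        if match:
            counts[match.group("minute")] += 1
    return counts


# Example use, with the path to corosync.log supplied by the caller:
#   with open(path_to_corosync_log) as handle:
#       for minute, count in sorted(retransmits_per_minute(handle).items()):
#           print(minute, count)
```

If the counts drop to zero after the change, that points at multicast delivery rather than general network congestion, which is the distinction drawn in the reply above.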