Re: new cluster acting odd

On 01/12/14 14:16, Megan . wrote:
Good Day,

I'm fairly new to the cluster world, so I apologize in advance for
any silly questions. Thank you for any help.

We decided to use this cluster solution in order to share GFS2 mounts
across servers. We have a newly set up 7-node cluster that is
acting oddly. It has 3 VMware guests and 4 physical hosts (Dells
with iDRACs), all running CentOS 6.6. I have fencing
working (I'm able to run fence_node <node> and it fences
successfully). I do not have the GFS2 mounts in the cluster yet.

When I don't touch the servers, my cluster looks perfect with all
nodes online. But when I start testing fencing, I have an odd problem
where I end up with split brain between some of the nodes. They won't
automatically fence each other when it gets into this state.

In corosync.log on the node that gets split out, I see the totem
chatter, but it seems confused and just keeps repeating the lines
below over and over:


Dec 01 12:39:15 corosync [TOTEM ] Retransmit List: 22 24 25 26 27 28 29 2a 2b 2c
Dec 01 12:39:17 corosync [TOTEM ] Retransmit List: 22 24 25 26 27 28 29 2a 2b 2c
Dec 01 12:39:19 corosync [TOTEM ] Retransmit List: 22 24 25 26 27 28 29 2a 2b 2c
Dec 01 12:39:39 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b
Dec 01 12:39:39 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b 21 23 24 25 26 27 28 29 2a 2b 32
..
..
..
Dec 01 12:54:49 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b 1d 1f 20 21 22 23 24 25 26 27 2e 30 31 32 37 38 39 3a 3b 3c
Dec 01 12:54:50 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b 1d 1f 20 21 22 23 24 25 26 27 2e 30 31 32 37 38 39 3a 3b 3c
Dec 01 12:54:50 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b 1d 1f 20 21 22 23 24 25 26 27 2e 30 31 32 37 38 39 3a 3b 3c


These messages are the key to your problem, and nothing will be fixed until you can get rid of them. As Digimer said, they are often caused by a congested network, but they can also be caused by multicast traffic not being passed between nodes - and a mix of physical and virtual nodes could easily be contributing to this. The easiest way to prove it (and possibly get the system working) is to switch from multicast to plain UDP unicast traffic:

<cman transport="udpu"/>

in cluster.conf. You'll need to do this on all nodes and reboot the whole cluster. All in all, it's probably easier than messing around checking routers, switches and kernel routing parameters in a mixed-mode cluster!
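For context, here is a rough sketch of where that attribute sits in /etc/cluster/cluster.conf. The cluster name, node names and fence details below are placeholders, not taken from your setup - only the <cman transport="udpu"/> line is the actual change, and remember to bump config_version when you edit the file:

    <?xml version="1.0"?>
    <cluster name="mycluster" config_version="3">
      <!-- switch corosync from multicast to UDP unicast -->
      <cman transport="udpu"/>
      <clusternodes>
        <clusternode name="node1.example.com" nodeid="1">
          <fence/>
        </clusternode>
        <clusternode name="node2.example.com" nodeid="2">
          <fence/>
        </clusternode>
      </clusternodes>
    </cluster>

With transport="udpu", cman tells corosync to send cluster traffic directly to each node's address instead of relying on multicast, which sidesteps any multicast filtering between the physical and virtual hosts.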

Chrissie

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
