Re: new cluster acting odd

"Megan ." <nagemnna@xxxxxxxxx> · Tue, 2 Dec 2014 09:04:18 -0500

Ok, thank you.

I did try this at one point and it didn't seem to have an impact, but
I will try again and also try some of the debugging commands provided
by others in this thread.

Thank you again for your help.

On Tue, Dec 2, 2014 at 3:46 AM, Christine Caulfield <ccaulfie@xxxxxxxxxx> wrote:
> On 01/12/14 14:16, Megan . wrote:
>>
>> Good Day,
>>
>> I'm fairly new to the cluster world, so I apologize in advance for
>> silly questions.  Thank you for any help.
>>
>> We decided to use this cluster solution in order to share GFS2 mounts
>> across servers.  We have a 7 node cluster that is newly set up, but
>> acting oddly.  It has 3 VMware guest hosts and 4 physical hosts (Dells
>> with iDRACs).  They are all running CentOS 6.6.  I have fencing
>> working (I'm able to do fence_node node and it will fence with
>> success).  I do not have the GFS2 mounts in the cluster yet.
>>
>> When I don't touch the servers, my cluster looks perfect with all
>> nodes online. But when I start testing fencing, I have an odd problem
>> where I end up with split brain between some of the nodes.  They don't
>> seem to fence each other automatically when it gets like this.
>>
>> In the corosync.log for the node that gets split out I see the totem
>> chatter, but it seems confused and just keeps repeating the below over
>> and over:
>>
>>
>> Dec 01 12:39:15 corosync [TOTEM ] Retransmit List: 22 24 25 26 27 28 29 2a
>> 2b 2c
>>
>> Dec 01 12:39:17 corosync [TOTEM ] Retransmit List: 22 24 25 26 27 28 29 2a
>> 2b 2c
>>
>> Dec 01 12:39:19 corosync [TOTEM ] Retransmit List: 22 24 25 26 27 28 29 2a
>> 2b 2c
>>
>> Dec 01 12:39:39 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b
>>
>> Dec 01 12:39:39 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b
>> 21 23 24 25 26 27 28 29 2a 2b 32
>> ..
>> ..
>> ..
>> Dec 01 12:54:49 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b
>> 1d 1f 20 21 22 23 24 25 26 27 2e 30 31 32 37 38 39 3a 3b 3c
>>
>> Dec 01 12:54:50 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b
>> 1d 1f 20 21 22 23 24 25 26 27 2e 30 31 32 37 38 39 3a 3b 3c
>>
>> Dec 01 12:54:50 corosync [TOTEM ] Retransmit List: 1 3 4 5 6 7 8 9 a b
>> 1d 1f 20 21 22 23 24 25 26 27 2e 30 31 32 37 38 39 3a 3b 3c
>>
>
> These messages are the key to your problem and nothing will be fixed until
> you can get rid of them. As Digimer said they are often caused by a
> congested network, but it could also be multicast traffic not being passed
> between nodes - a mix of physical and virtual nodes could easily be
> contributing to this. The easiest way to prove this (and possibly get the
> system working) is to switch from multicast to normal UDP unicast traffic:
>
> <cman transport="udpu"/>
>
> in cluster.conf. You'll need to do this on all nodes and reboot the whole
> cluster. All in all, it's probably easier than messing around checking
> routers, switches and kernel routing parameters in a mixed-mode cluster!
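>
> For reference, a minimal sketch of where that element sits in
> /etc/cluster/cluster.conf. The cluster name, node names and
> config_version value here are placeholders, not taken from your setup.
> Keep your existing clusternode and fencing entries as they are, and
> remember to increase config_version whenever you edit the file:
>
> ```xml
> <?xml version="1.0"?>
> <!-- hypothetical names and version number, for illustration only -->
> <cluster name="mycluster" config_version="2">
>   <!-- use UDP unicast instead of multicast for cluster traffic -->
>   <cman transport="udpu"/>
>   <clusternodes>
>     <clusternode name="node1.example.com" nodeid="1"/>
>     <clusternode name="node2.example.com" nodeid="2"/>
>     <!-- ... remaining nodes, each with its existing fence config ... -->
>   </clusternodes>
> </cluster>
> ```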
>
> Chrissie
>
>
> --
> Linux-cluster mailing list
> Linux-cluster@xxxxxxxxxx
> https://www.redhat.com/mailman/listinfo/linux-cluster
