We have 11 GFS2 mounts of 10-20TB each that I need to share across all
nodes. It's the only reason we went with the cluster solution. I don't
know how we could split it up into smaller clusters.

On Mon, Dec 1, 2014 at 12:14 PM, Digimer <lists@xxxxxxxxxx> wrote:
> On 01/12/14 11:56 AM, Megan . wrote:
>>
>> Thank you for your replies.
>>
>> The cluster is intended to be 9 nodes, but I haven't finished
>> building the remaining 2. Our production cluster is expected to be
>> similar in size. What tuning should I be looking at?
>>
>> Here is a link to our config: http://pastebin.com/LUHM8GQR (I had
>> to remove IP addresses.)
>
> Can you simplify those fencedevice definitions? I would wonder if the
> set timeouts could be part of the problem. Always start with the
> simplest possible configuration and only add options in response to
> actual issues discovered in testing.

I can try to simplify. I set the longer timeouts because of what I saw
happening on the physical boxes: a box would be on its way down or up
and the fence command would fail, but the box actually did come back
online. The physicals take 10-15 minutes to reboot, and I wasn't sure
how to handle the timeout issues, so I made the timeouts a bit extreme
for testing. I'll make the config more vanilla for troubleshooting.

>> I tried crashing a node (echo c > /proc/sysrq-trigger), and the
>> cluster kept seeing it as online and never fenced it, yet I could
>> no longer ssh to the node. I did this on a physical and a VM box
>> with the same result. I had to run fence_node <node> to get it to
>> reboot, but it came up split-brained (thinking it was the only one
>> online). Now that node has cman down and the rest of the cluster
>> sees it as still online.
>
> Then corosync failed to detect the fault. That is a sign, to me, of a
> fundamental network or configuration issue. Corosync should have
> shown messages about a node being lost and reconfiguring. If that
> didn't happen, then you're not even up to the point where fencing
> factors in.
>
> Did you configure corosync.conf? When it came up, did it think it was
> quorate or inquorate?

corosync.conf didn't work, since it seems the Red Hat HA cluster stack
doesn't use that file:
http://people.redhat.com/ccaulfie/docs/CmanYinYang.pdf

I tried it because we wanted to put the multicast traffic on a
different bond/VLAN, but we figured out the file isn't used.

>> I thought fencing was working because I'm able to run fence_node
>> <node> and see the box reboot and come back online. I did have to
>> get the FC version of the fence agents because of an issue with the
>> iDRAC agent not working properly. We are running
>> fence-agents-3.1.6-1.fc14.x86_64.
>
> That tells you that the configuration of the fence agents is working,
> but it doesn't test failure detection. You can use the 'fence_check'
> tool to see if the cluster can talk to everything, but in the end,
> the only useful test is to simulate an actual crash.
>
> Wait; 'fc14'?! What OS are you using?

We are on CentOS 6.6. I went with the Fedora agents because of this
exact issue:
http://forum.proxmox.com/threads/12311-Proxmox-HA-fencing-and-Dell-iDrac7
I read that it was fixed in the next version, which I could only find
for FC.
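In case it helps, this is roughly how I have been checking an agent by
hand, outside the cluster (fence_ipmilan is only an example, and the
address and credentials are placeholders for the values I stripped out
of the pastebin; substitute whichever agent and options your
cluster.conf actually uses):

    # Ask the iDRAC for the node's power state over IPMI.
    # "-o status" only reads the state; it never power-cycles the box.
    fence_ipmilan -a <idrac-ip> -l <login> -p <password> -o status

If that reports the power state correctly for every node, the path from
the cluster hosts to the iDRACs is at least good, and any remaining
fencing trouble is more likely in cluster.conf itself.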
>> fence_tool dump worked on one of my nodes, but it is just hanging
>> on the rest:
>>
>> [root@map1-uat ~]# fence_tool dump
>> 1417448610 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/fenced.log
>> 1417448610 fenced 3.0.12.1 started
>> 1417448610 connected to dbus :1.12
>> 1417448610 cluster node 1 added seq 89048
>> 1417448610 cluster node 2 added seq 89048
>> 1417448610 cluster node 3 added seq 89048
>> 1417448610 cluster node 4 added seq 89048
>> 1417448610 cluster node 5 added seq 89048
>> 1417448610 cluster node 6 added seq 89048
>> 1417448610 cluster node 8 added seq 89048
>> 1417448610 our_nodeid 4 our_name map1-uat.project.domain.com
>> 1417448611 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/fenced.log
>> 1417448611 logfile cur mode 100644
>> 1417448611 cpg_join fenced:daemon ...
>> 1417448621 daemon cpg_join error retrying
>> 1417448631 daemon cpg_join error retrying
>> 1417448641 daemon cpg_join error retrying
>> 1417448651 daemon cpg_join error retrying
>> 1417448661 daemon cpg_join error retrying
>> 1417448671 daemon cpg_join error retrying
>> 1417448681 daemon cpg_join error retrying
>> 1417448691 daemon cpg_join error retrying
>> .
>> .
>> .
>>
>> [root@map1-uat ~]# clustat
>> Cluster Status for gibsuat @ Mon Dec 1 16:51:49 2014
>> Member Status: Quorate
>>
>>  Member Name                                ID   Status
>>  ------ ----                                ---- ------
>>  archive1-uat.project.domain.com               1 Online
>>  admin1-uat.project.domain.com                 2 Online
>>  mgmt1-uat.project.domain.com                  3 Online
>>  map1-uat.project.domain.com                   4 Online, Local
>>  map2-uat.project.domain.com                   5 Online
>>  cache1-uat.project.domain.com                 6 Online
>>  data1-uat.project.domain.com                  8 Online
>>
>> /var/log/cluster/fenced.log on the nodes is logging "Dec 01
>> 16:02:34 fenced cpg_join error retrying" every tenth of a second.
>>
>> Obviously we are having some major issues. These are fresh boxes,
>> with no services running right now other than the ones related to
>> the cluster.
>
> What OS/version?
>
>> I've also experimented with <cman transport="udpu"/> to disable
>> multicast, to see if that helped, but it doesn't seem to make a
>> difference to the node stability.
>
> Very bad idea with >2~3 node clusters. The overhead will be far too
> great for a 7~9 node cluster.
>
>> Is there a document or some sort of reference that I can give the
>> network folks on how the switches should be configured? I read
>> stuff on boards about IGMP snooping, but I couldn't find anything
>> from Red Hat to hand them.
>
> I have these:
>
> https://alteeve.ca/w/AN!Cluster_Tutorial_2#Six_Network_Interfaces.2C_Seriously.3F
> https://alteeve.ca/w/AN!Cluster_Tutorial_2#Network_Switches
> https://alteeve.ca/w/AN!Cluster_Tutorial_2#Network_Security_Considerations
> https://alteeve.ca/w/AN!Cluster_Tutorial_2#Network
>
> There are comments in there about multicast, etc.

Thank you for the links. I will review them with our network folks;
hopefully they will help us sort out some of our issues. I will use
the fence_check tool to see if I can troubleshoot the fencing.

Thank you very much for all of your suggestions.

> --
> Digimer
> Papers and Projects: https://alteeve.ca/w/
> What if the cure for cancer is trapped in the mind of a person
> without access to education?
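P.S. One thing we are going to try with the network team, in case it
is useful to anyone else reading the archive: omping can show whether
multicast actually flows between the nodes, independently of the
cluster stack. Something like the following, started on every node at
roughly the same time (the host names below are just a few of ours;
list the full membership when running it for real):

    # Each node sends both unicast and multicast pings to the others.
    # If unicast answers but multicast times out, suspect IGMP
    # snooping on the switches.
    omping -c 60 map1-uat map2-uat data1-uat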