Re: new cluster acting odd

On 01/12/14 11:56 AM, Megan . wrote:
Thank you for your replies.

The cluster is intended to be 9 nodes, but I haven't finished building
the remaining 2.  Our production cluster is expected to be similar in
size.  What tuning should I be looking at?


Here is a link to our config.  http://pastebin.com/LUHM8GQR  I had to
remove IP addresses.

Can you simplify those fencedevice definitions? I wonder whether the timeouts you've set could be part of the problem. Always start with the simplest possible configuration and only add options in response to actual issues found in testing.
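For example, one method per node pointing at one device entry is usually enough. This is only a sketch, with placeholder agent name, address and credentials; map it onto whatever is already working for you:

  <clusternode name="map1-uat.project.domain.com" nodeid="4">
    <fence>
      <method name="1">
        <device name="idrac_map1"/>   <!-- points at the fencedevice below -->
      </method>
    </fence>
  </clusternode>
  ...
  <fencedevices>
    <!-- placeholder values; keep whichever agent already works for your iDRACs -->
    <fencedevice name="idrac_map1" agent="fence_idrac" ipaddr="192.168.0.10" login="root" passwd="secret"/>
  </fencedevices>

No timeout or retry options until testing shows you actually need them.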

I tried the method of (echo c > /proc/sysrq-trigger) to crash a node,
but the cluster kept seeing it as online and never fenced it, even
though I could no longer ssh to the node.  I did this on a physical
box and a VM with the same result.  I had to run fence_node <node> to
get it to reboot, but it came up split-brained (thinking it was the
only node online). Now that node has cman down and the rest of the
cluster still sees it as online.

Then corosync failed to detect the fault. That is a sign, to me, of a fundamental network or configuration issue. Corosync should have shown messages about a node being lost and reconfiguring. If that didn't happen, then you're not even up to the point where fencing factors in.
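A quick way to confirm that, assuming the standard EL6-style log locations: crash the test node again and, from a surviving node, watch whether the membership actually changes. Something like:

cman_tool nodes                        # the crashed node's 'Sts' should flip from M to X within seconds
tail -f /var/log/cluster/corosync.log  # look for TOTEM messages about a node being lost and a new configuration forming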

Did you configure corosync.conf? When it came up, did it think it was quorate or inquorate?
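(With cman, corosync gets its configuration from cluster.conf, so a hand-written corosync.conf can conflict with it.) To see what a node thinks of its own membership and quorum, something like:

cman_tool status | egrep 'Nodes|Expected votes|Total votes|Quorum'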

I thought fencing was working because I'm able to run fence_node
<node> and see the box reboot and come back online.  I did have to get
the FC (Fedora) build of the fence agents because of an issue with the
idrac agent not working properly.  We are running
fence-agents-3.1.6-1.fc14.x86_64.

That tells you that the configuration of the fence agents is working, but it doesn't test failure detection. You can use the 'fence_check' tool to see if the cluster can talk to everything, but in the end, the only useful test is to simulate an actual crash.
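Roughly the sequence I mean, reusing the crash method you already tried (log paths assume the stock layout):

# from any member: check that each node's fence device can be reached
fence_check
# on the victim node:
echo c > /proc/sysrq-trigger
# on a surviving node: detection should show up here, followed by the fence action
tail -f /var/log/cluster/corosync.log /var/log/cluster/fenced.log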

Wait, 'fc14'?! What OS are you using?

fence_tool dump worked on one of my nodes, but it is just hanging on the rest.

[root@map1-uat ~]# fence_tool dump
1417448610 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/fenced.log
1417448610 fenced 3.0.12.1 started
1417448610 connected to dbus :1.12
1417448610 cluster node 1 added seq 89048
1417448610 cluster node 2 added seq 89048
1417448610 cluster node 3 added seq 89048
1417448610 cluster node 4 added seq 89048
1417448610 cluster node 5 added seq 89048
1417448610 cluster node 6 added seq 89048
1417448610 cluster node 8 added seq 89048
1417448610 our_nodeid 4 our_name map1-uat.project.domain.com
1417448611 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/fenced.log
1417448611 logfile cur mode 100644
1417448611 cpg_join fenced:daemon ...
1417448621 daemon cpg_join error retrying
1417448631 daemon cpg_join error retrying
1417448641 daemon cpg_join error retrying
1417448651 daemon cpg_join error retrying
1417448661 daemon cpg_join error retrying
1417448671 daemon cpg_join error retrying
1417448681 daemon cpg_join error retrying
1417448691 daemon cpg_join error retrying
.
.
.


[root@map1-uat ~]# clustat
Cluster Status for gibsuat @ Mon Dec  1 16:51:49 2014
Member Status: Quorate

  Member Name                                                     ID   Status
  ------ ----                                                     ---- ------
  archive1-uat.project.domain.com                                1 Online
  admin1-uat.project.domain.com                                  2 Online
  mgmt1-uat.project.domain.com                                   3 Online
  map1-uat.project.domain.com                                    4 Online, Local
  map2-uat.project.domain.com                                    5 Online
  cache1-uat.project.domain.com                                 6 Online
  data1-uat.project.domain.com                                   8 Online


The /var/log/cluster/fenced.log on the nodes is logging "fenced
cpg_join error retrying" (e.g. Dec 01 16:02:34) every tenth of a second.

Obviously we're having some major issues.  These are fresh boxes, with
no services running right now other than the ones related to the
cluster.

What OS/version?
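For the cpg_join retries, it would also be worth comparing the daemon view on every node; a rough sketch:

cman_tool services    # run on each node and compare; the fence domain member list should be identical everywhere and not stuck waiting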

I've also experimented with <cman transport="udpu"/> to disable
multicast to see if that helped, but it doesn't seem to make a
difference in node stability.

That's a very bad idea with anything beyond a 2~3 node cluster; the unicast overhead will be far too great for a 7~9 node cluster.
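If the real question is whether multicast works across those switches, test it directly with omping instead of working around it. For example (hostnames taken from your clustat output; run the same command on every node at roughly the same time):

omping map1-uat.project.domain.com map2-uat.project.domain.com \
       cache1-uat.project.domain.com data1-uat.project.domain.com   # ...and the rest of the nodes
# each node should report both unicast and multicast responses from the others;
# multicast loss here points at the switches (IGMP snooping without a querier is a common culprit).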

Is there a document or some sort of reference that I can give the
network folks on how the switches should be configured?  I read things
on message boards about IGMP snooping, but I couldn't find anything
from Red Hat to hand them.

I have this:

https://alteeve.ca/w/AN!Cluster_Tutorial_2#Six_Network_Interfaces.2C_Seriously.3F

https://alteeve.ca/w/AN!Cluster_Tutorial_2#Network_Switches

https://alteeve.ca/w/AN!Cluster_Tutorial_2#Network_Security_Considerations

https://alteeve.ca/w/AN!Cluster_Tutorial_2#Network

There are comments in there about multicast, etc.

--
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?




