We have 11 GFS2 mounts of 10-20TB each that I need to share across all
nodes. It's the only reason we went with the cluster solution. I don't
know how we could split it up into smaller clusters.

On Mon, Dec 1, 2014 at 12:14 PM, Digimer <lists@xxxxxxxxxx> wrote:
> On 01/12/14 11:56 AM, Megan . wrote:
>>
>> Thank you for your replies.
>>
>> The cluster is intended to be 9 nodes, but I haven't finished
>> building the remaining 2. Our production cluster is expected to be
>> similar in size. What tuning should I be looking at?
>>
>> Here is a link to our config: http://pastebin.com/LUHM8GQR (I had
>> to remove IP addresses.)
>
> Can you simplify those fencedevice definitions? I would wonder if the
> set timeouts could be part of the problem. Always start with the
> simplest possible configuration and only add options in response to
> actual issues discovered in testing.

I can try to simplify. I set the longer timeouts because of what I saw
happening on the physical boxes: a box would be on its way down or up
and the fence command would fail, but the box actually did come back
online. The physicals take 10-15 minutes to reboot, and I wasn't sure
how to handle the timeout issues, so I made the timeouts a bit extreme
for testing. I'll make the config more vanilla for troubleshooting.

>> I tried crashing a node (echo c > /proc/sysrq-trigger), and the
>> cluster kept seeing it as online and never fenced it, yet I could
>> no longer ssh to the node. I did this on a physical and a VM box
>> with the same result. I had to run fence_node <node> to get it to
>> reboot, but it came up split-brained (thinking it was the only one
>> online). Now that node has cman down and the rest of the cluster
>> sees it as still online.
>
> Then corosync failed to detect the fault. That is a sign, to me, of a
> fundamental network or configuration issue. Corosync should have
> shown messages about a node being lost and reconfiguring. If that
> didn't happen, then you're not even up to the point where fencing
> factors in.
>
> Did you configure corosync.conf? When it came up, did it think it was
> quorate or inquorate?

corosync.conf didn't work, since it seems the Red Hat HA cluster stack
doesn't use that file:
http://people.redhat.com/ccaulfie/docs/CmanYinYang.pdf

I tried it because we wanted to put the multicast traffic on a
different bond/VLAN, but we figured out the file isn't used.

>> I thought fencing was working because I'm able to run fence_node
>> <node> and see the box reboot and come back online. I did have to
>> get the FC version of the fence agents because of an issue with the
>> iDRAC agent not working properly. We are running
>> fence-agents-3.1.6-1.fc14.x86_64.
>
> That tells you that the configuration of the fence agents is working,
> but it doesn't test failure detection. You can use the 'fence_check'
> tool to see if the cluster can talk to everything, but in the end,
> the only useful test is to simulate an actual crash.
>
> Wait; 'fc14'?! What OS are you using?

We are on CentOS 6.6. I went with the Fedora agents because of this
exact issue:
http://forum.proxmox.com/threads/12311-Proxmox-HA-fencing-and-Dell-iDrac7
I read that it was fixed in the next version, which I could only find
for FC.
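In case it helps, this is roughly how I have been checking an agent by
hand, outside the cluster (fence_ipmilan is only an example, and the
address and credentials are placeholders for the values I stripped out
of the pastebin; substitute whichever agent and options your
cluster.conf actually uses):

    # Ask the iDRAC for the node's power state over IPMI.
    # "-o status" only reads the state; it never power-cycles the box.
    fence_ipmilan -a <idrac-ip> -l <login> -p <password> -o status

If that reports the power state correctly for every node, the path from
the cluster hosts to the iDRACs is at least good, and any remaining
fencing trouble is more likely in cluster.conf itself.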
>> fence_tool dump worked on one of my nodes, but it is just hanging
>> on the rest:
>>
>> [root@map1-uat ~]# fence_tool dump
>> 1417448610 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/fenced.log
>> 1417448610 fenced 3.0.12.1 started
>> 1417448610 connected to dbus :1.12
>> 1417448610 cluster node 1 added seq 89048
>> 1417448610 cluster node 2 added seq 89048
>> 1417448610 cluster node 3 added seq 89048
>> 1417448610 cluster node 4 added seq 89048
>> 1417448610 cluster node 5 added seq 89048
>> 1417448610 cluster node 6 added seq 89048
>> 1417448610 cluster node 8 added seq 89048
>> 1417448610 our_nodeid 4 our_name map1-uat.project.domain.com
>> 1417448611 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/fenced.log
>> 1417448611 logfile cur mode 100644
>> 1417448611 cpg_join fenced:daemon ...
>> 1417448621 daemon cpg_join error retrying
>> 1417448631 daemon cpg_join error retrying
>> 1417448641 daemon cpg_join error retrying
>> 1417448651 daemon cpg_join error retrying
>> 1417448661 daemon cpg_join error retrying
>> 1417448671 daemon cpg_join error retrying
>> 1417448681 daemon cpg_join error retrying
>> 1417448691 daemon cpg_join error retrying
>> .
>> .
>> .
>>
>> [root@map1-uat ~]# clustat
>> Cluster Status for gibsuat @ Mon Dec 1 16:51:49 2014
>> Member Status: Quorate
>>
>>  Member Name                                ID   Status
>>  ------ ----                                ---- ------
>>  archive1-uat.project.domain.com               1 Online
>>  admin1-uat.project.domain.com                 2 Online
>>  mgmt1-uat.project.domain.com                  3 Online
>>  map1-uat.project.domain.com                   4 Online, Local
>>  map2-uat.project.domain.com                   5 Online
>>  cache1-uat.project.domain.com                 6 Online
>>  data1-uat.project.domain.com                  8 Online
>>
>> /var/log/cluster/fenced.log on the nodes is logging "Dec 01
>> 16:02:34 fenced cpg_join error retrying" every tenth of a second.
>>
>> Obviously we are having some major issues. These are fresh boxes,
>> with no services running right now other than the ones related to
>> the cluster.
>
> What OS/version?
>
>> I've also experimented with <cman transport="udpu"/> to disable
>> multicast, to see if that helped, but it doesn't seem to make a
>> difference to the node stability.
>
> Very bad idea with >2~3 node clusters. The overhead will be far too
> great for a 7~9 node cluster.
>
>> Is there a document or some sort of reference that I can give the
>> network folks on how the switches should be configured? I read
>> stuff on boards about IGMP snooping, but I couldn't find anything
>> from Red Hat to hand them.
>
> I have these:
>
> https://alteeve.ca/w/AN!Cluster_Tutorial_2#Six_Network_Interfaces.2C_Seriously.3F
> https://alteeve.ca/w/AN!Cluster_Tutorial_2#Network_Switches
> https://alteeve.ca/w/AN!Cluster_Tutorial_2#Network_Security_Considerations
> https://alteeve.ca/w/AN!Cluster_Tutorial_2#Network
>
> There are comments in there about multicast, etc.

Thank you for the links. I will review them with our network folks;
hopefully they will help us sort out some of our issues. I will use
the fence_check tool to see if I can troubleshoot the fencing.

Thank you very much for all of your suggestions.

> --
> Digimer
> Papers and Projects: https://alteeve.ca/w/
> What if the cure for cancer is trapped in the mind of a person
> without access to education?
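P.S. One thing we are going to try with the network team, in case it
is useful to anyone else reading the archive: omping can show whether
multicast actually flows between the nodes, independently of the
cluster stack. Something like the following, started on every node at
roughly the same time (the host names below are just a few of ours;
list the full membership when running it for real):

    # Each node sends both unicast and multicast pings to the others.
    # If unicast answers but multicast times out, suspect IGMP
    # snooping on the switches.
    omping -c 60 map1-uat map2-uat data1-uat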