I didn't notice the qdisk entry, so yes, expected_votes="3" is fine. If
fencing is working, then that isn't a problem.

On some switches, multicast groups are occasionally deleted, forcing
members to re-join the multicast group (I've not seen this myself, but
I've heard of it on Cisco switches, IIRC). The idea was to remove unused
groups over time.
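A rough way to check for that, assuming eth0 is the heartbeat interface
on these hosts (adjust to your environment), is to note the multicast
address cman is using and then watch whether the node stays joined to
that group and IGMP traffic keeps flowing around the time of a failure:

  # show the multicast address cman is using, plus expected/total votes
  cman_tool status

  # confirm the heartbeat interface is still subscribed to that group
  ip maddr show dev eth0

  # watch IGMP queries, reports and leaves while the problem recurs
  tcpdump -i eth0 -n igmp

If the TOTEM "FAILED TO RECEIVE" errors line up with the IGMP queries or
membership reports stopping, that points at the switch side rather than
at openais itself.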
Is this a new cluster? If it is, have you considered using RHEL 6.3?
There are a lot of improvements in the cluster 3 stable series. If not,
can you update to RHEL 5.8 to get all the outstanding updates?

Digimer

On 07/16/2012 02:46 PM, Javier Vela wrote:
> Hi,
>
> I set two_node=0 on purpose, because I use a quorum disk with one
> additional vote. If one node fails, I still have two votes and the
> cluster remains quorate, avoiding a split-brain situation. Is this
> approach wrong? In my tests, this aspect of the quorum worked well.
>
> Fencing works very well. When something happens, the fencing kills the
> faulting server without any problems.
>
> The first time I ran into problems I checked multicast traffic between
> the nodes with iperf and everything appeared to be OK. What I don't
> know is how the purge you mentioned works; I didn't know that any
> purge was running at all. How can I check whether it is happening?
> Moreover, when I did that test only one cluster was running. Now there
> are three clusters running on the same virtual switch.
>
> Software:
>
> Red Hat Enterprise Linux Server release 5.7 (Tikanga)
> cman-2.0.115-85.el5
> rgmanager-2.0.52-21.el5
> openais-0.80.6-30.el5
>
> Regards, Javi
>
> 2012/7/16 Digimer <lists@xxxxxxxxxx>
>
> Why did you set 'two_node="0" expected_votes="3"' on a two-node
> cluster? With this, losing a node will mean you lose quorum and all
> cluster activity will stop. Please change this to 'two_node="1"
> expected_votes="1"'.
>
> Did you confirm that your fencing actually works? Do 'fence_node
> node1' and 'fence_node node2' actually kill the target?
>
> Are you running into multicast issues? If your switch (virtual or
> real) purges multicast groups periodically, it will break the cluster.
>
> What version of the cluster software and what distro are you using?
>
> Digimer
>
> On 07/16/2012 12:03 PM, Javier Vela wrote:
> > Hi, two weeks ago I asked for some help building a two-node cluster
> > with HA-LVM. After some e-mails, I finally got my cluster working.
> > The problem now is that sometimes, and in some clusters (I have
> > three clusters with the same configuration), I see very strange
> > behaviour.
> >
> > #1 Openais detects some problem and shuts itself down. The network
> > is OK; it is a virtual device in VMware, shared with the other
> > clusters' heartbeat networks, and this only happens in one cluster.
> > The error messages:
> >
> > Jul 16 08:50:32 node1 openais[3641]: [TOTEM] FAILED TO RECEIVE
> > Jul 16 08:50:32 node1 openais[3641]: [TOTEM] entering GATHER state from 6.
> > Jul 16 08:50:36 node1 openais[3641]: [TOTEM] entering GATHER state from 0
> >
> > Do you know what I can check in order to solve the problem? I don't
> > know where I should start. What makes Openais fail to receive
> > messages?
> >
> > #2 I'm getting a lot of rgmanager errors when rgmanager tries to
> > change the service status, e.g. clusvcadm -d service. It always
> > happens when the two nodes are up. If I shut down one node, the
> > command finishes successfully. Prior to executing the command, I
> > always check the status with clustat, and everything is OK:
> >
> > clurgmgrd[5667]: <err> #52: Failed changing RG status
> >
> > Again, what can I check in order to detect problems with rgmanager
> > that clustat and cman_tool don't show?
> >
> > #3 Sometimes, but not always, a node that has been fenced cannot
> > join the cluster after the reboot. With clustat I can see that there
> > is quorum:
> >
> > [root@node2 ~]# clustat
> > Cluster Status test_cluster @ Mon Jul 16 05:46:57 2012
> > Member Status: Quorate
> >
> >  Member Name                                ID   Status
> >  ------ ----                                ---- ------
> >  node1-hb                                      1 Offline
> >  node2-hb                                      2 Online, Local, rgmanager
> >  /dev/disk/by-path/pci-0000:02:01.0-scsi-      0 Online, Quorum Disk
> >
> >  Service Name          Owner (Last)          State
> >  ------- ----          ----- ------          -----
> >  service:test          node2-hb              started
> >
> > The log shows how node2 fenced node1:
> >
> > node2 messages
> > Jul 13 04:00:31 node2 fenced[4219]: node1 not a cluster member after 0 sec post_fail_delay
> > Jul 13 04:00:31 node2 fenced[4219]: fencing node "node1"
> > Jul 13 04:00:36 node2 clurgmgrd[4457]: <info> Waiting for node #1 to be fenced
> > Jul 13 04:01:04 node2 fenced[4219]: fence "node1" success
> > Jul 13 04:01:06 node2 clurgmgrd[4457]: <info> Node #1 fenced; continuing
> >
> > But the node that tries to join the cluster says that there is no
> > quorum. In the end it comes up inquorate, without seeing node1 and
> > the quorum disk.
> >
> > node1 messages
> > Jul 16 05:48:19 node1 ccsd[4207]: Error while processing connect: Connection refused
> > Jul 16 05:48:19 node1 ccsd[4207]: Cluster is not quorate. Refusing connection.
> >
> > Do the three errors have something in common? What should I check?
> > I've ruled out the cluster configuration, because the cluster is
> > working and the errors don't appear on all the nodes. The most
> > annoying error currently is #1: every 10-15 minutes Openais fails
> > and the nodes get fenced. I attach the cluster.conf.
> >
> > Thanks in advance.
> >
> > Regards, Javi

--
Digimer
Papers and Projects: https://alteeve.com

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster