I didn't notice the qdisk entry, so yes, expected_votes="3" is fine. If
fencing is working, then that isn't a problem.

On some switches, multicast groups are occasionally deleted, forcing
members to re-join the multicast group (I've not seen this myself, but
I've heard of it on Cisco switches, IIRC). The idea was to remove unused
groups over time.
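A rough way to check for that, assuming eth0 is the heartbeat interface
on these hosts (adjust to your environment), is to note the multicast
address cman is using and then watch whether the node stays joined to
that group and IGMP traffic keeps flowing around the time of a failure:

  # show the multicast address cman is using, plus expected/total votes
  cman_tool status

  # confirm the heartbeat interface is still subscribed to that group
  ip maddr show dev eth0

  # watch IGMP queries, reports and leaves while the problem recurs
  tcpdump -i eth0 -n igmp

If the TOTEM "FAILED TO RECEIVE" errors line up with the IGMP queries or
membership reports stopping, that points at the switch side rather than
at openais itself.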
Is this a new cluster? If it is, have you considered using RHEL 6.3?
There are a lot of improvements in the cluster 3 stable series. If not,
can you update to RHEL 5.8 to get all the outstanding updates?

Digimer

On 07/16/2012 02:46 PM, Javier Vela wrote:
> Hi,
>
> I set two_node=0 on purpose, because I use a quorum disk with one
> additional vote. If one node fails, I still have two votes and the
> cluster remains quorate, avoiding a split-brain situation. Is this
> approach wrong? In my tests, this aspect of the quorum worked well.
>
> Fencing works very well. When something happens, the fencing kills the
> faulting server without any problems.
>
> The first time I ran into problems I checked multicast traffic between
> the nodes with iperf and everything appeared to be OK. What I don't
> know is how the purge you mentioned works; I didn't know that any
> purge was running at all. How can I check whether it is happening?
> Moreover, when I did that test only one cluster was running. Now there
> are three clusters running on the same virtual switch.
>
> Software:
>
> Red Hat Enterprise Linux Server release 5.7 (Tikanga)
> cman-2.0.115-85.el5
> rgmanager-2.0.52-21.el5
> openais-0.80.6-30.el5
>
> Regards, Javi
>
> 2012/7/16 Digimer <lists@xxxxxxxxxx>
>
> Why did you set 'two_node="0" expected_votes="3"' on a two-node
> cluster? With this, losing a node will mean you lose quorum and all
> cluster activity will stop. Please change this to 'two_node="1"
> expected_votes="1"'.
>
> Did you confirm that your fencing actually works? Do 'fence_node
> node1' and 'fence_node node2' actually kill the target?
>
> Are you running into multicast issues? If your switch (virtual or
> real) purges multicast groups periodically, it will break the cluster.
>
> What version of the cluster software and what distro are you using?
>
> Digimer
>
> On 07/16/2012 12:03 PM, Javier Vela wrote:
> > Hi, two weeks ago I asked for some help building a two-node cluster
> > with HA-LVM. After some e-mails, I finally got my cluster working.
> > The problem now is that sometimes, and in some clusters (I have
> > three clusters with the same configuration), I see very strange
> > behaviour.
> >
> > #1 Openais detects some problem and shuts itself down. The network
> > is OK; it is a virtual device in VMware, shared with the other
> > clusters' heartbeat networks, and this only happens in one cluster.
> > The error messages:
> >
> > Jul 16 08:50:32 node1 openais[3641]: [TOTEM] FAILED TO RECEIVE
> > Jul 16 08:50:32 node1 openais[3641]: [TOTEM] entering GATHER state from 6.
> > Jul 16 08:50:36 node1 openais[3641]: [TOTEM] entering GATHER state from 0
> >
> > Do you know what I can check in order to solve the problem? I don't
> > know where I should start. What makes Openais fail to receive
> > messages?
> >
> > #2 I'm getting a lot of rgmanager errors when rgmanager tries to
> > change the service status, e.g. clusvcadm -d service. It always
> > happens when the two nodes are up. If I shut down one node, the
> > command finishes successfully. Prior to executing the command, I
> > always check the status with clustat, and everything is OK:
> >
> > clurgmgrd[5667]: <err> #52: Failed changing RG status
> >
> > Again, what can I check in order to detect problems with rgmanager
> > that clustat and cman_tool don't show?
> >
> > #3 Sometimes, but not always, a node that has been fenced cannot
> > join the cluster after the reboot. With clustat I can see that there
> > is quorum:
> >
> > [root@node2 ~]# clustat
> > Cluster Status test_cluster @ Mon Jul 16 05:46:57 2012
> > Member Status: Quorate
> >
> >  Member Name                                ID   Status
> >  ------ ----                                ---- ------
> >  node1-hb                                      1 Offline
> >  node2-hb                                      2 Online, Local, rgmanager
> >  /dev/disk/by-path/pci-0000:02:01.0-scsi-      0 Online, Quorum Disk
> >
> >  Service Name          Owner (Last)          State
> >  ------- ----          ----- ------          -----
> >  service:test          node2-hb              started
> >
> > The log shows how node2 fenced node1:
> >
> > node2 messages
> > Jul 13 04:00:31 node2 fenced[4219]: node1 not a cluster member after 0 sec post_fail_delay
> > Jul 13 04:00:31 node2 fenced[4219]: fencing node "node1"
> > Jul 13 04:00:36 node2 clurgmgrd[4457]: <info> Waiting for node #1 to be fenced
> > Jul 13 04:01:04 node2 fenced[4219]: fence "node1" success
> > Jul 13 04:01:06 node2 clurgmgrd[4457]: <info> Node #1 fenced; continuing
> >
> > But the node that tries to join the cluster says that there is no
> > quorum. In the end it comes up inquorate, without seeing node1 and
> > the quorum disk.
> >
> > node1 messages
> > Jul 16 05:48:19 node1 ccsd[4207]: Error while processing connect: Connection refused
> > Jul 16 05:48:19 node1 ccsd[4207]: Cluster is not quorate. Refusing connection.
> >
> > Do the three errors have something in common? What should I check?
> > I've ruled out the cluster configuration, because the cluster is
> > working and the errors don't appear on all the nodes. The most
> > annoying error currently is #1: every 10-15 minutes Openais fails
> > and the nodes get fenced. I attach the cluster.conf.
> >
> > Thanks in advance.
> >
> > Regards, Javi

--
Digimer
Papers and Projects: https://alteeve.com

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster