I'm seeing the same problem in a 4.7 cluster.
Chrissie, is there a solution or another bz for the problem?

-Mark

On Wednesday 11 February 2009 10:17:30 Chrissie Caulfield wrote:
> thijn wrote:
> > Hi,
> >
> > I have the following problem.
> > CMAN: removing node [server1] from the cluster : Missed too many
> > heartbeats
> > When the server comes back up:
> > Feb 10 14:43:58 server1 kernel: CMAN: sending membership request
> > after which it will try to join until the end of time.
> >
> > In the current case, server2 is active and server1 is the node that
> > cannot join the cluster.
> >
> > The setup is a two-node cluster.
> > We have had the problem on several clusters.
> > We usually "fixed" it by rebooting the other node, at which point the
> > cluster would repair itself and everything ran smoothly from then on.
> > Naturally this disrupts any services running on the cluster, and it's
> > not really a solution that will win prizes.
> > The problem is that server1 (the problem one) is in an inquorate state
> > and we are unable to get it to a quorate state, nor do we see why this
> > is the case.
> > We tried to reproduce the problem on a test setup, but were unable to.
> >
> > So we decided to try to find a way to fix the state of the cluster using
> > the tools the system provides.
> >
> > The problem presents itself after a fence action by either node.
> > When we bring down both nodes to stabilize the issue, the cluster
> > becomes healthy, and after that we can reboot either node and it
> > will rejoin the cluster.
> > It seems the problem presents itself when "pulling the plug" out of the
> > server.
> > We run on IBM Xservers using the SA-adapter as a fence device.
> > The fence device is in a different subnet than the subnet on which the
> > cluster communicates.
> > Both fence devices are on the same subnet/vlan.
> >
> > CentOS release 4.6 (Final)
> > Linux server2 2.6.9-67.ELsmp #1 SMP Fri Nov 16 12:48:03 EST 2007 i686
> > i686 i386 GNU/Linux
> > cman_tool 1.0.17 (built Mar 20 2007 17:10:52)
> > Copyright (C) Red Hat, Inc. 2004 All rights reserved.
> >
> > All versions of libraries and packages, kernel modules and everything
> > else the GFS cluster depends on are identical on both nodes.
> >
> > Cluster.conf
> > [root@server1 log]# cat /etc/cluster/cluster.conf
> > <?xml version="1.0"?>
> > <cluster config_version="3" name="NAME_cluster">
> >   <fence_daemon post_fail_delay="0" post_join_delay="3"/>
> >   <clusternodes>
> >     <clusternode name="server1.production.loc" votes="1">
> >       <fence>
> >         <method name="1">
> >           <device name="saserver1"/>
> >         </method>
> >       </fence>
> >     </clusternode>
> >     <clusternode name="server2.production.loc" votes="1">
> >       <fence>
> >         <method name="1">
> >           <device name="saserver2"/>
> >         </method>
> >       </fence>
> >     </clusternode>
> >   </clusternodes>
> >   <cman expected_votes="1" two_node="1"/>
> >   <fencedevices>
> >     <fencedevice agent="fence_rsa" ipaddr="10.13.110.114" login="saadapter"
> >       name="saserver1" passwd="XXXXXXX"/>
> >     <fencedevice agent="fence_rsa" ipaddr="10.13.110.115" login="saadapter"
> >       name="saserver2" passwd="XXXXXXX"/>
> >   </fencedevices>
> >   <rm>
> >     <failoverdomains/>
> >     <resources/>
> >   </rm>
> > </cluster>
> >
> > [root@server1 log]# cat /etc/hosts
> > 127.0.0.1 localhost.localdomain localhost
> >
> > Both servers are able to ping each other and also the broadcast address,
> > so there is no firewall filtering UDP packets.
> > When I tcpdump the line I see traffic going both ways.
> >
> > Both servers are in the same vlan.
> > 14:51:28.703240 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto
> > 17, length: 56) server2.production.loc.6809 >
> > broadcast.production.loc.6809: UDP, length 28
> > 14:51:28.703277 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto
> > 17, length: 140) server1.production.loc.6809 >
> > server2.production.loc.6809: UDP, length 112
> > 14:51:33.703240 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto
> > 17, length: 56) server2.production.loc.6809 >
> > broadcast.production.loc.6809: UDP, length 28
> > 14:51:33.703310 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto
> > 17, length: 140) server1.production.loc.6809 >
> > server2.production.loc.6809.6809: UDP, length 112
> >
> > Is this normal network behavior when a cluster is inquorate?
> > I see that server1 is talking to server2, but server2 is only talking in
> > broadcasts.
> >
> > When I start or try to join the cluster:
> > Feb 10 09:36:06 server1 cman: cman_tool: Node is already active failed
> >
> > [root@server1 ~]# cman_tool status
> > Protocol version: 5.0.1
> > Config version: 3
> > Cluster name: NAME_cluster
> > Cluster ID: 64692
> > Cluster Member: No
> > Membership state: Joining
> >
> > [root@server2 log]# cman_tool status
> > Protocol version: 5.0.1
> > Config version: 3
> > Cluster name: RWSEems_cluster
> > Cluster ID: 64692
> > Cluster Member: Yes
> > Membership state: Cluster-Member
> > Nodes: 1
> > Expected_votes: 1
> > Total_votes: 1
> > Quorum: 1
> > Active subsystems: 7
> > Node name: server2.production.loc
> > Node ID: 2
> > Node addresses: server1.production.loc
> >
> > [root@server1 ~]# cman_tool nodes
> > Node  Votes Exp Sts  Name
> >
> > [root@server2 log]# cman_tool nodes
> > Node  Votes Exp Sts  Name
> >    1    1    1   X   server1.production.loc
> >    2    1    1   M   server2.production.loc
> >
> > When I start cman:
> > service cman start
> >
> > Feb 10 14:06:30 server1 kernel: CMAN: Waiting to join or form a
> > Linux-cluster
> > Feb 10 14:06:30 server1 ccsd[21964]: Connected to cluster infrastruture
> > via: CMAN/SM Plugin v1.1.7.4
> > Feb 10 14:06:30 server1 ccsd[21964]: Initial status:: Inquorate
> >
> > It seems to me that this should be fixable with the tools as provided
> > with the RedHat Cluster Suite, without disturbing the running cluster.
> > It seems quite insane if I need to restart my cluster to have it all
> > working again.. kinda spoils the idea of running a cluster.
> > This setup is running in an HA environment and we can have next to no
> > downtime.
> >
> > The logs on the healthy server (server2) do not mention or complain
> > about any errors when rebooting, restarting cman or when server1 wants
> > to join the cluster.
> > We see no "disallowed", "refused" or anything else suggesting that
> > server2 is not willing to play with server1.
> >
> > I have been looking at this thing for a while now.. am I missing
> > anything?
>
> This is a known bug, see
>
> https://bugzilla.redhat.com/show_bug.cgi?id=475293
>
> It's fixed in 4.7 or you can run a program to set up a workaround.
>
> Having said that I have heard reports of it still happening in some
> circumstances ... but I don't have any more detail
>
> --
>
> Chrissie
>
> --
> Linux-cluster mailing list
> Linux-cluster@xxxxxxxxxx
> https://www.redhat.com/mailman/listinfo/linux-cluster

--
Dipl.-Ing. Mark Hlawatschek
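
For anyone who wants to script a check for the stuck state described in the quoted report, here is a minimal sketch. It assumes Python is available on the node and that `cman_tool status` prints plain `Key: value` lines, as in the output quoted above. The function names and the sample text are illustrative only, not part of any cluster tool, and the check only recognises the symptom (not a member, state "Joining"); it does not diagnose or fix the bug.

```python
def parse_cman_status(text):
    """Parse 'Key: value' lines (as printed by cman_tool status) into a dict."""
    status = {}
    for line in text.splitlines():
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if sep:
            status[key.strip()] = value.strip()
    return status


def is_stuck_joining(status):
    """True when the node reports it is not a member and is still joining."""
    return (status.get("Cluster Member") == "No"
            and status.get("Membership state") == "Joining")


# Sample copied from the server1 output quoted in this thread.
SAMPLE = """\
Protocol version: 5.0.1
Config version: 3
Cluster name: NAME_cluster
Cluster ID: 64692
Cluster Member: No
Membership state: Joining
"""

if __name__ == "__main__":
    print(is_stuck_joining(parse_cman_status(SAMPLE)))
```

In practice the text would come from running `cman_tool status` on the node rather than from a hard-coded sample, and a single "Joining" reading is normal during startup, so the check is only meaningful if the state persists across repeated runs.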