Hi,

I have the following problem:

CMAN: removing node [server1] from the cluster : Missed too many heartbeats

When the server comes back up:

Feb 10 14:43:58 server1 kernel: CMAN: sending membership request

after which it keeps trying to join until the end of time. In the current case server2 is active and server1 is the one that cannot join the cluster.

The setup is a two-node cluster, and we have had this problem on several clusters. We usually "fixed" it by rebooting the other node, after which the cluster would repair itself and everything ran smoothly from then on. Naturally this disrupts any services running on the cluster, and it is not really a solution that will win prizes.

The problem is that server1 (the problem node) is in an inquorate state and we are unable to get it to a quorate state, nor do we see why this is the case. We tried to reproduce the problem on a test setup, but were unable to. So we decided to try to find a way to fix the state of the cluster using the tools the system provides (the restart sequence we attempt on server1 is sketched below, after the version details).

The problem presents itself after a fence action by either node. When we bring down both nodes to stabilize the issue, the cluster becomes healthy again, and after that we can reboot either node and it will rejoin the cluster. The problem seems to appear when "pulling the plug" out of a server.

We run on IBM xSeries servers, using the RSA adapter as the fence device. The fence devices are in a different subnet than the one on which the cluster communicates; both fence devices are on the same subnet/VLAN.

CentOS release 4.6 (Final)
Linux server2 2.6.9-67.ELsmp #1 SMP Fri Nov 16 12:48:03 EST 2007 i686 i686 i386 GNU/Linux
cman_tool 1.0.17 (built Mar 20 2007 17:10:52)
Copyright (C) Red Hat, Inc. 2004 All rights reserved.

All versions of the libraries, packages, kernel modules and everything else the GFS cluster depends on are identical on both nodes.
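For completeness, when we try to recover server1 without touching server2, what we run on server1 is roughly the following (a sketch assuming the standard CentOS 4 cluster init scripts; clvmd/gfs/rgmanager only where actually installed):

# stop the cluster stack on server1 (reverse of the start order)
service rgmanager stop
service gfs stop
service clvmd stop
service fenced stop
service cman stop
service ccsd stop

# and bring it back up again
service ccsd start
service cman start       # this is where it gets stuck in the "Joining" state
service fenced start
service clvmd start
service gfs start
service rgmanager start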
Cluster.conf:

[root@server1 log]# cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="3" name="NAME_cluster">
        <fence_daemon post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="server1.production.loc" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="saserver1"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="server2.production.loc" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="saserver2"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1"/>
        <fencedevices>
                <fencedevice agent="fence_rsa" ipaddr="10.13.110.114" login="saadapter" name="saserver1" passwd="XXXXXXX"/>
                <fencedevice agent="fence_rsa" ipaddr="10.13.110.115" login="saadapter" name="saserver2" passwd="XXXXXXX"/>
        </fencedevices>
        <rm>
                <failoverdomains/>
                <resources/>
        </rm>
</cluster>

[root@server1 log]# cat /etc/hosts
127.0.0.1       localhost.localdomain   localhost

Both servers are able to ping each other and also the broadcast address, so there is no firewall filtering the UDP packets. When I tcpdump the line I see traffic going both ways (both servers are in the same VLAN):

14:51:28.703240 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto 17, length: 56) server2.production.loc.6809 > broadcast.production.loc.6809: UDP, length 28
14:51:28.703277 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto 17, length: 140) server1.production.loc.6809 > server2.production.loc.6809: UDP, length 112
14:51:33.703240 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto 17, length: 56) server2.production.loc.6809 > broadcast.production.loc.6809: UDP, length 28
14:51:33.703310 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto 17, length: 140) server1.production.loc.6809 > server2.production.loc.6809: UDP, length 112

Is this normal network behavior when a cluster is inquorate? I see that server1 is talking to server2, but server2 is only talking in broadcasts.

When I try to join the cluster:

Feb 10 09:36:06 server1 cman: cman_tool: Node is already active failed

[root@server1 ~]# cman_tool status
Protocol version: 5.0.1
Config version: 3
Cluster name: NAME_cluster
Cluster ID: 64692
Cluster Member: No
Membership state: Joining

[root@server2 log]# cman_tool status
Protocol version: 5.0.1
Config version: 3
Cluster name: RWSEems_cluster
Cluster ID: 64692
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 1
Expected_votes: 1
Total_votes: 1
Quorum: 1
Active subsystems: 7
Node name: server2.production.loc
Node ID: 2
Node addresses: server1.production.loc

[root@server1 ~]# cman_tool nodes
Node  Votes Exp Sts  Name

[root@server2 log]# cman_tool nodes
Node  Votes Exp Sts  Name
   1    1    1   X   server1.production.loc
   2    1    1   M   server2.production.loc

When I start cman:

service cman start
Feb 10 14:06:30 server1 kernel: CMAN: Waiting to join or form a Linux-cluster
Feb 10 14:06:30 server1 ccsd[21964]: Connected to cluster infrastruture via: CMAN/SM Plugin v1.1.7.4
Feb 10 14:06:30 server1 ccsd[21964]: Initial status:: Inquorate

It seems to me that this should be fixable with the tools provided with the Red Hat Cluster Suite, without disturbing the running cluster. It seems quite insane if I need to restart my whole cluster to get it all working again; that rather spoils the idea of running a cluster. This setup runs in an HA environment and we can afford next to no downtime.

The logs on the healthy server (server2) do not mention or complain about any errors when rebooting, when restarting cman, or when server1 wants to join the cluster.
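For reference, this is roughly how we look for those errors on server2 (a sketch assuming the default syslog setup, where ccsd, cman and fenced all log to /var/log/messages):

# show the most recent cluster-related messages on server2
grep -iE 'ccsd|cman|fenced|fence_rsa' /var/log/messages | tail -n 50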
We see no "disallowed", "refused" or anything else indicating that server2 is not willing to play with server1.

I have been looking at this for a while now... am I missing anything?

Thank you in advance.

--
with kind regards,

E.Novation Hosting Center
Thijn van der Schoot
Operations: Unix & Network

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster