Thanks for giving the pointers!

uname -r on both nodes gives 2.6.18-128.1.16.el5.

On node01:

  rpm -q cman gfs-utils kmod-gfs modcluster ricci luci cluster-snmp iscsi-initiator-utils lvm2-cluster openais oddjob rgmanager
  cman-2.0.98-2chrissie
  gfs-utils-0.1.18-1.el5
  kmod-gfs-0.1.23-5.el5_2.4
  kmod-gfs-0.1.31-3.el5
  modcluster-0.12.1-2.el5.centos
  ricci-0.12.1-7.3.el5.centos.1
  luci-0.12.1-7.3.el5.centos.1
  cluster-snmp-0.12.1-2.el5.centos
  iscsi-initiator-utils-6.2.0.868-0.18.el5_3.1
  lvm2-cluster-2.02.40-7.el5
  openais-0.80.3-22.el5_3.8
  oddjob-0.27-9.el5
  rgmanager-2.0.46-1.el5.centos.3

On node02:

  rpm -q cman gfs-utils kmod-gfs modcluster ricci luci cluster-snmp iscsi-initiator-utils lvm2-cluster openais oddjob rgmanager
  cman-2.0.98-2chrissie
  gfs-utils-0.1.18-1.el5
  kmod-gfs-0.1.31-3.el5
  modcluster-0.12.1-2.el5.centos
  ricci-0.12.1-7.3.el5.centos.1
  luci-0.12.1-7.3.el5.centos.1
  cluster-snmp-0.12.1-2.el5.centos
  iscsi-initiator-utils-6.2.0.868-0.18.el5_3.1
  lvm2-cluster-2.02.40-7.el5
  openais-0.80.3-22.el5_3.8
  oddjob-0.27-9.el5
  rgmanager-2.0.46-1.el5.centos.3

I used http://knowledgelayer.softlayer.com/questions/443/GFS+howto to configure my cluster. When it was still on 5.2 the cluster worked, but after the recent update to 5.3 it broke.

One of the threads I found in the archive states that there is a problem with the most current official version of cman (bug id 485026). I replaced the current cman package with cman-2.0.98-2chrissie to test whether this was my problem; it seems not, so I will be moving back to the official package. Another thread pointed at openais as the culprit, so I changed it back to openais-0.80.3-15.el5, even though the changelog indicates a lot of bug fixes went into the most current official package. After doing that, it still did not work.

I tried clean_start="1" with caution: I unmounted the iSCSI volume and then started cman, but it still did not work. The most recent thing I tried is post_join_delay="-1". I had not noticed that there is a man page for fenced; that approach is much safer than clean_start="1", but it still did not fix it. The man pages that I have read over and over again are cman and cluster.conf. Some pages in the online manual are not really suitable for my situation, because I do not have X installed on the machines and some of those pages use system-config-cluster.

As I understand from the online manual and FAQ, a qdisk is not required if I have two_node="1", so I did not create one. I have removed the fence_daemon tag, since I only added it to try the solutions that were suggested.

The hosts are present in each other's /etc/hosts files with the correct IPs. The ping results:

  ping node02.company.com
  --- node01.company.com ping statistics ---
  10 packets transmitted, 10 received, 0% packet loss, time 8999ms
  rtt min/avg/max/mdev = 0.010/0.016/0.034/0.007 ms

  ping node01.company.com
  --- node01.company.com ping statistics ---
  10 packets transmitted, 10 received, 0% packet loss, time 9003ms
  rtt min/avg/max/mdev = 0.341/0.668/1.084/0.273 ms

According to the people in the data center, the switch supports multicast communication on all ports that are used for cluster communication, because they are all in the same VLAN.

For the logs, I will send fresh ones as soon as possible; currently I do not have a big enough time window to bring down the machines. For wireshark, I will be reading the man pages on how to use it.
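Here is roughly what I intend to try first with tshark (only a sketch on my side -- that eth0 is the interface carrying cluster traffic and that openais uses its default UDP ports 5404/5405 are assumptions I still have to confirm on the boxes):

  # watch for incoming cluster (openais/totem) traffic while cman starts;
  # adjust the interface and ports if cman is configured differently
  tshark -i eth0 -f "udp and (port 5404 or port 5405)"

  # narrow it down to packets actually arriving from the other node
  tshark -i eth0 -f "src host node02.company.com and udp"

If nothing from the remote node shows up there while cman is starting, I will look at the switch and firewall settings again, as you suggested.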
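Regarding the /etc/hosts entries mentioned above, they have this form on both nodes (the addresses below are placeholders, not our real ones):

  # /etc/hosts (addresses are placeholders)
  192.168.10.1   node01.company.com   node01
  192.168.10.2   node02.company.com   node02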
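When I get the maintenance window I will also capture the following on both nodes and compare them, since (if I read the cman man page correctly) the cluster name, cluster id and multicast address reported there should match on both members:

  # run on node01 and node02 and compare the output
  cman_tool status
  cman_tool nodes
  cman_tool services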
Please advise if any other information is needed to solve this. I am very grateful for the very detailed pointers. Thank you very much!

--- On Fri, 7/17/09, Marc - A. Dahlhaus [ Administration | Westermann GmbH ] <mad@xxxxxx> wrote:

> From: Marc - A. Dahlhaus [ Administration | Westermann GmbH ] <mad@xxxxxx>
> Subject: Re: Starting two-node cluster with only one node
> To: "linux clustering" <linux-cluster@xxxxxxxxxx>
> Date: Friday, 17 July, 2009, 5:56 PM
>
> Hello,
>
> can you give us some hard facts on which versions of the cluster-suite
> packages you are using in your environment, and also the related logs?
>
> Have you read the corresponding parts of the cluster suite's manual, man
> pages and FAQ, and searched the list archives for similar problems
> already? If not, do it -- there are many good hints to be found there.
>
> The nodes find each other and create a cluster very fast IF they can
> talk to each other. As no cluster networking is involved in fencing a
> remote node if the fencing node is itself quorate, this could be your
> problem.
>
> You should change to fence_manual and switch back to your real fencing
> devices after you have debugged your problem. Also get rid of the
> <fence_daemon ... /> tag in your cluster.conf, as fenced does the right
> thing by default if the remaining configuration is right, and right now
> it is just hiding a part of the problem.
>
> Also, the 5 minute delay on cman start smells like a DNS-lookup problem
> or some other network-related problem to me.
>
> Here is a short checklist to be sure the nodes can talk to each other:
>
> Can the individual nodes ping each other?
>
> Can the individual nodes DNS-lookup the other node names (the ones you
> used in your cluster.conf)? (Try to add them to your /etc/hosts file;
> that way you have a working cluster even if your DNS system goes on
> vacation.)
>
> Is your switch allowing multicast communication on all ports that are
> used for cluster communication? (This is a prerequisite for the
> openais / corosync based cman, which is anything >= RHEL 5. Search the
> archives on this if you need more info...)
>
> Can you trace (e.g. with wireshark's tshark) incoming cluster
> communication from remote nodes? (If you haven't changed your fencing
> to fence_manual, your listening system will get fenced before you can
> get any useful information out of it. Try with and without an active
> firewall.)
>
> If all of the above can be answered with "yes", your cluster should
> form just fine. You could try to add a qdisk device as a tiebreaker
> after that and test it, just to be sure you have a working
> last-man-standing setup...
>
> Hope that helps,
>
> Marc
>
> On Thursday, 2009-07-16 at 23:41 -0700, Abed-nego G. Escobal, Jr. wrote:
> >
> > Thanks for the tip. It helped by stopping the nodes from kicking each
> > other out, as per the logs, but I still have a split-brain status.
> >
> > On node01
> >
> > # /usr/sbin/cman_tool nodes
> > Node  Sts   Inc   Joined               Name
> >    1   M    680   2009-07-17 00:30:42  node01.company.com
> >    2   X      0                        node02.company.com
> >
> > # /usr/sbin/clustat
> > Cluster Status for GFSCluster @ Fri Jul 17 01:01:09 2009
> > Member Status: Quorate
> >
> >  Member Name              ID   Status
> >  ------ ----              ---- ------
> >  node01.company.com          1 Online, Local
> >  node02.company.com          2 Offline
> >
> >
> > On node02
> >
> > # /usr/sbin/cman_tool nodes
> > Node  Sts   Inc   Joined               Name
> >    1   X      0                        node01.company.com
> >    2   M    676   2009-07-17 00:30:43  node02.company.com
> >
> > # /usr/sbin/clustat
> > Cluster Status for GFSCluster @ Fri Jul 17 01:01:22 2009
> > Member Status: Quorate
> >
> >  Member Name              ID   Status
> >  ------ ----              ---- ------
> >  node01.company.com          1 Offline
> >  node02.company.com          2 Online, Local
> >
> >
> > Another thing that I have noticed:
> >
> > 1. Start node01 with only itself as a member of the cluster
> > 2. Update cluster.conf to add node02 as an additional member
> > 3. Start node02
> >
> > This yields both nodes being quorate (split brain), but only node02
> > tries to fence out node01. After some time, clustat shows both of them
> > in the same cluster. Then I start clvmd on node02, which does not
> > succeed. After trying to start the clvmd service, clustat shows a
> > split brain again.
> >
> > Is there some troubleshooting that I should be doing?
> >
> >
> > --- On Thu, 7/16/09, Aaron Benner <tfrumbacher@xxxxxxxxx> wrote:
> >
> > > From: Aaron Benner <tfrumbacher@xxxxxxxxx>
> > > Subject: Re: Starting two-node cluster with only one node
> > > To: "linux clustering" <linux-cluster@xxxxxxxxxx>
> > > Date: Thursday, 16 July, 2009, 10:04 PM
> > >
> > > Have you tried setting the "post_join_delay" value in the
> > > <fence_daemon ...> declaration to -1?
> > >
> > > <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="-1" />
> > >
> > > This is a hint I picked up from the fenced man page section on
> > > avoiding boot-time fencing. It tells fenced to wait until all of the
> > > nodes have joined the cluster before starting up. We use this on a
> > > couple of 2-node clusters (with qdisk) to allow them to start up
> > > without the first node to grab the quorum disk fencing the other node.
> > >
> > > --Aaron
> > >
> > > On Jul 16, 2009, at 12:16 AM, Abed-nego G. Escobal, Jr. wrote:
> > > >
> > > > Tried it and now the two-node cluster is running with only one
> > > > node. My problem right now is how to force the second node to join
> > > > the first node's cluster. Right now it is creating its own cluster
> > > > and trying to fence the first node. I tried cman_tool leave on the
> > > > second node but I got
> > > >
> > > > cman_tool: Error leaving cluster: Device or resource busy
> > > >
> > > > clvmd and gfs are not running on the second node. What is running
> > > > on the second node is cman. When I did
> > > >
> > > > service cman start
> > > >
> > > > it took approximately 5 minutes before I got the [ok] message. Am I
> > > > missing something here? Am I not doing it right? Should I be doing
> > > > something else?
> > > >
> > > > --- On Thu, 7/16/09, Abed-nego G. Escobal, Jr. <abednegoyulo@xxxxxxxxx> wrote:
> > > >
> > > >> From: Abed-nego G. Escobal, Jr. <abednegoyulo@xxxxxxxxx>
> > > >> Subject: Starting two-node cluster with only one node
> > > >> To: "linux clustering" <linux-cluster@xxxxxxxxxx>
> > > >> Date: Thursday, 16 July, 2009, 10:46 AM
> > > >>
> > > >> Using the config file below
> > > >>
> > > >> <?xml version="1.0"?>
> > > >> <cluster name="GFSCluster" config_version="5">
> > > >>   <cman expected_votes="1" two_node="1"/>
> > > >>   <clusternodes>
> > > >>     <clusternode name="node01.company.com" votes="1" nodeid="1">
> > > >>       <fence>
> > > >>         <method name="single">
> > > >>           <device name="node01_ipmi"/>
> > > >>         </method>
> > > >>       </fence>
> > > >>     </clusternode>
> > > >>     <clusternode name="node02.company.com" votes="1" nodeid="2">
> > > >>       <fence>
> > > >>         <method name="single">
> > > >>           <device name="node02_ipmi"/>
> > > >>         </method>
> > > >>       </fence>
> > > >>     </clusternode>
> > > >>   </clusternodes>
> > > >>   <fencedevices>
> > > >>     <fencedevice name="node01_ipmi" agent="fence_ipmilan" ipaddr="10.1.0.5" login="root" passwd="********"/>
> > > >>     <fencedevice name="node02_ipmi" agent="fence_ipmilan" ipaddr="10.1.0.7" login="root" passwd="********"/>
> > > >>   </fencedevices>
> > > >>   <rm>
> > > >>     <failoverdomains/>
> > > >>     <resources/>
> > > >>   </rm>
> > > >> </cluster>
> > > >>
> > > >> Is it possible to start the cluster by bringing up only one node?
> > > >> The reason I ask is that currently, bringing them up together
> > > >> produces a split brain: each of them becomes a member of its own
> > > >> cluster named GFSCluster and fences the other. My plan is to bring
> > > >> up only one node to create a quorum, then bring the other one up
> > > >> and manually join it to the existing cluster.
> > > >>
> > > >> I have already done the clean_start approach but it seems it does
> > > >> not work.
> >

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster