Re: Starting two-node cluster with only one node

Hello,

As your cluster worked well on CentOS 5.2, the networking hardware components are unlikely to be the culprit in this case, but I still think it is a cluster-communication-related problem.

It could be your iptables ruleset... Try to disable the firewall (e.g. service iptables stop on both nodes) and check again...

You can also use tshark to check this, with something like:

tshark -i <interface the cluster is using> -f 'host <multicast IP the cluster is using>' -V | less

Have you checked that openais is still set to off in chkconfig after your upgrade (chkconfig --list openais)?

Abed-nego G. Escobal, Jr. wrote:
Thanks for giving the pointers!

uname -r on both nodes

2.6.18-128.1.16.el5

on node01

rpm -q cman gfs-utils kmod-gfs modcluster ricci luci cluster-snmp iscsi-initiator-utils lvm2-cluster openais oddjob rgmanager
cman-2.0.98-2chrissie
gfs-utils-0.1.18-1.el5
kmod-gfs-0.1.23-5.el5_2.4
kmod-gfs-0.1.31-3.el5
modcluster-0.12.1-2.el5.centos
ricci-0.12.1-7.3.el5.centos.1
luci-0.12.1-7.3.el5.centos.1
cluster-snmp-0.12.1-2.el5.centos
iscsi-initiator-utils-6.2.0.868-0.18.el5_3.1
lvm2-cluster-2.02.40-7.el5
openais-0.80.3-22.el5_3.8
oddjob-0.27-9.el5
rgmanager-2.0.46-1.el5.centos.3

on node02

rpm -q cman gfs-utils kmod-gfs modcluster ricci luci cluster-snmp iscsi-initiator-utils lvm2-cluster openais oddjob rgmanager
cman-2.0.98-2chrissie
gfs-utils-0.1.18-1.el5
kmod-gfs-0.1.31-3.el5
modcluster-0.12.1-2.el5.centos
ricci-0.12.1-7.3.el5.centos.1
luci-0.12.1-7.3.el5.centos.1
cluster-snmp-0.12.1-2.el5.centos
iscsi-initiator-utils-6.2.0.868-0.18.el5_3.1
lvm2-cluster-2.02.40-7.el5
openais-0.80.3-22.el5_3.8
oddjob-0.27-9.el5
rgmanager-2.0.46-1.el5.centos.3

I used http://knowledgelayer.softlayer.com/questions/443/GFS+howto to configure my cluster. When it was still on 5.2 the cluster worked, but after the recent update to 5.3, it broke.

In one of the threads I found in the archive, it says that there is a problem with the most current official version of cman (bug id 485026). I replaced the current cman package with cman-2.0.98-2chrissie to test whether this was my problem; it seems not, so I will be moving back to the official package.
I also found on another thread that openais was the culprit, so I changed it back to openais-0.80.3-15.el5, even though the changelog indicates that a lot of bug fixes went into the most current official package. After doing that, it still did not work.

I tried clean_start="1" with caution: I unmounted the iSCSI volume and then started cman, but it still did not work. Most recently I tried post_join_delay="-1". I had not noticed before that there is a man page for fenced; that option is much safer than clean_start="1", but it still did not fix it. The man pages I have read over and over again are those for cman and cluster.conf. Some pages of the online manual are not really suitable for my situation, because I do not have X installed on the machines and those pages use system-config-cluster.

As I understand from the online manual and the FAQ, qdisk is not required if I have two_node="1", so I did not create one. I have removed the fence_daemon tag, since I only used it to try the suggested solutions. The hosts are present in each other's /etc/hosts files with the correct IPs.
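For reference, a minimal two-node cluster.conf skeleton showing just the quorum settings might look like this (node names are the ones from this thread; the cluster name is a placeholder, and real setups still need per-node fencing configured):

```
<?xml version="1.0"?>
<cluster name="examplecluster" config_version="1">
  <!-- two_node="1" together with expected_votes="1" lets a single
       node be quorate on its own, so no qdisk is required -->
  <cman two_node="1" expected_votes="1"/>
  <clusternodes>
    <clusternode name="node01.company.com" nodeid="1" votes="1"/>
    <clusternode name="node02.company.com" nodeid="2" votes="1"/>
  </clusternodes>
  <fencedevices/>
</cluster>
```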


The ping results

ping node02.company.com

--- node02.company.com ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 8999ms
rtt min/avg/max/mdev = 0.010/0.016/0.034/0.007 ms

ping node01.company.com

--- node01.company.com ping statistics ---
10 packets transmitted, 10 received, 0% packet loss, time 9003ms
rtt min/avg/max/mdev = 0.341/0.668/1.084/0.273 ms

According to the people in the data center, the switch supports multicast communication on all ports that are used for cluster communication because they are in the same VLAN.

As for the logs, I will send fresh ones as soon as possible; currently I do not have a large enough time window to bring down the machines.

As for Wireshark, I will read the man pages to learn how to use it.

Please advise if any other information is needed to solve this. I am very grateful for the very detailed pointers. Thank you very much!

--- On Fri, 7/17/09, Marc - A. Dahlhaus [ Administration | Westermann GmbH ] <mad@xxxxxx> wrote:

From: Marc - A. Dahlhaus [ Administration | Westermann GmbH ] <mad@xxxxxx>
Subject: Re:  Starting two-node cluster with only one node
To: "linux clustering" <linux-cluster@xxxxxxxxxx>
Date: Friday, 17 July, 2009, 5:56 PM
Hello,


Can you give us some hard facts on what versions of the cluster-suite packages you are using in your environment, and also the related logs?

Have you already read the corresponding parts of the cluster suite's manual, the man pages, and the FAQ, and searched the list archives for similar problems? If not, do it; there are many good hints to find there.


The nodes find each other and create a cluster very quickly IF they can talk to each other. Since no cluster networking is involved when a node that is quorate by itself fences a remote node, this could be your problem.

You should change to fence_manual, and switch back to your real fencing devices after you have debugged your problem. Also get rid of the <fence_daemon ... /> tag in your cluster.conf: fenced does the right thing by default if the remaining configuration is right, and right now the tag is just hiding part of the problem.
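A temporary fence_manual setup of that kind might look like this in cluster.conf (the device name "human" is a placeholder; after a real failure an operator must acknowledge the fence with fence_ack_manual):

```
<clusternode name="node01.company.com" nodeid="1" votes="1">
  <fence>
    <method name="1">
      <!-- manual fencing for debugging only -->
      <device name="human" nodename="node01.company.com"/>
    </method>
  </fence>
</clusternode>

<fencedevices>
  <fencedevice name="human" agent="fence_manual"/>
</fencedevices>
```

Remember to switch back to real fencing devices once the problem is found.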

Also, the 5-minute pause on cman start smells like a DNS-lookup problem or some other network-related problem to me.

Here is a short checklist to make sure the nodes can talk to each other:

Can the individual nodes ping each other?

Can the individual nodes DNS-resolve the other node's name (the one you used in your cluster.conf)? (Try to add the names to your /etc/hosts file; that way you have a working cluster even if your DNS system goes on vacation.)
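For example, /etc/hosts entries on both nodes could look like this (the addresses are placeholders):

```
# cluster node names exactly as used in cluster.conf
10.0.0.1   node01.company.com   node01
10.0.0.2   node02.company.com   node02
```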

Is your switch allowing multicast communication on all ports that are used for cluster communication? (This is a prerequisite for the openais/corosync-based cman, which means anything >= RHEL 5. Search the archives on this if you need more info.)

Can you trace (e.g. with Wireshark's tshark) incoming cluster communication from remote nodes? (If you haven't changed your fencing to fence_manual, your listening system will get fenced before you can get any useful information out of it. Try with and without an active firewall.)

If all of the above can be answered with "yes", your cluster should form just fine. After that, you could try adding a qdisk device as a tiebreaker and test it, just to be sure you have a working last-man-standing setup...

Hope that helps,

Marc

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
