If you haven't tried them already, the following fence_daemon settings
in cluster.conf might help, especially "clean_start":
<fence_daemon clean_start="1" post_fail_delay="5" post_join_delay="15"/>
clean_start --> assume all nodes are in a clean state at startup, so
fenced skips its startup fencing
post_fail_delay --> seconds to wait before fencing a node after it is
considered failed (i.e. the cluster has lost connection with it)
post_join_delay --> seconds to wait after a node joins the fence domain
before fencing any node that should be fenced at startup
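
To make the placement concrete, here is a minimal Python sketch that
sets these three attributes on the fence_daemon element and bumps
config_version. The sample document, the cluster name "example" and the
helper name set_fence_daemon are assumptions made for illustration only;
a real cluster.conf carries many more elements, and the edited file
still has to be distributed to the nodes by whatever means you normally
use.

```python
import xml.etree.ElementTree as ET

# Hypothetical, stripped-down cluster.conf used only for illustration.
SAMPLE_CONF = """<cluster name="example" config_version="1">
  <fence_daemon post_fail_delay="0" post_join_delay="3"/>
  <clusternodes/>
</cluster>"""


def set_fence_daemon(conf_xml, clean_start, post_fail_delay, post_join_delay):
    """Return conf_xml with the fence_daemon attributes set and
    config_version incremented by one."""
    root = ET.fromstring(conf_xml)
    fd = root.find("fence_daemon")
    if fd is None:
        # Create the element as the first child if it is missing.
        fd = ET.Element("fence_daemon")
        root.insert(0, fd)
    fd.set("clean_start", str(clean_start))
    fd.set("post_fail_delay", str(post_fail_delay))
    fd.set("post_join_delay", str(post_join_delay))
    # An edited cluster.conf is normally given a higher config_version.
    current = int(root.get("config_version", "0"))
    root.set("config_version", str(current + 1))
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(set_fence_daemon(SAMPLE_CONF, 1, 5, 15))
```

The function works on a string rather than on /etc/cluster/cluster.conf
directly, so the result can be inspected before anything on disk changes.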
On 30/04/2009, at 8:21 AM, Flavio Junior wrote:
Hi folks,
I've been trying to set up a 4-node RHCS+GFS cluster for a while. I
have another 2-node cluster running CentOS 5.3 without problems.
Well, my scenario is as follows:
* System configuration and info: http://pastebin.com/f41d63624
* Network: http://www.uploadimagens.com/upload/2ac9074fbb10c2479c59abe419880dc8.jpg
* Switches in the loop are 3Com 2924 (or 2948)-SFP
* Have STP enabled (RSTP auto)
* IGMP Snooping Disabled as:
comment 32
* Yellow lines are a single-mode fiber link, 990 ft (330 m)
* I'm using a dedicated tagged VLAN for cluster-heartbeat
* I'm using 2 NICs with bonding mode=1 (active/backup) for
heartbeat and 4 NICs for "public"
* Every node has its four public cables plugged into the same switch,
with link aggregation on it
* Looking at the picture, the 2 switches with the fiber link below are
where the nodes are plugged in, 2 nodes in each building.
SAN: http://img139.imageshack.us/img139/642/clusters.jpg
* Switches: Brocade TotalStorage 16SAN-B
* Storages: IBM DS4700 72A (using ERM for sync replication (storage
My problem is:
I can't get the 4 nodes up. Every time the fourth (sometimes even the
third) node comes online, one or two of them get fenced. I very often
see openais/cman messages about cpg_mcast_joined:
--- snipped ---
Apr 29 16:08:23 athos groupd[5393]: cpg_mcast_joined retry 1098900
Apr 29 16:08:23 athos groupd[5393]: cpg_mcast_joined retry 1099000
--- snipped ---
It is really rare that I can get a node to boot up and join the fence
domain. Almost every time it hangs and I need to reboot and try again,
or else reboot, enter single-user mode, disable cman, reboot, and keep
trying "service cman start/stop". Sometimes the other nodes can see the
node in the domain, but its boot keeps hanging on "Starting fenced..."
[root@athos ~]# cman_tool services
type             level name     id       state
fence            0     default  00010001 none
[1 3 4]
dlm              1     clvmd    00020001 none
[1 3 4]
[root@athos ~]# cman_tool nodes -f
Node  Sts   Inc   Joined               Name
   0   M      0   2009-04-29 15:16:47
   1   M   7556   2009-04-29 15:16:35  athos-priv
       Last fenced:   2009-04-29 15:13:49 by athos-ipmi
   2   X   7820                        porthos-priv
       Last fenced:   2009-04-29 15:31:01 by porthos-ipmi
       Node has not been fenced since it went down
   3   M   7696   2009-04-29 15:27:15  aramis-priv
       Last fenced:   2009-04-29 15:24:17 by aramis-ipmi
   4   M   8232   2009-04-29 16:12:34  dartagnan-priv
       Last fenced:   2009-04-29 16:09:53 by dartagnan-ipmi
[root@athos ~]# ssh root@aramis-priv
ssh: connect to host aramis-priv port 22: Connection refused
[root@athos ~]# ssh root@dartagnan-priv
ssh: connect to host dartagnan-priv port 22: Connection refused
[root@athos ~]#
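
The "cman_tool nodes -f" output above can be checked mechanically for
nodes that are down and still waiting to be fenced. The following is a
minimal Python sketch, not part of any cluster tooling: the function
name parse_nodes and the dictionary keys are made up for illustration,
the sample text is copied from the session above, and status "X" is
assumed to mean a node that is not currently a member.

```python
import re

# Output copied from the session above; column spacing is not significant.
SAMPLE = """\
Node Sts Inc Joined Name
0 M 0 2009-04-29 15:16:47
1 M 7556 2009-04-29 15:16:35 athos-priv
Last fenced: 2009-04-29 15:13:49 by athos-ipmi
2 X 7820 porthos-priv
Last fenced: 2009-04-29 15:31:01 by porthos-ipmi
Node has not been fenced since it went down
3 M 7696 2009-04-29 15:27:15 aramis-priv
Last fenced: 2009-04-29 15:24:17 by aramis-ipmi
4 M 8232 2009-04-29 16:12:34 dartagnan-priv
Last fenced: 2009-04-29 16:09:53 by dartagnan-ipmi
"""

# A node line: id, one-character status, incarnation, then the rest.
NODE_RE = re.compile(r"^\s*(\d+)\s+(\S)\s+(\d+)\s*(.*)$")
# The rest may start with a join timestamp, followed by the node name.
JOINED_RE = re.compile(r"^(\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2})\s*(.*)$")


def parse_nodes(text):
    """Parse 'cman_tool nodes -f' style text into a list of dicts."""
    nodes = []
    for line in text.splitlines():
        m = NODE_RE.match(line)
        if m:
            rest = m.group(4).strip()
            j = JOINED_RE.match(rest)
            joined, name = (j.group(1), j.group(2).strip()) if j else (None, rest)
            nodes.append({
                "id": int(m.group(1)),
                "status": m.group(2),
                "joined": joined,
                "name": name,
                "last_fenced": None,
                "unfenced_since_down": False,
            })
        elif nodes and line.strip().startswith("Last fenced:"):
            # Detail lines belong to the most recent node line.
            nodes[-1]["last_fenced"] = line.split(":", 1)[1].strip()
        elif nodes and "has not been fenced" in line:
            nodes[-1]["unfenced_since_down"] = True
    return nodes


if __name__ == "__main__":
    for node in parse_nodes(SAMPLE):
        # Assumption: "X" marks a node that is not a cluster member.
        if node["status"] == "X" and node["unfenced_since_down"]:
            print("down and not yet fenced:", node["name"])
```

Run periodically against live output, a check like this would show
whether the same node keeps ending up down and unfenced while the others
wait on it.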
(I know ssh is an unreliable test, but I can see on the console screen
that the nodes are hung. I'm just trying to show it.)
The BIG log file: http://pastebin.com/f453c220
Every entry in this log after 16:54 is from when node2 (porthos-priv)
was booting and hung on "Starting fenced..."
I have no more ideas for solving this problem, so any hints are
appreciated. If you need any other info, just tell me how to get it
and I'll post it as soon as I read your message.
Many thanks in advance.
Flávio do Carmo Júnior aka waKKu
Linux-cluster mailing list
Abraham Alawi
Unix/Linux Systems Administrator
Science IT
University of Auckland
e: a.alawi@xxxxxxxxxxxxxx
p: +64-9-373 7599, ext#: 87572