Hi folks,

I've been trying to set up a 4-node RHCS+GFS cluster for a while. I have
another 2-node cluster running CentOS 5.3 without problems.

My scenario is as follows:

* System configuration and info: http://pastebin.com/f41d63624

* Network: http://www.uploadimagens.com/upload/2ac9074fbb10c2479c59abe419880dc8.jpg
  * Switches on the loop are 3Com 2924 (or 2948)-SFP
  * STP is enabled (RSTP auto)
  * IGMP snooping is disabled, as suggested in comment 32 of:
    http://magazine.redhat.com/2007/08/23/automated-failover-and-recovery-of-virtualized-guests-in-advanced-platform/
  * Yellow lines are a 990 ft (330 m) single-mode fiber link
  * I'm using a dedicated tagged VLAN for the cluster heartbeat
  * I'm using 2 NICs with bonding mode=1 (active/backup) for heartbeat
    and 4 NICs for "public"
  * Every node has its four public cables plugged into the same switch,
    with link aggregation configured on it
  * In the picture, the 2 switches joined by the lower fiber link are
    where the nodes are plugged in, 2 nodes in each building.

* SAN: http://img139.imageshack.us/img139/642/clusters.jpg
  * Switches: Brocade TotalStorage 16SAN-B
  * Storage: IBM DS4700 72A (using ERM for synchronous replication at
    the storage level)

My problem is: I can't get the 4 nodes up. Every time the fourth
(sometimes even the third) node comes online, one or two of the nodes
get fenced. I very often get openais/cman messages about
cpg_mcast_joined:

--- snipped ---
Apr 29 16:08:23 athos groupd[5393]: cpg_mcast_joined retry 1098900
Apr 29 16:08:23 athos groupd[5393]: cpg_mcast_joined retry 1099000
--- snipped ---

Only rarely can I get a node to boot up and join the fence domain.
Almost every time it hangs, and I need to either reboot and try again,
or reboot, enter single-user mode, disable cman, reboot again, and keep
trying service cman start/stop. Sometimes the other nodes can see the
node in the domain, but its boot still hangs at "Starting fenced...".

########
[root@athos ~]# cman_tool services
type             level name     id       state
fence            0     default  00010001 none
[1 3 4]
dlm              1     clvmd    00020001 none
[1 3 4]

[root@athos ~]# cman_tool nodes -f
Node  Sts   Inc   Joined               Name
   0   M      0   2009-04-29 15:16:47  /dev/disk/by-id/scsi-3600a0b800048834e000014fb49dcc47b
   1   M   7556   2009-04-29 15:16:35  athos-priv
       Last fenced:   2009-04-29 15:13:49 by athos-ipmi
   2   X   7820                        porthos-priv
       Last fenced:   2009-04-29 15:31:01 by porthos-ipmi
       Node has not been fenced since it went down
   3   M   7696   2009-04-29 15:27:15  aramis-priv
       Last fenced:   2009-04-29 15:24:17 by aramis-ipmi
   4   M   8232   2009-04-29 16:12:34  dartagnan-priv
       Last fenced:   2009-04-29 16:09:53 by dartagnan-ipmi

[root@athos ~]# ssh root@aramis-priv
ssh: connect to host aramis-priv port 22: Connection refused
[root@athos ~]# ssh root@dartagnan-priv
ssh: connect to host dartagnan-priv port 22: Connection refused
[root@athos ~]#
#########

(I know ssh is not a reliable test, but I can also see that the console
screens are hung. I'm just using it to illustrate the state.)

The BIG log file: http://pastebin.com/f453c220

Every entry in this log after 16:54 is from when node 2 (porthos-priv,
172.16.1.2) was booting and hung at "Starting fenced...".

I have run out of ideas for solving this problem, so any hints are
appreciated. If you need any other info, just tell me how to get it and
I'll post it as soon as I read your message.

Many thanks in advance.

--
Flávio do Carmo Júnior aka waKKu

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster