RHCS 4-node cluster: Networking/Membership issues

Flavio Junior <billpp@xxxxxxxxx> · Wed, 29 Apr 2009 17:21:56 -0300

Hi folks,

I've been trying to set up a 4-node RHCS+GFS cluster for a while. I
have another 2-node cluster running CentOS 5.3 without problems.

My scenario is as follows:

* System configuration and info: http://pastebin.com/f41d63624

* Network: http://www.uploadimagens.com/upload/2ac9074fbb10c2479c59abe419880dc8.jpg
  * Switches in the loop are 3Com 2924 (or 2948)-SFP
  * STP is enabled (RSTP, auto)
  * IGMP snooping is disabled, as suggested in comment 32 of:
http://magazine.redhat.com/2007/08/23/automated-failover-and-recovery-of-virtualized-guests-in-advanced-platform/
  * Yellow lines are a 990ft (330 m) single-mode fiber link
  * I'm using a dedicated tagged VLAN for the cluster heartbeat
  * I'm using 2 NICs in bonding mode=1 (active-backup) for the
heartbeat and 4 NICs for the "public" network
  * Each node has its four public cables plugged into the same switch,
with link aggregation configured on it
  * In the picture, the two switches joined by the lower fiber link are
where the nodes are plugged in, two nodes per building.
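
Since openais membership depends on multicast crossing the heartbeat VLAN and both switches, a small multicast probe might help to rule the network in or out. The following is only a minimal sketch under assumptions: the group 239.192.0.99 and port 50000 are hypothetical test values (deliberately not the ones the cluster itself uses), and the local IP must be the node's own address on the heartbeat VLAN.

```python
import socket
import struct
import sys

# Hypothetical test values, chosen so they do not collide with cluster traffic.
TEST_GROUP = "239.192.0.99"
TEST_PORT = 50000


def membership_request(group, local_ip):
    """Build the ip_mreq structure (group address + local interface address)."""
    return struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton(local_ip))


def listen(group, port, local_ip):
    """Join the group on the interface owning local_ip and print every datagram."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    membership_request(group, local_ip))
    while True:
        data, sender = sock.recvfrom(1024)
        print("received %r from %s" % (data, sender[0]))


def send(group, port, local_ip, message):
    """Send one datagram to the group through the interface owning local_ip."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_IF,
                    socket.inet_aton(local_ip))
    # TTL 1 keeps the probe on the local layer-2 segment (the heartbeat VLAN).
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, struct.pack("b", 1))
    sock.sendto(message, (group, port))
    sock.close()


# Usage: <script> listen|send <local-ip-on-heartbeat-vlan>
if __name__ == "__main__" and len(sys.argv) == 3:
    mode, local_ip = sys.argv[1], sys.argv[2]
    if mode == "listen":
        listen(TEST_GROUP, TEST_PORT, local_ip)
    elif mode == "send":
        send(TEST_GROUP, TEST_PORT, local_ip, b"probe from " + local_ip.encode())
```

Running it in listen mode on three nodes and in send mode on the fourth (for example with 172.16.1.2 on porthos-priv) would show whether every node receives the probe. A node that never prints anything would point at the switches or the VLAN rather than at cman.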

SAN: http://img139.imageshack.us/img139/642/clusters.jpg
  * Switches: Brocade TotalStorage 16SAN-B
  * Storage: IBM DS4700 72A (using ERM for synchronous replication at the storage level)

My problem is:

I can't get all 4 nodes up. Every time the fourth (sometimes even the
third) node comes online, one or two nodes get fenced. I also keep
seeing openais/cman messages about cpg_mcast_joined very often:
--- snipped ---
Apr 29 16:08:23 athos groupd[5393]: cpg_mcast_joined retry 1098900
Apr 29 16:08:23 athos groupd[5393]: cpg_mcast_joined retry 1099000
--- snipped ---
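
To get a feel for how bad the retries are, the messages can be counted per daemon. This is a minimal sketch that assumes the syslog line format shown in the snippet above; the function names are made up for illustration.

```python
import re

# Matches lines such as:
# Apr 29 16:08:23 athos groupd[5393]: cpg_mcast_joined retry 1098900
RETRY_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+"
    r"(?P<daemon>\w+)\[(?P<pid>\d+)\]:\s+cpg_mcast_joined retry (?P<count>\d+)"
)


def retry_events(lines):
    """Return (timestamp, host, daemon, retry_count) for every retry message."""
    events = []
    for line in lines:
        match = RETRY_RE.match(line)
        if match:
            events.append((match.group("stamp"), match.group("host"),
                           match.group("daemon"), int(match.group("count"))))
    return events


def highest_retry_per_daemon(events):
    """Map (host, daemon) to the largest retry counter seen for it."""
    highest = {}
    for _stamp, host, daemon, count in events:
        key = (host, daemon)
        highest[key] = max(count, highest.get(key, 0))
    return highest


if __name__ == "__main__":
    sample = [
        "Apr 29 16:08:23 athos groupd[5393]: cpg_mcast_joined retry 1098900",
        "Apr 29 16:08:23 athos groupd[5393]: cpg_mcast_joined retry 1099000",
    ]
    print(highest_retry_per_daemon(retry_events(sample)))
```

Feeding it the lines of /var/log/messages instead of the two sample lines would show which daemon is stuck retrying and how far its counter has climbed.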

Only rarely can I get a node to boot up and join the fence domain.
Almost every time it hangs, and I need to reboot and try again, or
else reboot into single-user mode, disable cman, reboot, and keep
trying service cman start/stop. Sometimes the other nodes can see the
node in the domain, but its boot still hangs on "Starting fenced..."

########
[root@athos ~]# cman_tool services
type             level name     id       state
fence            0     default  00010001 none
[1 3 4]
dlm              1     clvmd    00020001 none
[1 3 4]
[root@athos ~]# cman_tool nodes -f
Node  Sts   Inc   Joined               Name
   0   M      0   2009-04-29 15:16:47  /dev/disk/by-id/scsi-3600a0b800048834e000014fb49dcc47b
   1   M   7556   2009-04-29 15:16:35  athos-priv
       Last fenced:   2009-04-29 15:13:49 by athos-ipmi
   2   X   7820                        porthos-priv
       Last fenced:   2009-04-29 15:31:01 by porthos-ipmi
       Node has not been fenced since it went down
   3   M   7696   2009-04-29 15:27:15  aramis-priv
       Last fenced:   2009-04-29 15:24:17 by aramis-ipmi
   4   M   8232   2009-04-29 16:12:34  dartagnan-priv
       Last fenced:   2009-04-29 16:09:53 by dartagnan-ipmi
[root@athos ~]# ssh root@aramis-priv
ssh: connect to host aramis-priv port 22: Connection refused
[root@athos ~]# ssh root@dartagnan-priv
ssh: connect to host dartagnan-priv port 22: Connection refused
[root@athos ~]#
#########

(I know ssh is not a reliable test, but I can also see the console
screens hung. I'm just trying to illustrate it.)
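
To track membership while nodes come and go, the `cman_tool nodes -f` output can be parsed and the non-member nodes listed. This is only a sketch that assumes the column layout shown in the session above, including the case where a long node name wraps onto the next line; the field names in the dictionaries are invented for illustration.

```python
import re

STAMP = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"
# Node line: id, status, incarnation, optional join time, optional name.
NODE_RE = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<sts>\S)\s+(?P<inc>\d+)\s+"
    r"(?:(?P<joined>" + STAMP + r")\s*)?(?P<name>\S.*)?$"
)
FENCED_RE = re.compile(r"^\s*Last fenced:\s+(" + STAMP + r") by (\S+)")

SAMPLE = """\
Node  Sts   Inc   Joined               Name
   0   M      0   2009-04-29 15:16:47
/dev/disk/by-id/scsi-3600a0b800048834e000014fb49dcc47b
   1   M   7556   2009-04-29 15:16:35  athos-priv
       Last fenced:   2009-04-29 15:13:49 by athos-ipmi
   2   X   7820                        porthos-priv
       Last fenced:   2009-04-29 15:31:01 by porthos-ipmi
       Node has not been fenced since it went down
"""


def parse_nodes(text):
    """Turn `cman_tool nodes -f` output into a list of per-node dictionaries."""
    nodes = []
    for line in text.splitlines():
        match = NODE_RE.match(line)
        if match:
            nodes.append({
                "id": int(match.group("id")),
                "status": match.group("sts"),
                "joined": match.group("joined"),
                "name": match.group("name"),
                "last_fenced": None,
                "fenced_by": None,
            })
            continue
        if not nodes:
            continue  # header or prompt line before the first node
        fenced = FENCED_RE.match(line)
        if fenced:
            nodes[-1]["last_fenced"] = fenced.group(1)
            nodes[-1]["fenced_by"] = fenced.group(2)
        elif nodes[-1]["name"] is None and line.strip() and not line.startswith(" "):
            # A name that wrapped onto its own line (the quorum disk above).
            nodes[-1]["name"] = line.strip()
    return nodes


if __name__ == "__main__":
    for node in parse_nodes(SAMPLE):
        if node["status"] != "M":
            print("not a member:", node["name"], "last fenced", node["last_fenced"])
```

Running this periodically against fresh `cman_tool nodes -f` output while a node boots would give a timeline of who dropped out and when each node was last fenced.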

The BIG log file: http://pastebin.com/f453c220
Every entry in this log after 16:54 is from when node2 (porthos-priv,
172.16.1.2) was booting and hung on "Starting fenced..."

I have no more ideas on how to solve this problem; any hints are
appreciated. If you need any other info, just tell me how to get it
and I'll post it as soon as I read your message.

Many thanks in advance.

--

Flávio do Carmo Júnior aka waKKu

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster