Hi Abraham, thanks for your answer. I configured your suggestion in
cluster.conf but I still get the same problem. Here is what I did:

* Disable the cman init script on boot for all nodes
* Edit the config file and copy it to all nodes (the commands are
  sketched below, after your quoted suggestion)
* Reboot all nodes
* Start cman on node1 (OK)
* Start cman on node2 (OK)
* Start cman on node3 (problems becoming a member; node2 gets fenced)

Here is the log of this process up to the fence: http://pastebin.com/f477e7114

PS: node1 and node2 are on the same switch at site1; node3 and node4
are on the same switch at site2.

Thanks again, any other suggestions? I don't know if it would help,
but is corosync a feasible option for production use?

--

Flávio do Carmo Júnior aka waKKu

On Wed, Apr 29, 2009 at 10:19 PM, Abraham Alawi <a.alawi@xxxxxxxxxxxxxx> wrote:
> If not tried already, the following settings in cluster.conf might help,
> especially "clean_start":
>
> <fence_daemon clean_start="1" post_fail_delay="5" post_join_delay="15"/>
>
> clean_start --> assume the cluster is in a healthy state upon startup
> post_fail_delay --> seconds to wait before fencing a node that thinks it
> should be fenced (i.e. lost connection with)
> post_join_delay --> seconds to wait before fencing any node that should be
> fenced upon startup (right after joining)
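For reference, this is roughly where that line ended up in my
cluster.conf (just a minimal sketch; the cluster name, node names and
the fence device below are placeholders, not my real config):

  <?xml version="1.0"?>
  <cluster name="example" config_version="2">
    <!-- placeholder names; fence_daemon goes directly under <cluster> -->
    <fence_daemon clean_start="1" post_fail_delay="5" post_join_delay="15"/>
    <clusternodes>
      <clusternode name="node1-priv" nodeid="1" votes="1">
        <fence>
          <method name="1">
            <device name="node1-ipmi"/>
          </method>
        </fence>
      </clusternode>
      <!-- node2-priv .. node4-priv follow the same pattern -->
    </clusternodes>
    <fencedevices>
      <fencedevice agent="fence_ipmilan" name="node1-ipmi"
                   ipaddr="x.x.x.x" login="admin" passwd="secret"/>
    </fencedevices>
  </cluster>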
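And these are roughly the commands behind the steps I listed at the
top (also just a sketch; the node*-priv names stand in for my real
private hostnames, and I ran the start/check steps one node at a time
while watching /var/log/messages):

  # on every node: keep cman from starting automatically at boot
  chkconfig cman off

  # push the edited cluster.conf from node1 to the other nodes
  # (node*-priv are placeholders for the real private hostnames)
  scp /etc/cluster/cluster.conf node2-priv:/etc/cluster/
  scp /etc/cluster/cluster.conf node3-priv:/etc/cluster/
  scp /etc/cluster/cluster.conf node4-priv:/etc/cluster/

  # reboot everything, then on node1, node2, node3 in turn:
  service cman start
  cman_tool nodes      # membership as seen by this node
  cman_tool status     # quorum, votes, cluster generation
  cman_tool services   # fence domain / dlm state

The fence of node2 happened during the third "service cman start",
while node3 was still trying to become a member.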
> On 30/04/2009, at 8:21 AM, Flavio Junior wrote:
>
>> Hi folks,
>>
>> I've been trying to set up a 4-node RHCS+GFS cluster for a while. I have
>> another 2-node cluster running CentOS 5.3 without problems.
>>
>> Well... my scenario is as follows:
>>
>> * System configuration and info: http://pastebin.com/f41d63624
>>
>> * Network:
>>   http://www.uploadimagens.com/upload/2ac9074fbb10c2479c59abe419880dc8.jpg
>>   * The switches in the loop are 3Com 2924 (or 2948)-SFP
>>   * STP is enabled (RSTP auto)
>>   * IGMP snooping is disabled, as suggested in comment 32 of:
>>     http://magazine.redhat.com/2007/08/23/automated-failover-and-recovery-of-virtualized-guests-in-advanced-platform/
>>   * The yellow lines are a 990 ft (330 m) single-mode fiber link
>>   * I'm using a dedicated tagged VLAN for the cluster heartbeat
>>   * I'm using 2 NICs with bonding mode=1 (active/backup) for the
>>     heartbeat and 4 NICs for the "public" network
>>   * Every node has its four public cables plugged into the same switch,
>>     with link aggregation configured on it
>>   * Looking at the picture, the nodes are plugged into the two switches
>>     joined by the lower fiber link, two nodes in each building.
>>
>> * SAN: http://img139.imageshack.us/img139/642/clusters.jpg
>>   * Switches: Brocade TotalStorage 16SAN-B
>>   * Storage: IBM DS4700 72A (using ERM for synchronous replication at
>>     the storage level)
>>
>> My problem is:
>>
>> I can't get the 4 nodes up. Every time the fourth (sometimes even the
>> third) node comes online, one or two of them get fenced. I keep getting
>> messages about openais/cman cpg_mcast_joined very often:
>>
>> --- snipped ---
>> Apr 29 16:08:23 athos groupd[5393]: cpg_mcast_joined retry 1098900
>> Apr 29 16:08:23 athos groupd[5393]: cpg_mcast_joined retry 1099000
>> --- snipped ---
>>
>> It is really seldom that I can get a node to boot up and join the
>> fence domain; almost every time it hangs and I need to reboot and try
>> again, or else reboot, enter single-user mode, disable cman, reboot, and
>> keep trying service cman start/stop. Sometimes the other nodes can see
>> the node in the domain, but its boot stays hung on "Starting fenced...":
>>
>> ########
>> [root@athos ~]# cman_tool services
>> type    level name     id       state
>> fence   0     default  00010001 none
>> [1 3 4]
>> dlm     1     clvmd    00020001 none
>> [1 3 4]
>> [root@athos ~]# cman_tool nodes -f
>> Node  Sts   Inc   Joined               Name
>>    0   M      0   2009-04-29 15:16:47  /dev/disk/by-id/scsi-3600a0b800048834e000014fb49dcc47b
>>    1   M   7556   2009-04-29 15:16:35  athos-priv
>>        Last fenced:   2009-04-29 15:13:49 by athos-ipmi
>>    2   X   7820                        porthos-priv
>>        Last fenced:   2009-04-29 15:31:01 by porthos-ipmi
>>        Node has not been fenced since it went down
>>    3   M   7696   2009-04-29 15:27:15  aramis-priv
>>        Last fenced:   2009-04-29 15:24:17 by aramis-ipmi
>>    4   M   8232   2009-04-29 16:12:34  dartagnan-priv
>>        Last fenced:   2009-04-29 16:09:53 by dartagnan-ipmi
>> [root@athos ~]# ssh root@aramis-priv
>> ssh: connect to host aramis-priv port 22: Connection refused
>> [root@athos ~]# ssh root@dartagnan-priv
>> ssh: connect to host dartagnan-priv port 22: Connection refused
>> [root@athos ~]#
>> #########
>>
>> (I know how unreliable ssh is as a test, but I'm also looking at the
>> hung console screens; I'm just trying to show it.)
>>
>> The BIG log file: http://pastebin.com/f453c220
>> Every entry in this log after 16:54h is from when node2 (porthos-priv,
>> 172.16.1.2) was booting and hung on "Starting fenced...".
>>
>> I have no more ideas for solving this problem; any hints are
>> appreciated. If you need any other info, just tell me how to get it
>> and I'll post it right after I read your message.
>>
>> Thanks very much, in advance.
>>
>> --
>>
>> Flávio do Carmo Júnior aka waKKu
>>
>> --
>> Linux-cluster mailing list
>> Linux-cluster@xxxxxxxxxx
>> https://www.redhat.com/mailman/listinfo/linux-cluster
>
> ''''''''''''''''''''''''''''''''''''''''''''''''''''''
> Abraham Alawi
>
> Unix/Linux Systems Administrator
> Science IT
> University of Auckland
> e: a.alawi@xxxxxxxxxxxxxx
> p: +64-9-373 7599, ext#: 87572
>
> ''''''''''''''''''''''''''''''''''''''''''''''''''''''
>

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster