Re: Cluster is failed.

שלום קלמר <sklemer@xxxxxxxxx> · Mon, 8 Mar 2010 13:37:22 +0200

Hi.

You have some errors in your cluster.conf file.

1. You must check that fence_rsa works before starting the cluster.
2. If you are using quorumd, change cman to: <cman expected_votes="3" two_node="0"/>
3. Set quorumd votes="1" and min_score="1".
4. Change your heuristic program to something like a ping to your router (it's better to add more heuristics).
5. Install the most up-to-date RPMs of cman, openais and rgmanager.
6. clustat should show the qdisk as online, and cman should start qdiskd.
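
Putting points 2 to 4 together, the relevant part of cluster.conf might look roughly like this. It is only a sketch: 172.28.104.1 is a hypothetical router address (use your real gateway), the ping options -c1 (one packet) and -w1 (one-second deadline) are my assumption, and device, label, interval and tko are copied unchanged from your config. Remember to increment config_version whenever you edit the file.

```xml
<!-- Sketch only: quorum disk and cman settings following points 2-4. -->
<!-- Votes: node1 (1) + node2 (1) + qdisk (1) = expected_votes 3, quorum 2. -->
<quorumd device="/dev/vg1quorom/lv1quorom" interval="1" label="quorum"
         min_score="1" tko="10" votes="1">
        <!-- Hypothetical router address; replace with your real gateway. -->
        <heuristic interval="2" program="ping -c1 -w1 172.28.104.1" score="1"/>
</quorumd>
<cman expected_votes="3" two_node="0"/>
```

With one vote per node plus one qdisk vote, quorum is 2, so a single surviving node that still passes the heuristic keeps the cluster quorate.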

For more information, you can read:

http://sources.redhat.com/cluster/wiki/FAQ/CMAN (it helped me)

Regards

Shalom.

On Mon, Mar 8, 2010 at 12:44 PM, <mogruith@xxxxxxx> wrote:

Hi all

Here is my cluster.conf:

<?xml version="1.0"?>
<cluster config_version="6" name="TEST">
        <quorumd device="/dev/vg1quorom/lv1quorom" interval="1" label="quorum" min_score="3" tko="10" votes="3">
                <heuristic interval="2" program="/usr/sbin/qdiskd" score="1"/>
        </quorumd>
        <fence_daemon post_fail_delay="0" post_join_delay="3"/>
        <cman expected_votes="1" two_node="1"/>
        <clusternodes>
                <clusternode name="node1" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="RSA_node1"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="node2" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="RSA_node2"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman/>
        <fencedevices>
                <fencedevice agent="fence_rsa" ipaddr="RSA_node1" login="USER" name="RSA_node1" passwd="PASSWORD"/>
                <fencedevice agent="fence_rsa" ipaddr="RSA_node2" login="USER" name="RSA_node2" passwd="PASSWORD"/>
        </fencedevices>
        <rm>
                <failoverdomains>
                        <failoverdomain name="TEST" ordered="1" restricted="1">
                                <failoverdomainnode name="node1" priority="1"/>
                                <failoverdomainnode name="node2" priority="2"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <ip address="172.28.104.80" monitor_link="1"/>
                        <clusterfs device="/dev/vg1data/lv1data" force_unmount="0" fsid="30516" fstype="gfs2" mountpoint="/data" name="DATA" options=""/>
                </resources>
                <service autostart="1" domain="TEST" exclusive="1" name="TEST">
                        <ip ref="172.28.104.80">
                                <clusterfs ref="DATA"/>
                        </ip>
                </service>
        </rm>
</cluster>

N.B.: node1, node2, RSA_node1 and RSA_node2 are defined in /etc/hosts.

When I move the service from node1 to node2 (by forcing a reboot of node1), it fails, probably because of a network problem. But is there a timeout? If node2 can't connect to RSA_node1, why doesn't it consider node1 "dead", and why doesn't the service move to node2?

Here is the clustat output:

[root@node2 ~]# clustat
Cluster Status for TEST @ Mon Mar  8 11:33:32 2010
Member Status: Quorate

 Member Name                                                     ID   Status
 ------ ----                                                     ---- ------
 node1                                                            1 Offline
 node2                                                            2 Online, Local, rgmanager

 Service Name                   Owner (Last)                   State
 ------- ----                   ----- ------                   -----
 service:TEST                   node1                          stopping

It has been stuck in the "stopping" state like this for 30 minutes!

Here is the log:

Mar  8 11:35:45 node2 fenced[7038]: agent "fence_rsa" reports: Unable to connect/login to fencing device
Mar  8 11:35:45 node2 fenced[7038]: fence "node1" failed
Mar  8 11:35:50 node2 fenced[7038]: fencing node "node1"
Mar  8 11:35:56 node2 fenced[7038]: agent "fence_rsa" reports: Unable to connect/login to fencing device
Mar  8 11:35:56 node2 fenced[7038]: fence "node1" failed

Why is node2 still trying to fence node1?

Here is something else:

[root@node2 ~]# cman_tool services
type             level name       id       state
fence            0     default    00010001 FAIL_START_WAIT
[2]
dlm              1     rgmanager  00020001 FAIL_ALL_STOPPED
[1 2]

How can I verify that the quorum disk is being used?

Last question: I have 3 networks (6 NICs, 3 bonds), one of which is dedicated to the heartbeat. Where do I have to set it in cluster.conf? I would like node1 and node2 to communicate over their own bond3.

Thanks for your help.

mog

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster