RE: cman + qdisk timeouts....

Hi,

 

I added a heuristic that checks network status, and it helps in network failure scenarios.
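
For reference, the heuristic script is along these lines (a minimal sketch; the target address is illustrative, in practice it should be a host that is only reachable while this node's network is healthy, e.g. the default gateway):

#!/bin/sh
# test_hb.sh - qdiskd heuristic: exit 0 while the network path is up,
# non-zero otherwise, so the heuristic score is only earned when
# 192.168.1.254 (example address) answers a ping.
ping -c 1 -w 2 192.168.1.254 >/dev/null 2>&1
exit $?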

 

However, I still face the same problem as soon as I stop the services in an orderly way on the node holding the qdisk master role, or reboot it.

 

If I execute in master qdisk node:

 

# service rgmanager stop

# service clvmd stop

# service qdiskd stop

# service cman stop

 

As Red Hat described, quorum is lost on the other node until it takes over the master role (a few seconds), and the services there are stopped.

 

I’m working around that by adding a sleep after stopping qdiskd, long enough for the other node to become master, before stopping cman.
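
In other words, the shutdown sequence on the master becomes something like this (the 30-second sleep is just a value comfortably longer than the master takeover time I observe):

service rgmanager stop
service clvmd stop
service qdiskd stop
sleep 30    # give the surviving node time to win the qdisk master election
service cman stop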

 

I understand this is a bug.

 

My cluster.conf file:

 

<?xml version="1.0"?>

<cluster alias="clueng" config_version="13" name="clueng">

        <fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="10"/>

        <clusternodes>

                <clusternode name="rmamseslab05" nodeid="1" votes="1">

                        <fence>

                                <method name="1">

                                        <device name="iLO_NODE1"/>

                                </method>

                                <method name="2">

                                        <device name="manual_fencing" nodename="rmamseslab05"/>

                                </method>

                        </fence>

                </clusternode>

                <clusternode name="rmamseslab07" nodeid="2" votes="1">

                        <fence>

                                <method name="1">

                                        <device name="iLO_NODE2"/>

                                </method>

                                <method name="2">

                                        <device name="manual_fencing" nodename="rmamseslab07"/>

                                </method>

                        </fence>

                </clusternode>

        </clusternodes>

        <cman/>

        <totem token="45000"/>

        <quorumd device="/dev/mapper/mpathquorump1" interval="5" status_file="/tmp/qdisk" tko="3" votes="1">

                <heuristic program="/usr/local/cmcluster/conf/admin/test_hb.sh" score="1" interval="3"/>

        </quorumd>

        <fencedevices>

                <fencedevice agent="fence_manual" name="manual_fencing"/>

                <fencedevice agent="fence_ilo" hostname="rbrmamseslab05" login="LANO" name="iLO_NODE1" passwd="**"/>

                <fencedevice agent="fence_ilo" hostname="rbrmamseslab07" login="LANO" name="iLO_NODE2" passwd="**"/>

        </fencedevices>

        <rm>

                <!-- Configuration of the resource group manager -->

                <failoverdomains>

                </failoverdomains>

                <service autostart="1" exclusive="0" max_restarts="1" name="pkg_test" recovery="restart" restart_expire_time="900">

                        <script file="/etc/cluster/pkg_test/startstop.sh" name="pkg_test"/>

                </service>

                <resources>

                    <nfsexport name="nfs_export"/>

                </resources>

        </rm>

</cluster>
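
For reference, with these values qdiskd should declare a node dead after roughly interval × tko = 5 s × 3 = 15 s, while the totem token is 45000 ms = 45 s, three times that.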

 

Best regards,

 

Alfredo

 

 


From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Juan Ramon Martin Blanco
Sent: Tuesday, July 07, 2009 12:21 PM
To: linux clustering
Subject: Re: cman + qdisk timeouts....

 

 

On Mon, Jun 15, 2009 at 4:17 PM, Moralejo, Alfredo <alfredo.moralejo@xxxxxxxxx> wrote:

Hi,

 

I’m having what I think is a timeout issue in my cluster.

 

I have a two-node cluster using qdisk. Every time the node that holds the qdisk master role goes down (because of a failure, or even from stopping qdiskd manually), packages on the healthy node are stopped for lack of quorum, since qdiskd becomes unresponsive until the second node becomes master and starts working properly. Once qdiskd is working again (usually 5-6 seconds), the packages are started again.

 

I’ve read the cluster manual section on the “CMAN membership timeout value” and I think this is the case. I’m using RHEL 5.3, and I believe the relevant parameter is the totem token, which I set much longer than needed:

 

<cluster alias="CLUSTER_ENG" config_version="75" name="CLUSTER_ENG">

        <totem token="50000"/>

 

        <quorumd device="/dev/mapper/mpathquorump1" interval="3" status_file="/tmp/qdisk" tko="3" votes="5" log_level="7" log_facility="local4"/>
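
With those values the qdisk timeout is roughly interval × tko = 3 s × 3 = 9 s, against a token of 50000 ms = 50 s.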

 

 

The totem token is much more than double the qdisk timeout, so I guess it should be enough, but every time qdisk dies on the master node I get the same result: services restarted on the healthy node:

 

Jun 15 16:11:33 rmamseslab07 qdiskd[14130]: <debug> Node 1 missed an update (2/3)

Jun 15 16:11:38 rmamseslab07 qdiskd[14130]: <debug> Node 1 missed an update (3/3)

Jun 15 16:11:43 rmamseslab07 qdiskd[14130]: <debug> Node 1 missed an update (4/3)

Jun 15 16:11:43 rmamseslab07 qdiskd[14130]: <debug> Node 1 DOWN

Jun 15 16:11:43 rmamseslab07 qdiskd[14130]: <debug> Making bid for master

Jun 15 16:11:44 rmamseslab07 clurgmgrd: [18510]: <info> Executing /etc/init.d/watchdog status

Jun 15 16:11:48 rmamseslab07 qdiskd[14130]: <debug> Node 1 missed an update (5/3)

Jun 15 16:11:53 rmamseslab07 qdiskd[14130]: <debug> Node 1 missed an update (6/3)

Jun 15 16:11:53 rmamseslab07 qdiskd[14130]: <info> Assuming master role

 

Message from syslogd@rmamseslab07 at Jun 15 16:11:53 ...

 clurgmgrd[18510]: <emerg> #1: Quorum Dissolved

Jun 15 16:11:53 rmamseslab07 openais[14087]: [CMAN ] lost contact with quorum device

Jun 15 16:11:53 rmamseslab07 openais[14087]: [CMAN ] quorum lost, blocking activity

Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: <debug> Membership Change Event

Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: <emerg> #1: Quorum Dissolved

Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: <debug> Emergency stop of service:Cluster_test_2

Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: <debug> Emergency stop of service:wdtcscript-rmamseslab05-ic

Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: <debug> Emergency stop of service:wdtcscript-rmamseslab07-ic

Jun 15 16:11:54 rmamseslab07 clurgmgrd[18510]: <debug> Emergency stop of service:Logical volume 1

Jun 15 16:11:58 rmamseslab07 qdiskd[14130]: <debug> Node 1 missed an update (7/3)

Jun 15 16:11:58 rmamseslab07 qdiskd[14130]: <notice> Writing eviction notice for node 1

Jun 15 16:11:58 rmamseslab07 qdiskd[14130]: <debug> Telling CMAN to kill the node

Jun 15 16:11:58 rmamseslab07 openais[14087]: [CMAN ] quorum regained, resuming activity
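
Reading the timestamps: the first logged miss is at 16:11:33 and the second node assumes the master role at 16:11:53, so the takeover takes about 20 seconds; CMAN reports quorum lost at 16:11:53 and regained at 16:11:58, but by then rgmanager has already emergency-stopped the services.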

 

I’ve just logged a support case, but… any ideas?

 

Regards,

Hi!

Have you set two_node="0" in the cman section?
Why don't you use any heuristics within the quorumd configuration, e.g. pinging a router?
Could you paste your cluster.conf?
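
For illustration, the combination I have in mind looks roughly like this (the ping target and vote counts are made up; with one vote per node plus one qdisk vote, expected_votes comes to 3):

<cman two_node="0" expected_votes="3"/>
<quorumd device="/dev/mapper/mpathquorump1" interval="3" tko="3" votes="1">
        <heuristic program="ping -c1 -w1 192.168.1.254" score="1" interval="2"/>
</quorumd>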

Greetings,
Juanra
 

 

 

Alfredo Moralejo
Business Platforms Engineering - OS Servers - UNIX Senior Specialist

F. Hoffmann-La Roche Ltd.

Global Informatics Group Infrastructure
Josefa Valcárcel, 40
28027 Madrid SPAIN

Phone: +34 91 305 97 87

alfredo.moralejo@xxxxxxxxx

Confidentiality Note: This message is intended only for the use of the named recipient(s) and may contain confidential and/or proprietary information. If you are not the intended recipient, please contact the sender and delete this message. Any unauthorized use of the information contained in this message is prohibited. 

 


--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster

 

