cman + qdisk timeouts....

"Moralejo, Alfredo" <alfredo.moralejo@xxxxxxxxx> · Mon, 15 Jun 2009 16:17:55 +0200

Hi,

I’m having what I think is a timeout issue in my cluster.

I have a two-node cluster using qdisk. Every time the node that holds the
qdisk master role goes down (because of a failure, or even when I stop qdiskd
manually), packages on the healthy node are stopped for lack of quorum, because
qdiskd becomes unresponsive until the second node becomes master and starts
working properly. Once qdiskd is working again (usually after 5-6 seconds), the
packages are started again.

I’ve read the section on the “CMAN membership timeout value” in the cluster
manual, and I think this is the case here. I’m using RHEL 5.3, and as I
understand it this parameter is the totem token, which I have set much longer
than needed:

<cluster alias="CLUSTER_ENG" config_version="75" name="CLUSTER_ENG">

        <totem token="50000"/>

…

        <quorumd device="/dev/mapper/mpathquorump1" interval="3"
                 status_file="/tmp/qdisk" tko="3" votes="5"
                 log_level="7" log_facility="local4"/>
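
As a rough sketch of the arithmetic involved, the script below reads the values from the fragment above and compares them. It rests on assumptions that are not confirmed anywhere in this post: that the qdisk timeout is approximately interval * tko seconds, that the totem token is expressed in milliseconds, and that a token of at least twice the qdisk timeout is the target. The function name check_timeouts and the closing </cluster> tag are added only for illustration.

```python
import xml.etree.ElementTree as ET

# Fragment of the configuration shown above. The closing </cluster> tag is
# added here only so that the snippet parses on its own.
CLUSTER_CONF = """
<cluster alias="CLUSTER_ENG" config_version="75" name="CLUSTER_ENG">
    <totem token="50000"/>
    <quorumd device="/dev/mapper/mpathquorump1" interval="3"
             status_file="/tmp/qdisk" tko="3" votes="5"
             log_level="7" log_facility="local4"/>
</cluster>
"""


def check_timeouts(conf_xml, factor=2.0):
    """Compare the totem token with the estimated qdisk timeout.

    Assumptions (hypothetical, for illustration only):
      * qdisk timeout is roughly interval * tko, in seconds
      * totem token is given in milliseconds
      * the token should be at least `factor` times the qdisk timeout
    """
    root = ET.fromstring(conf_xml)
    token_ms = int(root.find("totem").get("token"))
    quorumd = root.find("quorumd")
    interval_s = int(quorumd.get("interval"))
    tko = int(quorumd.get("tko"))

    qdisk_timeout_ms = interval_s * tko * 1000
    return {
        "token_ms": token_ms,
        "qdisk_timeout_ms": qdisk_timeout_ms,
        "token_ok": token_ms >= factor * qdisk_timeout_ms,
    }


if __name__ == "__main__":
    print(check_timeouts(CLUSTER_CONF))
```

Under these assumptions the configured values give a qdisk timeout of about 9 seconds against a 50-second token, so the ratio alone does not point to a misconfiguration.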

The totem token is much more than double the qdisk timeout, so I guess it
should be enough, but every time qdisk dies on the master node I get the same
result, with services restarted on the healthy node:

Jun 15 16:11:33 rmamseslab07 qdiskd[14130]: <debug> Node 1 missed an update (2/3)
Jun 15 16:11:38 rmamseslab07 qdiskd[14130]: <debug> Node 1 missed an update (3/3)
Jun 15 16:11:43 rmamseslab07 qdiskd[14130]: <debug> Node 1 missed an update (4/3)
Jun 15 16:11:43 rmamseslab07 qdiskd[14130]: <debug> Node 1 DOWN
Jun 15 16:11:43 rmamseslab07 qdiskd[14130]: <debug> Making bid for master
Jun 15 16:11:44 rmamseslab07 clurgmgrd: [18510]: <info> Executing /etc/init.d/watchdog status
Jun 15 16:11:48 rmamseslab07 qdiskd[14130]: <debug> Node 1 missed an update (5/3)
Jun 15 16:11:53 rmamseslab07 qdiskd[14130]: <debug> Node 1 missed an update (6/3)
Jun 15 16:11:53 rmamseslab07 qdiskd[14130]: <info> Assuming master role
Message from syslogd@rmamseslab07 at Jun 15 16:11:53 ...
 clurgmgrd[18510]: <emerg> #1: Quorum Dissolved
Jun 15 16:11:53 rmamseslab07 openais[14087]: [CMAN ] lost contact with quorum device
Jun 15 16:11:53 rmamseslab07 openais[14087]: [CMAN ] quorum lost, blocking activity
Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: <debug> Membership Change Event
Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: <emerg> #1: Quorum Dissolved
Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: <debug> Emergency stop of service:Cluster_test_2
Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: <debug> Emergency stop of service:wdtcscript-rmamseslab05-ic
Jun 15 16:11:53 rmamseslab07 clurgmgrd[18510]: <debug> Emergency stop of service:wdtcscript-rmamseslab07-ic
Jun 15 16:11:54 rmamseslab07 clurgmgrd[18510]: <debug> Emergency stop of service:Logical volume 1
Jun 15 16:11:58 rmamseslab07 qdiskd[14130]: <debug> Node 1 missed an update (7/3)
Jun 15 16:11:58 rmamseslab07 qdiskd[14130]: <notice> Writing eviction notice for node 1
Jun 15 16:11:58 rmamseslab07 qdiskd[14130]: <debug> Telling CMAN to kill the node
Jun 15 16:11:58 rmamseslab07 openais[14087]: [CMAN ] quorum regained, resuming activity
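
To put a number on the outage, the window between the "quorum lost" and "quorum regained" messages can be measured straight from the syslog output. The following is only a sketch: the function name quorum_gap_seconds is made up for illustration, the year has to be supplied because syslog timestamps do not carry one, and month parsing assumes an English/C locale.

```python
from datetime import datetime

# Two lines taken from the log excerpt above.
LOG = """\
Jun 15 16:11:53 rmamseslab07 openais[14087]: [CMAN ] quorum lost, blocking activity
Jun 15 16:11:58 rmamseslab07 openais[14087]: [CMAN ] quorum regained, resuming activity
"""


def parse_stamp(line, year):
    # Classic syslog timestamps occupy the first 15 characters of the line,
    # e.g. "Jun 15 16:11:53". The year is an assumption supplied by the caller.
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def quorum_gap_seconds(log_text, year=2009):
    """Return seconds between the first 'quorum lost' and the next
    'quorum regained' message, or None if the pair is not found."""
    lost = None
    for line in log_text.splitlines():
        if "quorum lost" in line and lost is None:
            lost = parse_stamp(line, year)
        elif "quorum regained" in line and lost is not None:
            return (parse_stamp(line, year) - lost).total_seconds()
    return None


if __name__ == "__main__":
    print(quorum_gap_seconds(LOG))
```

For the excerpt above this gives a gap of 5 seconds, which matches the 5-6 seconds during which the packages stay stopped.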

I’ve just logged a support case, but does anyone have any idea?

Regards,

Alfredo Moralejo 

Business Platforms Engineering - OS Servers - UNIX Senior Specialist

F. Hoffmann-La Roche Ltd.

Global Informatics Group Infrastructure

Josefa Valcárcel, 40

28027 Madrid SPAIN

Phone: +34 91 305 97 87 

alfredo.moralejo@xxxxxxxxx


--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster