What does FAIL_STOP_WAIT state mean for clvmd and rgmanager

Can someone please explain what this means and what you can do to get out of it:

[root@cluster-host ~]# group_tool -v
type             level name       id       state node id local_done
fence            0     default    00010003 JOIN_STOP_WAIT 1 100050001 1
[1 1 2 3 4]
dlm              1     clvmd      00020003 FAIL_STOP_WAIT 2 200030003 1
[1 2 3 4]
dlm              1     rgmanager  00030003 FAIL_STOP_WAIT 2 200030003 1
[1 2 3 4]
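
To make the output above easier to watch over time, here is a minimal sketch that parses text in the layout shown and lists the groups whose state ends in `_WAIT`. The function names `parse_group_tool` and `stuck_groups` are hypothetical, and treating any `*_WAIT` state as "stuck" is an assumption on my part, not something the tool documents here.

```python
# Sample text copied from the group_tool -v output in this post.
SAMPLE = """\
type             level name       id       state node id local_done
fence            0     default    00010003 JOIN_STOP_WAIT 1 100050001 1
[1 1 2 3 4]
dlm              1     clvmd      00020003 FAIL_STOP_WAIT 2 200030003 1
[1 2 3 4]
dlm              1     rgmanager  00030003 FAIL_STOP_WAIT 2 200030003 1
[1 2 3 4]
"""


def parse_group_tool(text):
    """Parse group_tool -v style text into a list of dicts (hypothetical helper)."""
    groups = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("type"):
            continue  # skip blank lines and the header row
        if line.startswith("["):
            # A bracketed line lists the member node ids of the previous group.
            if groups:
                groups[-1]["members"] = [int(n) for n in line.strip("[]").split()]
            continue
        fields = line.split()
        if len(fields) < 5:
            continue  # not a group row in the layout assumed here
        groups.append({
            "type": fields[0],
            "level": int(fields[1]),
            "name": fields[2],
            "id": fields[3],
            "state": fields[4],
            "members": [],
        })
    return groups


def stuck_groups(groups):
    """Return groups whose state ends in _WAIT (assumption: these are stuck)."""
    return [g for g in groups if g["state"].endswith("_WAIT")]


if __name__ == "__main__":
    for g in stuck_groups(parse_group_tool(SAMPLE)):
        print(g["type"], g["name"], g["state"], g["members"])
```

In practice the text would come from running `group_tool -v` on a node rather than from the embedded sample.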

From reading:

http://www.mailinglistarchive.com/linux-cluster@xxxxxxxxxx/msg04478.html

and the associated page in Japanese my understanding is:

- one of the services, probably clvmd, has failed (I have no idea why, and I can't find anything in the logs that explains it)
- due to the nature of the services, once one hangs the rest sit in deadlock waiting for it to resume
- the only way to resolve this problem is to "xm destroy" the VM, set "chkconfig cman off" on all cluster nodes, reboot all nodes, and then start cman on all of them simultaneously. This fixes the problem, but it's fairly destructive and hackish
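
The workaround in the last point can be written down as an ordered plan. The sketch below only builds and prints the steps; it runs nothing. The VM and node names are hypothetical, `recovery_plan` is a made-up helper, and `service cman start` is my assumption for what "start cman" means on these hosts.

```python
def recovery_plan(hung_vm, nodes):
    """Build the ordered (host, command) steps of the workaround described above.

    Dry run only: nothing is executed. 'dom0' stands for the Xen host.
    """
    steps = [("dom0", f"xm destroy {hung_vm}")]
    # Keep cman from starting on its own at boot, on every node.
    steps += [(node, "chkconfig cman off") for node in nodes]
    # Reboot every node.
    steps += [(node, "reboot") for node in nodes]
    # Start cman on all nodes, as close to simultaneously as possible.
    # Assumption: 'service cman start' is how cman is started here.
    steps += [(node, "service cman start") for node in nodes]
    return steps


if __name__ == "__main__":
    # Hypothetical VM and node names, for illustration only.
    for host, command in recovery_plan("cluster-vm", ["node1", "node2"]):
        print(f"{host}: {command}")
```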

We are using:

CentOS 5.5, for the most part with the latest patches, and a RHEL 5.4 dom0 running Xen and fence_xvmd. That part is all working well. The clvmd disks are SAN disks on an EMC disk array.

Weirdly, I seem to have two versions of openais installed.

OpenAIS version : openais-0.80.6-16.el5_5.7.x86_64
CMAN version: cman-2.0.115-34.el5_5.3.x86_64
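
To check for duplicate installs, package strings in the form shown above can be split into name, version, release and architecture and then grouped by name. This is a rough sketch: `split_nvra` and `group_by_name` are hypothetical helpers, and it assumes every string ends in `.<arch>` as the two above do.

```python
from collections import defaultdict


def split_nvra(pkg):
    """Split 'name-version-release.arch' into its four parts.

    Assumes the string ends in '.<arch>', as in the listing above.
    """
    rest, arch = pkg.rsplit(".", 1)
    name, version, release = rest.rsplit("-", 2)
    return name, version, release, arch


def group_by_name(packages):
    """Map each package name to the (version, release, arch) tuples found."""
    found = defaultdict(list)
    for pkg in packages:
        name, version, release, arch = split_nvra(pkg)
        found[name].append((version, release, arch))
    return dict(found)


if __name__ == "__main__":
    # The two strings from this post; a real check would use the full
    # installed-package list instead.
    installed = [
        "openais-0.80.6-16.el5_5.7.x86_64",
        "cman-2.0.115-34.el5_5.3.x86_64",
    ]
    for name, builds in group_by_name(installed).items():
        note = "  <-- more than one build installed" if len(builds) > 1 else ""
        print(name, builds, note)
```

If a name shows up more than once, comparing the architecture field would show whether the entries are different builds of the same version or genuinely different versions.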

Thanks

Joel
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
