Dear all,
We have a RHEL 6.3 cluster of two nodes and a quorum disk.
We are testing the cluster against different failure scenarios, and we have
a problem when the shared storage is disconnected from one of the nodes.
The node that has lost contact with the storage is fenced, but when the
machine restarts, cman will not start up (it tries to start but then stops):
Jul 9 17:55:54 clnode1p kdump: started up
Jul 9 17:55:54 clnode1p kernel: bond0: no IPv6 routers present
Jul 9 17:55:54 clnode1p kernel: DLM (built Jun 13 2012 18:26:45) installed
Jul 9 17:55:55 clnode1p corosync[2514]: [MAIN ] Corosync Cluster
Engine ('1.4.1'): started and ready to provide service.
Jul 9 17:55:55 clnode1p corosync[2514]: [MAIN ] Corosync built-in
features: nss dbus rdma snmp
Jul 9 17:55:55 clnode1p corosync[2514]: [MAIN ] Successfully read
config from /etc/cluster/cluster.conf
Jul 9 17:55:55 clnode1p corosync[2514]: [MAIN ] Successfully parsed
cman config
Jul 9 17:55:55 clnode1p corosync[2514]: [TOTEM ] Initializing
transport (UDP/IP Multicast).
Jul 9 17:55:55 clnode1p corosync[2514]: [TOTEM ] Initializing
transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Jul 9 17:55:55 clnode1p corosync[2514]: [TOTEM ] The network
interface [172.16.255.1] is now up.
Jul 9 17:55:55 clnode1p corosync[2514]: [QUORUM] Using quorum
provider quorum_cman
Jul 9 17:55:55 clnode1p corosync[2514]: [SERV ] Service engine
loaded: corosync cluster quorum service v0.1
Jul 9 17:55:55 clnode1p corosync[2514]: [CMAN ] CMAN 3.0.12.1 (built
May 8 2012 12:22:26) started
Jul 9 17:55:55 clnode1p corosync[2514]: [SERV ] Service engine
loaded: corosync CMAN membership service 2.90
Jul 9 17:55:55 clnode1p corosync[2514]: [SERV ] Service engine
loaded: openais checkpoint service B.01.01
Jul 9 17:55:55 clnode1p corosync[2514]: [SERV ] Service engine
loaded: corosync extended virtual synchrony service
Jul 9 17:55:55 clnode1p corosync[2514]: [SERV ] Service engine
loaded: corosync configuration service
Jul 9 17:55:55 clnode1p corosync[2514]: [SERV ] Service engine
loaded: corosync cluster closed process group service v1.01
Jul 9 17:55:55 clnode1p corosync[2514]: [SERV ] Service engine
loaded: corosync cluster config database access v1.01
Jul 9 17:55:55 clnode1p corosync[2514]: [SERV ] Service engine
loaded: corosync profile loading service
Jul 9 17:55:55 clnode1p corosync[2514]: [QUORUM] Using quorum
provider quorum_cman
Jul 9 17:55:55 clnode1p corosync[2514]: [SERV ] Service engine
loaded: corosync cluster quorum service v0.1
Jul 9 17:55:55 clnode1p corosync[2514]: [MAIN ] Compatibility mode
set to whitetank. Using V1 and V2 of the synchronization engine.
Jul 9 17:55:55 clnode1p corosync[2514]: [TOTEM ] A processor joined
or left the membership and a new membership was formed.
Jul 9 17:55:55 clnode1p corosync[2514]: [QUORUM] Members[1]: 1
Jul 9 17:55:55 clnode1p corosync[2514]: [QUORUM] Members[1]: 1
Jul 9 17:55:55 clnode1p corosync[2514]: [CPG ] chosen downlist:
sender r(0) ip(172.16.255.1) ; members(old:0 left:0)
Jul 9 17:55:55 clnode1p corosync[2514]: [MAIN ] Completed service
synchronization, ready to provide service.
Jul 9 17:55:55 clnode1p corosync[2514]: [TOTEM ] A processor joined
or left the membership and a new membership was formed.
Jul 9 17:55:55 clnode1p corosync[2514]: [CMAN ] quorum regained,
resuming activity
Jul 9 17:55:55 clnode1p corosync[2514]: [QUORUM] This node is within
the primary component and will provide service.
Jul 9 17:55:55 clnode1p corosync[2514]: [QUORUM] Members[2]: 1 2
Jul 9 17:55:55 clnode1p corosync[2514]: [QUORUM] Members[2]: 1 2
Jul 9 17:55:55 clnode1p corosync[2514]: [CPG ] chosen downlist:
sender r(0) ip(172.16.255.1) ; members(old:1 left:0)
Jul 9 17:55:55 clnode1p corosync[2514]: [MAIN ] Completed service
synchronization, ready to provide service.
Jul 9 17:55:59 clnode1p kernel: bond1: no IPv6 routers present
Jul 9 17:55:59 clnode1p qdiskd[2564]: Loading dynamic configuration
Jul 9 17:55:59 clnode1p qdiskd[2564]: Setting votes to 1
Jul 9 17:55:59 clnode1p qdiskd[2564]: Loading static configuration
Jul 9 17:55:59 clnode1p qdiskd[2564]: Timings: 8 tko, 1 interval
Jul 9 17:55:59 clnode1p qdiskd[2564]: Timings: 2 tko_up, 4 master_wait,
2 upgrade_wait
Jul 9 17:55:59 clnode1p qdiskd[2564]: Heuristic: '/bin/ping -c1 -w1
clswitch1m' score=1 interval=2 tko=4
Jul 9 17:55:59 clnode1p qdiskd[2564]: Heuristic: '/bin/ping -c1 -w1
clswitch2m' score=1 interval=2 tko=4
Jul 9 17:55:59 clnode1p qdiskd[2564]: 2 heuristics loaded
Jul 9 17:55:59 clnode1p qdiskd[2564]: Quorum Daemon: 2 heuristics, 1
interval, 8 tko, 1 votes
Jul 9 17:55:59 clnode1p qdiskd[2564]: Run Flags: 00000271
Jul 9 17:55:59 clnode1p qdiskd[2564]: stat
Jul 9 17:55:59 clnode1p qdiskd[2564]: qdisk_validate: No such file or
directory
Jul 9 17:55:59 clnode1p qdiskd[2564]: Specified partition
/dev/mapper/apsto1-vd01-v001 does not have a qdisk label
Jul 9 17:56:01 clnode1p corosync[2514]: [SERV ] Unloading all
Corosync service engines.
Jul 9 17:56:01 clnode1p corosync[2514]: [SERV ] Service engine
unloaded: corosync extended virtual synchrony service
Jul 9 17:56:01 clnode1p corosync[2514]: [SERV ] Service engine
unloaded: corosync configuration service
Jul 9 17:56:01 clnode1p corosync[2514]: [SERV ] Service engine
unloaded: corosync cluster closed process group service v1.01
Jul 9 17:56:01 clnode1p corosync[2514]: [SERV ] Service engine
unloaded: corosync cluster config database access v1.01
Jul 9 17:56:01 clnode1p corosync[2514]: [SERV ] Service engine
unloaded: corosync profile loading service
Jul 9 17:56:01 clnode1p corosync[2514]: [SERV ] Service engine
unloaded: openais checkpoint service B.01.01
Jul 9 17:56:01 clnode1p corosync[2514]: [SERV ] Service engine
unloaded: corosync CMAN membership service 2.90
Jul 9 17:56:01 clnode1p corosync[2514]: [SERV ] Service engine
unloaded: corosync cluster quorum service v0.1
Jul 9 17:56:01 clnode1p corosync[2514]: [MAIN ] Corosync Cluster
Engine exiting with status 0 at main.c:1864.
And the node remains in this state even if the storage is reattached later
on, so at the moment I have only one functioning node.
What can be done to fix this (i.e., to get the cluster framework started again)?
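For what it's worth, here is what I would try for checking things after the
storage is reattached. This is only a rough sketch, assuming the mkqdisk
utility shipped with the cman package is available; the device path is the
one reported in the error message above:

```shell
# Check that the multipath device reported in the log is back after
# the storage is reattached (run as root).
ls -l /dev/mapper/apsto1-vd01-v001

# List all block devices that currently carry a qdisk label; the device
# above should appear here if its label is intact.
mkqdisk -L

# If the label is visible again, retry starting the cluster stack.
service cman start
```

If mkqdisk -L does not show the device even though it is reachable, the
label may genuinely be gone and would need to be investigated further
before qdiskd can use the partition again.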
Thank you,
Laszlo
--
Acceleris System Integration | and IT works
Laszlo Budai | Technical Consultant
Bvd. Barbu Vacarescu 80 | RO-020282 Bucuresti
t +40 21 23 11 538
laszlo.budai@xxxxxxxxxxxx | www.acceleris.ro
Acceleris Offices are in:
Basel | Bucharest | Zollikofen | Renens | Kloten
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster