Hi List,

I am currently testing Red Hat Cluster Suite for a number of two-node clusters accessing EMC storage systems. Everything seems to be running fine except for qdisk.

On Friday we had a network problem during which the nodes were still able to see each other, but could not reach any of the addresses used in my qdisk heuristics. The result was not what I expected: when the network came back, both nodes claimed to be master.

Below is the quorumd part of my cluster.conf:

<snip>
<quorumd interval="1" tko="10" votes="3" log_level="9" log_facility="local4" status_file="/qdisk_status" min_score="3" device="/dev/emcpowerk1">
    <heuristic program="ping 172.23.4.254 -c1 -t1" score="2" interval="2"/>
    <heuristic program="ping 130.246.8.13 -c1 -t3" score="1" interval="2"/>
    <heuristic program="ping 130.246.72.21 -c1 -t3" score="1" interval="2"/>
    <heuristic program="ping 172.23.5.120 -c1 -t1" score="2" interval="2"/>
</quorumd>
</snip>

/qdisk_status on one node while everything seems to be running fine:

<snip>
Node ID: 2
Score (current / min req. / max allowed): 6 / 3 / 6
Current state: Running
Current disk state: None
Visible Set: { 1 2 }
Master Node ID: 1
Quorate Set: { 1 2 }
</snip>

After a "/etc/init.d/qdiskd restart" I find the following in the log files (looks fine to me...):

Dec 18 10:50:40 duoserv2 qdiskd[24304]: <info> Quorum Daemon Initializing
Dec 18 10:50:40 duoserv2 qdiskd: Starting the Quorum Disk Daemon: succeeded
Dec 18 10:50:47 duoserv2 qdiskd[24304]: <info> Node 1 is the master
Dec 18 10:50:50 duoserv2 qdiskd[24304]: <info> Initial score 6/6
Dec 18 10:50:50 duoserv2 qdiskd[24304]: <info> Initialization complete

And finally during the network issue last week I found the following log entries:

Dec 15 09:53:48 duoserv2 qdiskd[31393]: <info> Node 1 shutdown
Dec 15 09:53:48 duoserv2 qdiskd[31393]: <notice> Score insufficient for master operation (0/3; max=6); downgrading
Dec 15 09:53:48 duoserv2 clurgmgrd[7950]: <emerg> #1: Quorum Dissolved
Dec 15 09:53:48 duoserv2 kernel: CMAN: quorum lost, blocking activity
Dec 15 09:53:48 duoserv2 ccsd[5595]: Cluster is not quorate. Refusing connection.
Dec 15 09:53:48 duoserv2 ccsd[5595]: Error while processing connect: Connection refused
Dec 15 09:53:48 duoserv2 ccsd[5595]: Invalid descriptor specified (-111).
Dec 15 09:53:48 duoserv2 ccsd[5595]: Someone may be attempting something evil.
Dec 15 09:53:48 duoserv2 ccsd[5595]: Error while processing get: Invalid request descriptor

And later when the network came back:

Dec 15 10:31:45 duoserv2 qdiskd[31393]: <notice> Score sufficient for master operation (6/3; max=6); upgrading
Dec 15 10:31:46 duoserv2 qdiskd[31393]: <info> Assuming master role
Dec 15 10:31:47 duoserv2 kernel: CMAN: quorum regained, resuming activity
Dec 15 10:31:47 duoserv2 clurgmgrd[7950]: <notice> Quorum Achieved
Dec 15 10:31:47 duoserv2 clurgmgrd[7950]: <info> Magma Event: Membership Change
Dec 15 10:31:47 duoserv2 clurgmgrd[7950]: <info> State change: Local UP
Dec 15 10:31:47 duoserv2 clurgmgrd[7950]: <info> State change: duoserv1 UP
Dec 15 10:31:47 duoserv2 clurgmgrd[7950]: <info> Loading Service Data
Dec 15 10:31:47 duoserv2 ccsd[5595]: Cluster is quorate. Allowing connections.
Dec 15 10:31:50 duoserv2 clurgmgrd: [7950]: <info> /dev/mapper/logs1-logs1 is not mounted
Dec 15 10:31:51 duoserv2 qdiskd[31393]: <crit> Critical Error: More than one master found!
Dec 15 10:31:51 duoserv2 qdiskd[31393]: <crit> A master exists, but it's not me?!
Dec 15 10:31:52 duoserv2 qdiskd[31393]: <info> Node 1 is the master
...

At the same time on the second node:

Dec 15 10:31:45 duoserv1 qdiskd[316]: <notice> Score sufficient for master operation (5/3; max=6); upgrading
Dec 15 10:31:46 duoserv1 qdiskd[316]: <info> Assuming master role
Dec 15 10:31:47 duoserv1 kernel: CMAN: quorum regained, resuming activity
Dec 15 10:31:47 duoserv1 ccsd[5624]: Cluster is quorate. Allowing connections.
Dec 15 10:31:47 duoserv1 clurgmgrd[3631]: <notice> Quorum Achieved
Dec 15 10:31:51 duoserv1 qdiskd[316]: <crit> Critical Error: More than one master found!
Dec 15 10:31:52 duoserv1 qdiskd[316]: <info> Node 2 is the master
Dec 15 10:31:52 duoserv1 qdiskd[316]: <crit> Critical Error: More than one master found!
...

This continued until I finally noticed and restarted qdiskd on both nodes, at which point they agreed on one master again.

I have the following packages installed on both nodes:

ccs-1.0.7-0
rgmanager-1.9.54-1
lvm2-cluster-2.02.01-1.2.RHEL4
cman-1.0.11-0
cman-kernel-smp-2.6.9-43.8.5
fence-1.32.25-1
cman-kernel-smp-2.6.9-45.8

The running kernel is: 2.6.9-42.0.3.ELsmp

Does anyone have any idea what I could do to avoid this situation in the future? If I can provide any more information, please ask.

Many thanks,
Frederik

-- 
Frederik Ferner
Systems Administrator           Phone: +44 (0)1235-778624
Diamond Light Source            Fax:   +44 (0)1235-778468
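
P.S. To make the score figures in the logs easier to follow, here is a minimal sketch of how I understand the heuristic arithmetic. It is not qdiskd code: the function and variable names are made up for illustration, and it assumes that the scores of passing heuristics are simply added up and compared against min_score with "greater than or equal". The scores and addresses are the ones from the <quorumd> block above.

```python
# Illustration only: names are hypothetical, and the "sum of passing
# heuristics >= min_score" rule is an assumption, not taken from qdiskd source.
HEURISTICS = {
    "172.23.4.254": 2,
    "130.246.8.13": 1,
    "130.246.72.21": 1,
    "172.23.5.120": 2,
}
MIN_SCORE = 3  # min_score attribute of <quorumd>


def qdisk_score(reachable):
    """Sum the scores of the heuristics whose ping target is reachable."""
    return sum(score for addr, score in HEURISTICS.items() if addr in reachable)


def sufficient(reachable):
    """True when the summed score reaches min_score (assumed rule)."""
    return qdisk_score(reachable) >= MIN_SCORE


all_targets = set(HEURISTICS)

# Outage: no heuristic target answers, matching "0/3; max=6" in the log.
print(qdisk_score(set()), sufficient(set()))

# Network back, all targets answer: "6/3; max=6" as seen on duoserv2.
print(qdisk_score(all_targets), sufficient(all_targets))

# One score-1 target still failing: "5/3; max=6" as seen on duoserv1.
# Which of the two score-1 targets was missing is a guess.
print(qdisk_score(all_targets - {"130.246.8.13"}))
```

Under these assumptions both nodes go from 0 to at least 3 in the same second once the network returns, which fits the two "Score sufficient for master operation" entries at 10:31:45.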