Hi,

I have a two-node cluster providing NFS. I'm using a small partition on the shared storage as a quorum disk, with a single heuristic that pings the default gateway on the network. Both nodes are connected to the network with bonded interfaces, but all of the heartbeat/cluster traffic runs over a crossover cable between the two. The hardware is a pair of Dell PowerEdge servers with an MD3000 array between them, and I'm using the DRAC interface as the fence device.

My cluster.conf looks like the following:

<cluster name="storage" config_version="46">
  <fence_daemon post_fail_delay="0" post_join_delay="3"/>
  <quorumd interval="1" tko="10" votes="1" device="/dev/mapper/md3000p1">
    <heuristic program="ping 192.168.30.254 -c3 -t2" score="1" interval="2" tko="3"/>
  </quorumd>
  <clusternodes>
    <clusternode name="node1-xover" votes="1" nodeid="1">
      <fence> ... </fence>
    </clusternode>
    <clusternode name="node2-xover" votes="1" nodeid="2">
      <fence> ... </fence>
    </clusternode>
  </clusternodes>
  <cman expected_votes="3"/>
  ...
</cluster>

(It's exactly as per the qdisk(5) man page example)

It's been up and running for ages with no trouble, but recently I had a problem where the default gateway, despite being an active/passive pair of Cisco ASA firewalls configured for failover, took at least 30 seconds to fail over when the primary device developed a problem. The heuristic failed for long enough that both nodes rebooted simultaneously, which caused a loss of service. All I can see in the logs is:

---8<---
Mar 5 07:38:50 node1 qdiskd[7967]: <info> Heuristic: 'ping 192.168.30.254 -c3 -t2' DOWN (3/3)
Mar 5 07:38:50 node1 qdiskd[7967]: <notice> Score insufficient for master operation (0/1; required=1); downgrading
Mar 5 07:38:50 node1 kernel: md: stopping all md devices.
Mar 5 07:38:51 node1 kernel: Synchronizing SCSI cache for disk sdd:
Mar 5 07:38:51 node1 kernel: Synchronizing SCSI cache for disk sdb:
Mar 5 07:38:51 node1 kernel: Synchronizing SCSI cache for disk sda:
Mar 5 07:38:51 node1 kernel: ACPI: PCI interrupt for device 0000:0a:00.0 disabled
Mar 5 07:38:51 node1 kernel: hub 1-1:1.0: cannot reset port 2 (err = -71)
Mar 5 07:42:34 node1 syslogd 1.4.1: restart.
---8<---

The gap between the last two messages is obviously where the node is rebooting. The timestamps in the logs from both nodes are identical, apart from a few seconds' difference on that last message.

I'm a bit unsure what actually did the rebooting in this case: was it qdiskd, or each node shooting the other?

Ideally I would like to prevent this situation from happening again. Is it a case of simply adding reboot="0" to the <quorumd> directive? Does this introduce any different problems?

Thanks

Matt
-- Linux-cluster mailing list Linux-cluster@xxxxxxxxxx https://www.redhat.com/mailman/listinfo/linux-cluster