On Thu, 2008-02-07 at 15:32 +0000, gordan@xxxxxxxxxx wrote:
> Hi,
>
> I've got a slightly peculiar problem. 2-node cluster acting as a load
> balanced fail-over router. 3 NICs: public, private, cluster.
> Cluster NICs are connected with a cross-over cable, the other two are
> on switches. The cluster NIC is only used for DRBD/GFS/DLM and
> associated things.
>
> The failure mode that I'm trying to account for is the one of the
> cluster NIC failing on one machine. On the public and private
> networks, both machines can still see everything (including each
> other). That means that a tie-breaker based on other visible things
> will not work.
>
> So, which machine gets fenced in the case of the cluster NIC failure
> (or more likely, if the x-over cable falls out)?

... whichever gets fenced first ;)

1. You can do a clever heuristic using qdiskd if you want, for example:

   * Assign an IP on the private cluster network and make rgmanager
     manage it as a service (even though it doesn't do anything). Make
     sure to *disable* monitor_link, or rgmanager will stop the
     service!

   * Make a script that checks:
     * the ethernet link of the private interface, and
     * if that fails, pings the service IP address;
     * if that fails too, we are *dead*; give up and -do not- try to
       fence.

If you make the IP part of the "most critical" service rgmanager is
running, the node that owns that service will be allowed to continue
running while the other node will not. Because the first check is
whether we have a cluster link - and the *second* check is the ping of
the service IP - a node whose private link is up always passes the
heuristic; when the link is down on both ends (e.g. the cross-over
cable fell out), only the node hosting the service IP can still ping
it, so the service owner wins the tie.

Something like this...

  <quorumd ...>
      <heuristic program="/usr/local/sbin/private-link-script" ... />
  </quorumd>

  <rm>
      ...
      <service name="pinger-ip" >
          <ip address="10.1.1.2" monitor_link="no"/>
      </service>
      ...
  </rm>

The script might look something like this:

  #!/bin/bash

  DEVICE=eth3
  PINGIP=10.1.1.2

  #
  # Ensure the device is there!
  #
  ethtool $DEVICE || exit 1

  #
  # Check for link
  #
  ethtool $DEVICE | grep -q "Link detected.*yes"
  if [ $? -eq 0 ]; then
          exit 0
  fi

  #
  # XXX Work around signal bug for now: run ping in the background and
  # poll it, so a wedged ping can't hang the heuristic forever.
  #
  ping_func()
  {
          declare retries=0
          declare PID

          # -w2 = overall 2 second deadline
          ping -c3 -w2 $1 &
          PID=`jobs -p`

          while [ $retries -lt 2 ]; do
                  sleep 1
                  ((retries++))
                  kill -0 $PID &> /dev/null
                  if [ $? -eq 1 ]; then
                          # ping has exited; report its status
                          wait $PID
                          return $?
                  fi
          done

          # ping is still running after ~2 seconds; kill it and fail
          kill -9 $PID
          return 1
  }

  #
  # No link; ping the service IP address.
  #
  ping_func $PINGIP
  exit $?

-------------------------------

The disadvantage is that it's hard to start the cluster without both
nodes online, short of some sort of override.

2. You can do something like Brian said, too - e.g. "if I am the right
host and the link isn't up, I win":

  #!/bin/bash

  DEVICE=eth3
  OTHER_NODE_PUBLIC_IP="192.168.1.2"

  #
  # Ensure the device is there!
  #
  ethtool $DEVICE || exit 1

  #
  # Check for link
  #
  ethtool $DEVICE | grep -q "Link detected.*yes"
  if [ $? -eq 0 ]; then
          exit 0
  fi

  #
  # XXX Work around signal bug for now (same helper as above).
  #
  ping_func()
  {
          declare retries=0
          declare PID

          ping -c3 -w2 $1 &
          PID=`jobs -p`

          while [ $retries -lt 2 ]; do
                  sleep 1
                  ((retries++))
                  kill -0 $PID &> /dev/null
                  if [ $? -eq 1 ]; then
                          wait $PID
                          return $?
                  fi
          done

          kill -9 $PID
          return 1
  }

  #
  # Ok, no link on private net
  #
  ping_func $OTHER_NODE_PUBLIC_IP
  if [ $? -eq 0 ]; then
          # Both nodes are up but the private link is gone:
          # the statically-designated node wins the tie.
          [ "`uname -n`" == "node1" ]
          exit $?
  fi

  #
  # Other node is down and we're not - we win.
  #
  exit 0
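As with the first script, this one would be plugged in as a qdiskd
heuristic. A minimal sketch, assuming the script is saved as
/usr/local/sbin/prefer-node1 (the path, score, and intervals here are
made up for illustration):

  <quorumd interval="1" tko="10" votes="1" label="qdisk">
      <heuristic program="/usr/local/sbin/prefer-node1"
                 score="1" interval="2"/>
  </quorumd>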
-------------------------------

3. Another simple way to do it is to use a fake "fencing agent" to
introduce a delay:

  <fencedevice agent="/bin/sleep-10" name="sleeper" .../>

(where /bin/sleep-10 is something like:

  #!/bin/sh
  sleep 10
  exit 0

)

Reference that agent as part of -one- node's fencing, and that node
will lose by default. This way, you don't have to set up qdiskd.

You could do the same thing by just editing the fencing agent directly
on that node, as well - in which case, you wouldn't have to edit
cluster.conf at all.

-- Lon

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
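For reference, hooking option 3's sleeper into one node's fencing in
cluster.conf might look roughly like the sketch below; the node name,
method name, and the real fence_apc device are hypothetical. Devices
within a method run in order, so the sleep fires before the real
agent does:

  <clusternode name="node2" nodeid="2">
      <fence>
          <method name="1">
              <device name="sleeper"/>
              <device name="apc" port="2"/>
          </method>
      </fence>
  </clusternode>
  ...
  <fencedevices>
      <fencedevice agent="/bin/sleep-10" name="sleeper"/>
      <fencedevice agent="fence_apc" name="apc" .../>
  </fencedevices>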