So, what was happening was this:

  1. unplug cable
  2. cman transitions
  3. fencing occurs
  4. qdiskd detects negative transition

Here's what we want on the "dead" node:

  1. unplug cable
  2. qdiskd detects negative transition from heuristic

Here's your configuration:

  <quorumd interval="1" label="Qdisk1" tko="5" votes="1">
    <heuristic interval="1" program="ping 10.200.10.1 -c1 -t1" score="1" tko="3"/>
  </quorumd>

First, let's ping the router with the cable unplugged to see how long it takes for our heuristic to complete when things are "broken". On my machine:

  [lhh@ayanami ~]$ time ping -c1 -t1 frederick
  PING frederick (12.1.2.99) 56(84) bytes of data.
  From ayanami (12.1.2.37) icmp_seq=1 Destination Host Unreachable

  --- frederick ping statistics ---
  1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms

  real    0m3.006s
  ^^^^^^^^^^^^^^^^
  user    0m0.000s
  sys     0m0.000s

Ok - so ping takes 3 seconds to "not find" a host if routing is wrong or the host is down. The heuristic then sleeps 1 second (its interval) and tries again, 3 times in all (tko!). If it fails all 3 times, qdisk removes the vote from CMAN. That means if the host is down, it will take qdisk about 3 * (3+1) = 12 seconds to kill its vote with CMAN.

[NOTE: keep in mind, it might not be 3 seconds for your configuration...]

CMAN's default failover time is 5 seconds (this is really openais's Totem protocol token timeout, if you want to be technical). 12 > 5, meaning qdiskd can't do much to help before CMAN takes action.

We need to flip these times so that CMAN times out *after* qdisk. This way, qdiskd can say "Ok! I'm dead!" - and either take action (reboot by default) or remove its vote from CMAN.

So, the practical rules for timings are basically like this, where x is the time the heuristic takes to fail, y is qdisk's timeout (interval * tko), and z is CMAN's timeout:

  * Heuristics should transition before QDisk.  x < y
  * Qdisk should transition before CMAN - in a little less than 1/2 the time, actually.  y * 2 < z

Option 1: Make 1 tko sufficient by making the heuristic do more work. In my quick testing, ping took the same 3 seconds to fail with 3 packets (-c3) as it did with 1.
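
If it helps to see the arithmetic spelled out, here's a rough sketch in Python. It is not part of qdiskd or CMAN, and the function names are made up for illustration. It assumes the heuristic's worst case is tko * (command time + interval), that qdisk's timeout is interval * tko, and that CMAN is still at the 5 second default mentioned above:

```python
def heuristic_fail_seconds(cmd_seconds, interval, tko):
    """Rough worst case for a heuristic to be declared dead: each of the
    tko rounds costs one command run plus one interval of sleep."""
    return tko * (cmd_seconds + interval)


def qdisk_timeout_seconds(interval, tko):
    """Assumed qdisk timeout: interval * tko, both from <quorumd>."""
    return interval * tko


def check_timings(heuristic_s, qdisk_s, token_ms):
    """Apply the two rules of thumb: x < y and y * 2 < z."""
    token_s = token_ms / 1000.0
    return {
        "heuristic_before_qdisk": heuristic_s < qdisk_s,
        "qdisk_before_cman": qdisk_s * 2 < token_s,
    }


# Original configuration: ping -c1 takes ~3s to fail, heuristic tko=3,
# quorumd interval=1 tko=5, CMAN at the 5 second (5000 ms) default.
original = check_timings(
    heuristic_fail_seconds(3, 1, 3), qdisk_timeout_seconds(1, 5), 5000
)

# Option 1 with only the heuristic changed (ping -c3, tko=1), and the
# totem token still left at the default.
heuristic_only = check_timings(
    heuristic_fail_seconds(3, 1, 1), qdisk_timeout_seconds(1, 5), 5000
)

print("original:      ", original)
print("heuristic only:", heuristic_only)
```

With these numbers the original configuration fails both checks, and changing only the heuristic fixes the first check but not the second. Plug in the ping time you actually measure on your own nodes; 3 seconds is just what I saw here.
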
We also still want CMAN to time out after qdisk - which it won't yet. So, we need to add a tag to cluster.conf that instructs totem to report a node as down after a period longer than qdisk's timeout (a little more than double, as noted above):

  ...
  <quorumd ...>
    <heuristic interval="1" program="ping 10.200.10.1 -c3 -t1" score="1" tko="1"/>
  </quorumd>
  <totem token="11000"/>
  ...

This says 11000 milliseconds, or 11 seconds, is required before totem (and therefore, CMAN) will declare a node dead. 2 * qdisk_timeout = 2 * 5 = 10; toss in a second for fun, and we get 11 seconds. Since the ping timeout for -c3 is 3 seconds and we have a tko of 1, it should take 3-4 seconds for the heuristic to report a failure.

  3 < 5
  5 * 2 < 11

Option 2: Make things fit around your heuristic. Given our 12 second "negative" case for our heuristic/tko, we can simply make qdisk time out in >12 seconds. Then, we double that and add a bit for CMAN:

  ...
  <quorumd interval="1" label="Qdisk1" tko="13" votes="1">
    <heuristic interval="1" program="ping 10.200.10.1 -c1 -t1" score="1" tko="3"/>
  </quorumd>
  <totem token="27000"/>
  ...

  12 < 13
  13 * 2 < 27

Let me know if this helps you, so I can add it to the Wiki and further clarify the manual pages. Either of these should get you up and working.

-- Lon