Hello,

I have a cluster on RHEL 5.7 with a quorum disk and a heuristic. Current versions of the main cluster packages are:

rgmanager-2.0.52-21.el5_7.1
cman-2.0.115-85.el5_7.3

This is the loaded heuristic:

Heuristic: 'ping -c1 -w1 10.4.5.250' score=1 interval=2 tko=200

Line in cluster.conf:

<heuristic interval="2" program="ping -c1 -w1 10.4.5.250" score="1" tko="200"/>

where 10.4.5.250 is the gateway of the production LAN.

From the ping man page:

-c count
    Stop after sending count ECHO_REQUEST packets. With deadline (-w) option, ping waits for count ECHO_REPLY packets, until the timeout expires.

-w deadline
    Specify a timeout, in seconds, before ping exits regardless of how many packets have been sent or received. In this case ping does not stop after count packet are sent, it waits either for deadline expire or until count probes are answered or for some error notification from network.

So I would expect the single ping command, executed as a sanity check, to exit with a code after at most 1 second, regardless of whether an echo reply has been received. And in fact I had no particular problem for many months.

As a test, using an IP on an unreachable LAN (say 10.4.6.5):

date
n=0
while [ $n -lt 20 ]
do
ping -c1 -w1 10.4.6.5
sleep 2
n=$(expr $n + 1)
done
date

Output is:

Fri Mar 9 11:59:02 CET 2012
PING 10.4.6.5 (10.4.6.5) 56(84) bytes of data.

--- 10.4.6.5 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1000ms

...

--- 10.4.6.5 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 999ms

Fri Mar 9 12:00:02 CET 2012

So 60 seconds in total, that is 20 iterations of roughly 1 second of ping plus 2 seconds of sleep.
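The test above only measures the total run time, so a single slow ping could hide among nineteen fast ones. As a minimal sketch (not part of the original test), each invocation could be timed on its own. The helper name `timed_run` is hypothetical, and `date +%s` is assumed to be available, which gives whole-second resolution only.

```shell
#!/bin/sh
# timed_run (hypothetical helper): run the given command, discard its
# output, then print its exit status and wall-clock duration in whole
# seconds. The command's exit status is also returned to the caller.
timed_run() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    rc=$?
    end=$(date +%s)
    echo "rc=$rc elapsed=$((end - start))s cmd=$*"
    return $rc
}

# Example call, using the same probe as the heuristic against the
# unreachable test address (commented out so that sourcing this file
# does not touch the network):
# timed_run ping -c1 -w1 10.4.6.5
```

Calling `timed_run ping -c1 -w1 10.4.6.5` in place of the bare ping inside the loop would print one line per probe. An elapsed value well above 1 second on any line would suggest that ping itself did not honour its deadline on that run.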
In case of gateway reachability problems (also tested with an iptables rule that drops outgoing ICMP echo requests) I would then get:

qdiskd[2780]: <debug> Heuristic: 'ping -c1 -w1 10.4.5.250' missed (1/200)

The strange thing is that last night I got only this line:

qdiskd[22145]: <info> Heuristic: 'ping -c1 -w1 10.4.5.250' DOWN - Exceeded timeout of 75 seconds

and the node fenced itself, causing relocation of some services.

So for some reason the ping command was not able to exit at all, I presume, despite the -c and -w options. I suppose there is an internal timeout defined for the monitor operation itself (defaulting to 75 seconds?), something like a pacemaker directive

op monitor interval="20" timeout="40"

and that when it expires the cluster considers the heuristic failed outright and the node fences itself. Is this right?

My default quorumd directive is this one, btw:

<quorumd device="/dev/mapper/mpquorum" interval="5" label="oraprquorum" log_facility="local4" log_level="7" tko="16" votes="1">

And in fact, when for some reason I have temporary problems with my SAN, I get something like:

qdiskd[1339]: <warning> qdisk cycle took more than 5 seconds to complete (34.540000)

and on the other node:

qdiskd[6025]: <debug> Node 1 missed an update (2/200)
qdiskd[6025]: <debug> Node 1 missed an update (3/200)
...

Can anyone give any insight into the message I got yesterday, which I had never seen before?

qdiskd[22145]: <info> Heuristic: 'ping -c1 -w1 10.4.5.250' DOWN - Exceeded timeout of 75 seconds

Do I have to suppose a bug in the ping command?

Thanks in advance,
Gianluca

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster