Do you have a private cluster network?
If the answer is yes, I recommend not using a heuristic: if the cluster's public network goes down, the cluster ends up in a fencing loop.
Or you can do something better and use pacemaker+corosync.
On 9 March 2012 at 15:14, Gianluca Cecchi <gianluca.cecchi@xxxxxxxxx> wrote:
Hello,
I have a cluster on RHEL 5.7 with a quorum disk and a heuristic.
Current versions of main cluster packages are:
rgmanager-2.0.52-21.el5_7.1
cman-2.0.115-85.el5_7.3
This is the loaded heuristic
Heuristic: 'ping -c1 -w1 10.4.5.250' score=1 interval=2 tko=200
Line in cluster.conf:
<heuristic interval="2" program="ping -c1 -w1 10.4.5.250" score="1" tko="200"/>
where 10.4.5.250 is the gateway of the production LAN.
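As a rough illustration of what those two attributes imply, the sketch below multiplies them out. It assumes the heuristic is polled every interval seconds and marked DOWN only after tko consecutive misses; the variable names are illustrative and the values are copied from the line above.

```shell
# Values copied from the <heuristic> line in cluster.conf.
heuristic_interval=2   # seconds between heuristic runs
heuristic_tko=200      # consecutive misses tolerated before DOWN

# Time a plain "missed" heuristic can keep failing before it is
# declared DOWN (assumption: simple interval * tko arithmetic).
heuristic_window=$((heuristic_interval * heuristic_tko))
echo "heuristic window: ${heuristic_window} s"
```

With these settings an unreachable gateway alone would need several minutes of consecutive misses before the heuristic is marked DOWN.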
From the ping man page:
-c count
Stop after sending count ECHO_REQUEST packets. With deadline (-w)
option, ping waits for count ECHO_REPLY packets, until the timeout
expires.
-w deadline
Specify a timeout, in seconds, before ping exits regardless of how many
packets have been sent or received. In this case ping does not stop
after count packet are sent, it waits either for deadline expire or
until count probes are answered or for some error notification from
network.
So I would expect the single ping command, run as a sanity check,
to exit with a return code after at most 1 second,
regardless of whether an echo reply has been received.
And in fact I had no particular problem for many months.
As a test, using an IP on an unreachable LAN (say 10.4.6.5):
date
n=0
while [ $n -lt 20 ]
do
ping -c1 -w1 10.4.6.5
sleep 2
n=$(expr $n + 1)
done
date
Output is
Fri Mar 9 11:59:02 CET 2012
PING 10.4.6.5 (10.4.6.5) 56(84) bytes of data.
--- 10.4.6.5 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1000ms
...
--- 10.4.6.5 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 999ms
Fri Mar 9 12:00:02 CET 2012
so 60 seconds in total: 20 iterations, each a 1-second ping plus a 2-second sleep.
In case of gateway reachability problems (also tested with an iptables
rule that drops outgoing ICMP echo requests) I would then have:
qdiskd[2780]: <debug> Heuristic: 'ping -c1 -w1 10.4.5.250' missed (1/200)
The strange thing I got last night was this single line:
qdiskd[22145]: <info> Heuristic: 'ping -c1 -w1 10.4.5.250' DOWN - Exceeded timeout of 75 seconds
followed by the node fencing itself, which caused some services to relocate.
So for some reason the ping command was not able to exit at all, I presume,
despite the -c and -w options.
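One way to guard against a check that never returns is to wrap it in a hard time limit. The following is only a minimal sketch under that assumption: `run_with_limit` is a hypothetical helper, not a qdiskd feature, and it uses plain background jobs rather than relying on a `timeout` utility being installed.

```shell
#!/bin/sh
# Hypothetical helper: run a command, but kill it with SIGKILL if it
# has not exited after LIMIT seconds. Returns the command's exit
# status (137 on most shells when the watchdog had to kill it).
run_with_limit() {
    limit=$1
    shift
    "$@" >/dev/null 2>&1 &            # start the check in the background
    cmd_pid=$!
    (
        sleep "$limit"                # watchdog: wait, then kill the check
        kill -9 "$cmd_pid" 2>/dev/null
    ) >/dev/null 2>&1 &
    watchdog_pid=$!
    status=0
    wait "$cmd_pid" || status=$?      # collect the check's exit status
    kill "$watchdog_pid" 2>/dev/null || true   # check finished: stop watchdog
    wait "$watchdog_pid" 2>/dev/null || true
    return "$status"
}

# Example call (limit of 3 seconds is an illustrative value):
#   run_with_limit 3 ping -c1 -w1 10.4.5.250
```

A heuristic could then point at a small script calling this helper instead of calling ping directly. Note that SIGKILL cannot remove a process stuck in uninterruptible sleep, so this does not cover every way a command can hang.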
I suppose some condition triggered an internal timeout defined for the
monitor operation itself (defaulting to 75 seconds?),
something like the pacemaker directive
op monitor interval="20" timeout="40"
At that point the cluster considered the heuristic failed altogether
and the node fenced itself.
Is this right?
My default quorumd directive is this one, btw:
<quorumd device="/dev/mapper/mpquorum" interval="5" label="oraprquorum"
log_facility="local4" log_level="7" tko="16" votes="1">
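For reference, a small sketch of the timing implied by the quorumd values shown here, assuming a node is evicted after tko missed cycles of interval seconds each. The variable names are illustrative. The second figure is only a numerical observation: it matches the 75 seconds in the log message, but whether qdiskd derives its heuristic timeout this way is an assumption that is not confirmed here.

```shell
# Values copied from the <quorumd> line in cluster.conf.
qdisk_interval=5   # seconds per quorum disk read/write cycle
qdisk_tko=16       # missed cycles before a node is declared dead

# Assumption: simple interval * tko arithmetic for the eviction window.
qdisk_window=$((qdisk_interval * qdisk_tko))
echo "quorum disk eviction window: ${qdisk_window} s"

# Unverified observation: one cycle less than the full window.
qdisk_window_minus_one=$((qdisk_interval * (qdisk_tko - 1)))
echo "interval * (tko - 1): ${qdisk_window_minus_one} s"
```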
And in fact when for some reason I have temporary problems with my
SAN, I get something like:
qdiskd[1339]: <warning> qdisk cycle took more than 5 seconds to complete (34.540000)
and on the other node
qdiskd[6025]: <debug> Node 1 missed an update (2/200)
qdiskd[6025]: <debug> Node 1 missed an update (3/200)
...
Can anyone give any insight into the message I got yesterday, which I
had never seen before?
qdiskd[22145]: <info> Heuristic: 'ping -c1 -w1 10.4.5.250' DOWN - Exceeded timeout of 75 seconds
Should I suspect a bug in the ping command?
Thanks in advance,
Gianluca
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
--
this is my life and I live it for as long as God wills