Re: Problem with ping as an heuristic with qdiskd

emmanuel segura <emi2fast@xxxxxxxxx> · Fri, 9 Mar 2012 15:39:43 +0100

Hello Gianluca

Do you have a cluster private network?

If the answer is yes, I recommend not using a heuristic: if your cluster's public network goes down, the cluster ends up in a fencing loop.

Or you can do something better: use pacemaker+corosync.

On 9 March 2012 15:14, Gianluca Cecchi <gianluca.cecchi@xxxxxxxxx> wrote:

Hello,

I have a cluster on RH EL 5.7 with a quorum disk and a heuristic.

Current versions of main cluster packages are:

rgmanager-2.0.52-21.el5_7.1

cman-2.0.115-85.el5_7.3

This is the loaded heuristic

Heuristic: 'ping -c1 -w1 10.4.5.250' score=1 interval=2 tko=200

Line in cluster.conf:

<heuristic interval="2" program="ping -c1 -w1 10.4.5.250" score="1" tko="200"/>

where 10.4.5.250 is the gateway of the production LAN.

From the ping man page:

-c count
    Stop after sending count ECHO_REQUEST packets. With deadline (-w)
    option, ping waits for count ECHO_REPLY packets, until the timeout
    expires.

-w deadline
    Specify a timeout, in seconds, before ping exits regardless of how many
    packets have been sent or received. In this case ping does not stop
    after count packet are sent, it waits either for deadline expire or
    until count probes are answered or for some error notification from
    network.

So I would expect the single ping command, run as a sanity check, to exit with a status code after at most 1 second, whether or not an echo reply has been received.

And in fact I had no particular problem for many months.

As a test, using an IP on an unreachable LAN (say 10.4.6.5):

date
n=0
while [ "$n" -lt 20 ]
do
  ping -c1 -w1 10.4.6.5
  sleep 2
  n=$((n + 1))
done
date

Output is

Fri Mar  9 11:59:02 CET 2012

PING 10.4.6.5 (10.4.6.5) 56(84) bytes of data.

--- 10.4.6.5 ping statistics ---

2 packets transmitted, 0 received, 100% packet loss, time 1000ms

...

--- 10.4.6.5 ping statistics ---

2 packets transmitted, 0 received, 100% packet loss, time 999ms

Fri Mar  9 12:00:02 CET 2012

So 60 seconds in total: 20 iterations, each a 1-second ping deadline plus a 2-second sleep.
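The total above only shows that the runs average about 3 seconds each; a single slow run would be easy to miss. A minimal sketch that records the exit status and elapsed time of every run follows. The function names `timed_run` and `probe_loop` are made up for illustration, and the timing relies on `date +%s`, so it is only accurate to the second.

```shell
#!/bin/sh
# Run a command once and print "<exit status> <elapsed seconds>".
# Elapsed time has one-second resolution because it relies on `date +%s`.
timed_run() {
    local start end status
    start=$(date +%s)
    status=0
    "$@" >/dev/null 2>&1 || status=$?
    end=$(date +%s)
    echo "$status $((end - start))"
}

# Probe a target repeatedly with the same ping options as the heuristic
# and log one line per run, so a run that overshoots its deadline stands out.
# Usage: probe_loop 10.4.6.5 20
probe_loop() {
    local target count n result
    target=$1
    count=${2:-20}
    n=0
    while [ "$n" -lt "$count" ]; do
        result=$(timed_run ping -c1 -w1 "$target")
        echo "$(date '+%H:%M:%S') run=$n status=${result% *} elapsed=${result#* }s"
        sleep 2
        n=$((n + 1))
    done
}
```

Any line whose elapsed value is well above 1 second would point at ping itself overrunning its deadline rather than at qdiskd.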

In case of gateway reachability problems (also tested with an iptables rule that drops outgoing ICMP echo requests) I would then have:

qdiskd[2780]: <debug> Heuristic: 'ping -c1 -w1 10.4.5.250' missed (1/200)

The strange thing I got last night was this single line:

qdiskd[22145]: <info> Heuristic: 'ping -c1 -w1 10.4.5.250' DOWN - Exceeded timeout of 75 seconds

and the node fenced itself, causing relocation of some services.

So for some reason the ping command was not able to exit at all, I presume, despite the -c and -w options.
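If the concern is a heuristic program that never returns, one way to guard against it is to run the program under a hard wall-clock limit. The following is a minimal sketch in plain shell, not something qdiskd provides; the function name `run_bounded` is hypothetical. It starts the command in the background, starts a watchdog that sends SIGKILL after the limit, and returns the command's exit status (137 if the watchdog had to kill it).

```shell
#!/bin/sh
# run_bounded LIMIT CMD [ARGS...]
# Runs CMD and kills it with SIGKILL if it is still alive after LIMIT seconds.
# Returns CMD's exit status, or 137 (128 + 9) if it had to be killed.
run_bounded() {
    local limit cmd_pid watchdog_pid status
    limit=$1
    shift
    "$@" &
    cmd_pid=$!
    # Watchdog: detached from our stdout/stderr so it never holds a pipe open.
    ( sleep "$limit"; kill -9 "$cmd_pid" 2>/dev/null ) >/dev/null 2>&1 &
    watchdog_pid=$!
    status=0
    wait "$cmd_pid" || status=$?
    # The command has finished (or was killed); the watchdog is no longer needed.
    kill "$watchdog_pid" 2>/dev/null || true
    wait "$watchdog_pid" 2>/dev/null || true
    return "$status"
}

# Example: give the ping from the heuristic at most 3 seconds in total.
# run_bounded 3 ping -c1 -w1 10.4.5.250
```

A small wrapper script built around this could then be used as the heuristic's program instead of the bare ping command. Where the coreutils `timeout` command is available it does the same job with less code.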

I suppose there is a condition that triggers an internal timeout defined for the monitor operation itself (defaulting to 75 seconds?), something like a pacemaker directive:

op monitor interval="20" timeout="40"

At that point the cluster considers the heuristic failed altogether and the node fences itself.

Is this right?

My default quorumd directive is this one, btw:

<quorumd device="/dev/mapper/mpquorum" interval="5" label="oraprquorum"
log_facility="local4" log_level="7" tko="16" votes="1">
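For reference, the time windows implied by the two configuration lines quoted in this message can be worked out with simple arithmetic. The sketch below only multiplies the quoted values. The last figure is a guess: 75 happens to equal the quorumd interval times (tko - 1), but whether qdiskd actually derives its heuristic timeout that way is an assumption, not something confirmed here.

```shell
#!/bin/sh
# Values copied from the configuration quoted in this message.
heur_interval=2    # <heuristic interval="2">
heur_tko=200       # <heuristic tko="200">
qd_interval=5      # <quorumd interval="5">
qd_tko=16          # <quorumd tko="16">

# Seconds of consecutive missed heuristic runs implied by interval * tko.
heur_window=$((heur_interval * heur_tko))
# Seconds of missed quorum disk updates implied by interval * tko.
qd_window=$((qd_interval * qd_tko))
# Guess only: one quorumd interval less than the quorumd window.
guess_timeout=$((qd_interval * (qd_tko - 1)))

echo "heuristic window: ${heur_window}s"
echo "quorumd window:   ${qd_window}s"
echo "guessed timeout:  ${guess_timeout}s"
```

If the guess holds, a single heuristic run that hangs would be declared DOWN after 75 seconds, long before the 400 seconds of repeated misses that tko="200" appears to allow.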

And in fact, when for some reason I have temporary problems with my SAN, I get something like:

qdiskd[1339]: <warning> qdisk cycle took more than 5 seconds to complete (34.540000)

and on the other node

qdiskd[6025]: <debug> Node 1 missed an update (2/200)

qdiskd[6025]: <debug> Node 1 missed an update (3/200)

...

Can anyone give any insight into the message I got yesterday, which I had never seen before?

qdiskd[22145]: <info> Heuristic: 'ping -c1 -w1 10.4.5.250' DOWN - Exceeded timeout of 75 seconds

Do I have to suppose a bug in the ping command?

Thanks in advance,

Gianluca

--

Linux-cluster mailing list

Linux-cluster@xxxxxxxxxx

https://www.redhat.com/mailman/listinfo/linux-cluster

-- 
This is my life, and I live it for as long as God wills
