Understanding heartbeat timeouts in RHEL4

"Elias, Michael" <EliasM@xxxxxxx> · Tue, 5 May 2009 13:48:09 -0400

I am trying to understand how these timers interact with
each other.

In a RHEL4 cluster, the heartbeat defaults are:

hello_timer: 5
max_retries: 5
deadnode_timeout: 21

My understanding is that a heartbeat message is sent every 5 seconds. If a
node fails to receive a response, it starts a deadnode counter at 21 seconds.
It will also try to send up to 5 more heartbeat requests. What is the interval
of those retries? If none of those requests receives a response, 5 seconds
pass, there are 16 seconds left on the deadnode timer, and we try up to 5
more times to get a response. This goes on until the 4th iteration of
hello_timer, when it tries again up to 5 times and fails. We then hit the
21-second deadnode_timeout, fenced takes over, and the node is rebooted.
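
To make the arithmetic concrete, here is a minimal Python sketch of the timeline as I understand it. It is only a model of my reading of the timers, not of what cman actually does. The function name missed_hello_timeline is made up for illustration, and max_retries is left out on purpose because I do not know how its retry interval fits in.

```python
# Sketch of the timeline described in this post; NOT cman's real logic.
HELLO_TIMER = 5        # seconds between heartbeat messages (default quoted above)
DEADNODE_TIMEOUT = 21  # seconds of silence before a node is declared dead (default quoted above)


def missed_hello_timeline(hello_timer=HELLO_TIMER, deadnode_timeout=DEADNODE_TIMEOUT):
    """Return (elapsed, remaining) pairs for every hello interval that
    expires before the deadnode timeout, counted from the first missed reply.

    Hypothetical helper, used only to illustrate the arithmetic.
    """
    timeline = []
    elapsed = hello_timer
    while elapsed < deadnode_timeout:
        timeline.append((elapsed, deadnode_timeout - elapsed))
        elapsed += hello_timer
    return timeline


timeline = missed_hello_timeline()
for elapsed, remaining in timeline:
    print(f"t={elapsed:2d}s  deadnode timer has {remaining:2d}s left")
print(f"t={DEADNODE_TIMEOUT}s  deadnode timeout reached after {len(timeline)} hello intervals")
```

With the defaults, this gives four hello intervals (at 5, 10, 15 and 20 seconds) before the 21-second mark, which is where the "4th iteration" above comes from.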

Is my understanding of this correct?

Thanks for any help.

Michael

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster