Elias, Michael wrote:
> I am trying to understand how these timers interact with each other.
>
> In a RHEL4 cluster the heartbeat defaults are:
>
> hello_timer:5
> max_retries:5
> deadnode_timeout:21
>
> Meaning a heartbeat message is sent every 5 seconds; if it fails to
> receive a response it will start a deadnode counter at 21 seconds. It
> will also try to send 5 more heartbeat requests. What is the interval of
> those retries? If none of those requests receive a response, 5 seconds
> pass, there are 15 seconds left on the deadnode timer, and we try up to 5
> times to get a response. This goes on until we hit the 4th iteration
> of the hello timer; it tries again up to 5 times and fails. We then hit
> 21 seconds on the deadnode timer, fenced takes over, and wham: reboot.
>
> Is my understanding of this correct?

No, I'm afraid it isn't :-)

max_retries has nothing to do with the heartbeat. It applies to cluster
messages, such as service join requests, clvmd messages or the messages
used in the membership protocol.

So the heartbeat system is just a 5 second heartbeat, and after 21 seconds
without a heartbeat the node will be evicted from the cluster and (usually)
fenced.

The same happens for data messages if max_retries is exceeded. The retry
period here starts at 1 second and increases each time to avoid filling
the ethernet buffers.

I hope this helps,

Chrissie
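
To make the difference between the two mechanisms concrete, here is a minimal Python sketch. It is not cman code, and the function names are hypothetical. The linear growth of the retry delay (1 s, 2 s, 3 s, ...) is an assumption: the reply above only says the period starts at 1 second and increases each time.

```python
def unheard_heartbeat_times(hello_timer=5, deadnode_timeout=21):
    """Times (seconds since the last heartbeat heard) at which a heartbeat
    was expected but never arrived, before the deadnode timeout fires."""
    # Heartbeats are expected every hello_timer seconds; only those
    # strictly before deadnode_timeout count.
    return list(range(hello_timer, deadnode_timeout, hello_timer))


def retry_send_times(max_retries=5, first_delay=1, step=1):
    """Offsets (seconds after the original send) of each message retry.

    ASSUMPTION: the delay grows by `step` seconds per retry. The actual
    growth rule is not given in this thread.
    """
    times = []
    elapsed = 0
    delay = first_delay
    for _ in range(max_retries):
        elapsed += delay       # wait out the current retry period
        times.append(elapsed)  # the retry is sent at this offset
        delay += step          # assumed linear back-off
    return times


if __name__ == "__main__":
    print("unheard heartbeats at:", unheard_heartbeat_times())
    print("message retries at:   ", retry_send_times())
```

With the defaults, four heartbeats (at 5, 10, 15 and 20 seconds) go unheard before the 21 second deadnode timeout, with no retries involved. Separately, under the assumed linear back-off, the five retries of a cluster message would go out at 1, 3, 6, 10 and 15 seconds after the original send.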