Hi,
I have some questions on configuring and tuning heartbeats and
node-failure detection.
I have a 2-node cluster. Whenever a node fails, the surviving node seems
to take a while to detect the failure.
First question: I have reduced the heartbeat interval (hello_timer) to
1 second, and deadnode_timeout to 5 seconds. Is there an elegant way to
set these in cluster.conf? Currently I'm setting
/proc/cluster/config/cman/hello_timer with an init script hack.
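For reference, the hack amounts to roughly the following. This is a sketch rather than the exact script: the function name set_cman_timer and the CMAN_PROC_DIR variable are made up for illustration, and I am assuming deadnode_timeout sits in the same /proc directory as hello_timer.

```shell
#!/bin/sh
# Sketch of the init-script workaround: write cman timer values under /proc.
# CMAN_PROC_DIR is a hypothetical override so the function can be tried
# against a scratch directory instead of the real /proc tree.
CMAN_PROC_DIR="${CMAN_PROC_DIR:-/proc/cluster/config/cman}"

# set_cman_timer NAME SECONDS: write SECONDS into $CMAN_PROC_DIR/NAME.
set_cman_timer() {
    name="$1"
    value="$2"
    file="$CMAN_PROC_DIR/$name"
    if [ ! -w "$file" ]; then
        echo "cannot write $file (is cman loaded?)" >&2
        return 1
    fi
    echo "$value" > "$file"
}

# Only touch the timers when the cman /proc directory is actually present.
if [ -d "$CMAN_PROC_DIR" ]; then
    set_cman_timer hello_timer 1        # heartbeat every 1 second
    # Assumption: deadnode_timeout lives next to hello_timer.
    set_cman_timer deadnode_timeout 5   # declare a node dead after 5 seconds
fi
```

It has to run after cman is loaded on every boot, which is why I would rather have the values in cluster.conf.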
Failure is detected by cman within 5 seconds, no problem, but clustat
hangs during this time.
Second question: clustat continues to hang for around 10 more seconds
(15 in total) before clurgmgrd registers the state change.
Does anyone know where this additional 10 seconds comes from? Is it
configurable?
Here is the system log for the transition:
>>>
Mar 19 21:01:33 firthy kernel: CMAN: removing node emsy from the cluster : Missed too many heartbeats
Mar 19 21:01:33 firthy fenced[1878]: emsy not a cluster member after 0 sec post_fail_delay
Mar 19 21:01:33 firthy fenced[1878]: fencing node "emsy"
Mar 19 21:01:35 firthy fenced[1878]: fence "emsy" success
Mar 19 21:01:44 firthy clurgmgrd[3347]: <info> Magma Event: Membership Change
Mar 19 21:01:44 firthy clurgmgrd[3347]: <info> State change: emsy DOWN
<<<
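To pin down where the time goes, the timestamps in the log can simply be subtracted. A small sketch (the helper name to_seconds is made up, and it assumes all events fall on the same day; syslog only has 1-second resolution, so each figure is good to about a second):

```shell
#!/bin/sh
# to_seconds HH:MM:SS: convert a syslog time-of-day to seconds since midnight.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

removed=$(to_seconds 21:01:33)   # CMAN removes emsy from the cluster
fenced=$(to_seconds 21:01:35)    # fenced reports fence success
rgdown=$(to_seconds 21:01:44)    # clurgmgrd logs "State change: emsy DOWN"

echo "removal to fence success: $((fenced - removed)) s"
echo "fence success to DOWN:    $((rgdown - fenced)) s"
echo "removal to DOWN:          $((rgdown - removed)) s"
```

So fencing itself accounts for about 2 seconds, and the remaining 9 or so pass between the fence success and clurgmgrd noticing the membership change. That second gap is the one I am asking about.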
Many thanks,
James Firth