Node Failure Detection Problems

Hi,

I have some questions on configuring and tuning heartbeats and node-failure detection.

I have a 2-node cluster. Whenever a node fails, the failure seems to take a while to be detected.

First question: I have reduced the heartbeat hello_timer to 1 second and deadnode_timeout to 5 seconds. Is there an elegant way to do this in cluster.conf? Currently I'm setting /proc/cluster/config/cman/hello_timer with an init script hack.
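
For illustration, the init-script hack amounts to roughly the following. This is a minimal Python sketch, not the actual script: the function name apply_timers is hypothetical, and it assumes deadnode_timeout lives in the same /proc directory as hello_timer, which is not confirmed here. It does not address whether the values must be written before the node joins the cluster.

```python
from pathlib import Path

# Directory named in the post for hello_timer. That deadnode_timeout sits in
# the same directory is an assumption of this sketch.
CMAN_CONFIG_DIR = Path("/proc/cluster/config/cman")

# Values from the post: 1 second heartbeat, 5 second dead-node timeout.
TIMERS = {"hello_timer": 1, "deadnode_timeout": 5}


def apply_timers(config_dir=CMAN_CONFIG_DIR, timers=TIMERS):
    """Write each timer value to its entry and return what reads back."""
    applied = {}
    for name, seconds in timers.items():
        entry = Path(config_dir) / name
        entry.write_text(f"{seconds}\n")
        # Read the value back so a silently ignored write is noticed.
        applied[name] = entry.read_text().strip()
    return applied
```

Calling apply_timers() as root from an init script would set both values. Passing a different config_dir makes it possible to try the function against an ordinary directory first.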

Failure is detected by cman within 5 seconds, no problem, but clustat hangs during this time.

Second question: clustat continues to hang for around 10 more seconds (15 in total) before clurgmgrd registers the state change.

Does anyone know where this additional 10 seconds comes from? Is it configurable?

Here is the system log for the transition:
>>>
Mar 19 21:01:33 firthy kernel: CMAN: removing node emsy from the cluster : Missed too many heartbeats
Mar 19 21:01:33 firthy fenced[1878]: emsy not a cluster member after 0 sec post_fail_delay
Mar 19 21:01:33 firthy fenced[1878]: fencing node "emsy"
Mar 19 21:01:35 firthy fenced[1878]: fence "emsy" success
Mar 19 21:01:44 firthy clurgmgrd[3347]: <info> Magma Event: Membership Change
Mar 19 21:01:44 firthy clurgmgrd[3347]: <info> State change: emsy DOWN
<<<
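
The gaps in question can be read off the timestamps above. The Python sketch below does that arithmetic on the log excerpt. The function names and the ASSUMED_YEAR constant are illustrative only: syslog timestamps carry no year, so one is supplied purely to make the timestamps parseable.

```python
from datetime import datetime

LOG = """\
Mar 19 21:01:33 firthy kernel: CMAN: removing node emsy from the cluster : Missed too many heartbeats
Mar 19 21:01:33 firthy fenced[1878]: emsy not a cluster member after 0 sec post_fail_delay
Mar 19 21:01:33 firthy fenced[1878]: fencing node "emsy"
Mar 19 21:01:35 firthy fenced[1878]: fence "emsy" success
Mar 19 21:01:44 firthy clurgmgrd[3347]: <info> Magma Event: Membership Change
Mar 19 21:01:44 firthy clurgmgrd[3347]: <info> State change: emsy DOWN
"""

# Assumption: the year is arbitrary and only needed for parsing.
ASSUMED_YEAR = 2000


def parse_time(line):
    """Parse the 15-character syslog timestamp at the start of a line."""
    return datetime.strptime(f"{ASSUMED_YEAR} {line[:15]}", "%Y %b %d %H:%M:%S")


def first_time(lines, needle):
    """Return the timestamp of the first line containing needle."""
    for line in lines:
        if needle in line:
            return parse_time(line)
    raise ValueError(f"no log line contains {needle!r}")


def gaps(log_text):
    """Return the delays, in whole seconds, between the three key events."""
    lines = log_text.splitlines()
    removed = first_time(lines, "removing node")
    fenced = first_time(lines, "success")
    state_change = first_time(lines, "State change")
    return {
        "removal_to_fence": int((fenced - removed).total_seconds()),
        "fence_to_rgmanager": int((state_change - fenced).total_seconds()),
        "removal_to_rgmanager": int((state_change - removed).total_seconds()),
    }


if __name__ == "__main__":
    for name, seconds in gaps(LOG).items():
        print(f"{name}: {seconds}s")
```

On this excerpt, fencing completes 2 seconds after cman removes the node, and clurgmgrd reports the state change 9 seconds after that, 11 seconds after removal. The 9 seconds between the fence success and the clurgmgrd state change is the interval the second question is about.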

Many thanks,
James Firth


--

Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
