Re: totem token & post_fail_delay question

On 26/08/14 07:56, Vasil Valchev wrote:
Hello,

I have a cluster that sometimes has intermittent network issues on the
heartbeat network.
Unfortunately improving the network is not an option, so I am looking
for a way to tolerate longer interruptions.

Previously it seemed to me the post_fail_delay option is suitable, but
after some research it might not be what I am looking for.

If I am correct, when a member leaves (due to token timeout) the cluster
will wait the post_fail_delay before fencing. If the member rejoins
before that, it will still be fenced, because it has previous state?
From a recent fencing on this cluster there is a strange message:

Aug 24 06:20:45 node2 openais[29048]: [MAIN ] Not killing node node1cl despite it rejoining the cluster with existing state, it has a lower node ID

What does this mean?


It's an attempt by cman to sort out which node to kill in the situation where a node rejoins too quickly. If both nodes try to send a 'kill' message then both nodes would leave the cluster, leaving you with no active nodes. So cman (and fencing) prioritise the node with the lowest nodeID in an attempt at a tie-break. You should see a corresponding message on the other node: "Killing node %s because it has rejoined the cluster with existing state and has higher node ID"
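
As a rough illustration of that tie-break (a sketch only, not cman's actual code; the function name and the integer node IDs are assumptions made up for this example), each node could apply a rule like the following when a peer rejoins with existing state:

```python
def should_kill_rejoined_node(local_node_id: int, rejoined_node_id: int) -> bool:
    """Hypothetical model of the tie-break described above.

    A node sends the 'kill' only if the peer that rejoined with existing
    state has a HIGHER node ID than its own. The node holding the lower
    ID is therefore the one that survives.
    """
    return rejoined_node_id > local_node_id


# Two-node example with assumed IDs 1 and 2.
# View from node 2: node 1 rejoined, but it has the lower ID, so no kill.
node2_kills_node1 = should_kill_rejoined_node(local_node_id=2, rejoined_node_id=1)
# View from node 1: node 2 rejoined and has the higher ID, so kill it.
node1_kills_node2 = should_kill_rejoined_node(local_node_id=1, rejoined_node_id=2)
```

Because the comparison is strict and node IDs are unique, exactly one side of any pair issues the kill, so the cluster is never left with no active nodes.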


And lastly is increasing the totem token timeout the way to go?


If there is no option for improving the network situation then, yes, increasing token timeout is probably your best option.
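
For illustration, here is a minimal sketch of making that change, assuming a cman-style cluster.conf in which the timeout is the token attribute (in milliseconds) of a totem element directly under cluster, and in which config_version has to be incremented on every edit. The helper name, the sample XML and the 30000 ms value are made up for the example and are not a recommendation; check cluster.conf(5) for your release before relying on it.

```python
import xml.etree.ElementTree as ET


def set_token_timeout(conf_xml: str, token_ms: int) -> str:
    """Return cluster.conf text with <totem token="..."> set to token_ms.

    Assumes <totem> is a direct child of <cluster>; the element is
    created if it is missing. config_version is bumped by one so the
    edited file is seen as a newer configuration.
    """
    root = ET.fromstring(conf_xml)
    totem = root.find("totem")
    if totem is None:
        totem = ET.SubElement(root, "totem")
    totem.set("token", str(token_ms))
    old_version = int(root.get("config_version", "0"))
    root.set("config_version", str(old_version + 1))
    return ET.tostring(root, encoding="unicode")


# Illustrative input only, not taken from the poster's cluster.
EXAMPLE_CONF = '<cluster name="example" config_version="7"><clusternodes /></cluster>'
updated_conf = set_token_timeout(EXAMPLE_CONF, 30000)
```

The trade-off of a longer token timeout is that a genuinely dead node also takes that much longer to be detected and fenced.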

Chrissie




