On 26/08/14 07:56, Vasil Valchev wrote:
Hello,

I have a cluster that sometimes has intermittent network issues on the heartbeat network. Unfortunately improving the network is not an option, so I am looking for a way to tolerate longer interruptions.

Previously it seemed to me the post_fail_delay option is suitable, but after some research it might not be what I am looking for. If I am correct, when a member leaves (due to token timeout) the cluster will wait the post_fail_delay before fencing. If the member rejoins before that, it will still be fenced, because it has previous state?

From a recent fencing on this cluster there is a strange message:

Aug 24 06:20:45 node2 openais[29048]: [MAIN ] Not killing node node1cl despite it rejoining the cluster with existing state, it has a lower node ID

What does this mean?
It's an attempt by cman to sort out which node to kill in the situation where a node rejoins too quickly. If both nodes try to send a 'kill' message then both nodes would leave the cluster, leaving you with no active nodes. So cman (and fencing) prioritise the node with the lowest nodeID in an attempt at a tie-break. You should see a corresponding message on the other node:

"Killing node %s because it has rejoined the cluster with existing state and has higher node ID"
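As a rough illustration of that tie-break (a simplified Python sketch, not the actual cman code, which is written in C), each node compares its own node ID with the ID of the node that rejoined with existing state, and only the node with the lower ID sends the kill. The function names, the node IDs 1 and 2, and the name node2cl are assumptions made up for the example.

```python
def should_kill_rejoined_node(local_node_id: int, rejoined_node_id: int) -> bool:
    """Return True if the local node should kill the node that rejoined.

    Only the side with the lower node ID kills, so the two nodes can
    never kill each other at the same time.
    """
    return rejoined_node_id > local_node_id


def log_message(local_node_id: int, rejoined_name: str, rejoined_node_id: int) -> str:
    """Build the log line the local node would be expected to print."""
    if should_kill_rejoined_node(local_node_id, rejoined_node_id):
        return (f"Killing node {rejoined_name} because it has rejoined the "
                "cluster with existing state and has higher node ID")
    return (f"Not killing node {rejoined_name} despite it rejoining the "
            "cluster with existing state, it has a lower node ID")


if __name__ == "__main__":
    # Assumed IDs: node1cl = 1, node2cl = 2 (hypothetical values).
    print(log_message(2, "node1cl", 1))  # view from the node with ID 2
    print(log_message(1, "node2cl", 2))  # view from the node with ID 1
```

With those assumed IDs, the node with ID 2 logs the "Not killing" line and the node with ID 1 logs the "Killing" line, so exactly one node survives.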
And lastly is increasing the totem token timeout the way to go?
If there is no option for improving the network situation then, yes, increasing the token timeout is probably your best option.
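For a cman-based cluster the token timeout is normally set with a totem element in /etc/cluster/cluster.conf, with the value in milliseconds. The fragment below is only a sketch: the cluster name, the config_version and the 30000 ms value are placeholders rather than recommendations, and config_version has to be incremented from whatever your file currently contains before the change is propagated.

```xml
<!-- Illustrative fragment only: name, config_version and token value are assumed -->
<cluster name="mycluster" config_version="43">
  <!-- token timeout in milliseconds; choose a value longer than the outages you expect -->
  <totem token="30000"/>
  <!-- remaining sections (clusternodes, fencedevices, ...) left unchanged -->
</cluster>
```

Bear in mind that a longer token timeout also means a genuinely dead node takes longer to be detected and fenced.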
Chrissie