Re: totem token & post_fail_delay question

Christine Caulfield <ccaulfie@xxxxxxxxxx> · Tue, 26 Aug 2014 09:23:14 +0100

On 26/08/14 07:56, Vasil Valchev wrote:
Hello,

I have a cluster that sometimes has intermittent network issues on the
heartbeat network.
Unfortunately improving the network is not an option, so I am looking
for a way to tolerate longer interruptions.

Previously it seemed to me the post_fail_delay option is suitable, but
after some research it might not be what I am looking for.

If I am correct, when a member leaves (due to token timeout) the cluster
will wait the post_fail_delay before fencing. If the member rejoins
before that, it will still be fenced, because it has previous state?
From a recent fencing on this cluster there is a strange message:

Aug 24 06:20:45 node2 openais[29048]: [MAIN ] Not killing node node1cl
despite it rejoining the cluster with existing state, it has a lower node ID

What does this mean?

It's an attempt by cman to sort out which node to kill in the situation
where a node rejoins too quickly. If both nodes try to send a 'kill'
message then both nodes would leave the cluster, leaving you with no
active nodes. So cman (and fencing) prioritise the node with the lowest
nodeID in an attempt at a tie-break. You should see a corresponding
message on the other node:
"Killing node %s because it has rejoined the cluster with existing state
and has higher node ID"
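
As a rough illustration of that tie-break (a sketch only, not cman's
actual code), the decision can be modelled as below. The function names
and the node IDs 1 and 2 are assumptions made for the example; only the
two message texts come from the logs quoted in this thread.

```python
def should_kill(local_id: int, rejoined_id: int) -> bool:
    """Hypothetical model of the tie-break: the node with the lower
    node ID survives, so only a peer with a HIGHER ID gets killed."""
    return rejoined_id > local_id


def describe(local_id: int, rejoined_name: str, rejoined_id: int) -> str:
    """Return the log message the local node would be expected to print."""
    if should_kill(local_id, rejoined_id):
        return (f"Killing node {rejoined_name} because it has rejoined the "
                "cluster with existing state and has higher node ID")
    return (f"Not killing node {rejoined_name} despite it rejoining the "
            "cluster with existing state, it has a lower node ID")


# Assumed IDs: node1cl = 1, node2 = 2 (not stated in the original mail).
print(describe(2, "node1cl", 1))  # view from node2
print(describe(1, "node2", 2))    # view from node1cl
```

With those assumed IDs, exactly one side decides to send the kill, which
is the point of the tie-break.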

And lastly is increasing the totem token timeout the way to go?

If there is no option for improving the network situation then, yes,
increasing the token timeout is probably your best option.
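
A minimal sketch of making that change, assuming a cman-style
cluster.conf in which the timeout is the "token" attribute (in
milliseconds) of a <totem> element directly under <cluster>. The helper
name, the sample config and the 30000 value are illustrative
assumptions, not recommendations; check your own cluster.conf and the
documentation for your release before applying anything like this.

```python
import xml.etree.ElementTree as ET


def set_token_timeout(conf_xml: str, token_ms: int) -> str:
    """Return a copy of the config text with the totem token set to token_ms.

    Also increments config_version, on the assumption that the cluster
    only picks up a changed config when the version number increases.
    """
    root = ET.fromstring(conf_xml)
    totem = root.find("totem")
    if totem is None:
        # No <totem> element yet, so add one under <cluster>.
        totem = ET.SubElement(root, "totem")
    totem.set("token", str(token_ms))
    version = int(root.get("config_version", "0"))
    root.set("config_version", str(version + 1))
    return ET.tostring(root, encoding="unicode")


# Hypothetical, heavily trimmed config used only for demonstration.
example = '<cluster name="example" config_version="4"><clusternodes/></cluster>'
updated = set_token_timeout(example, 30000)
print(updated)
```

The new file still has to be distributed to every node by whatever
method you normally use; that step is not shown here.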

Chrissie

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster