On 29/01/14 10:14 AM, Nicolas Kukolja wrote:
> Hello,
>
> I have a cluster with three nodes (RHEL 5.5), and every server has an
> ipmilan module configured as its fencing device in my cluster config.
>
> Now, if one of the nodes is not reachable and its fencing device is not
> reachable either, the other two nodes try to fence this node again
> and again, without ever stopping.
>
> Only when this node is reachable (and fenceable) again does the fencing
> proceed successfully and the cluster service move to another node.
>
> Why does the service not move to another node earlier? I think it's a
> common error scenario that one node and its fencing device are both
> unreachable, e.g. due to power problems.
>
> How do I have to change the cluster configuration to get the
> behaviour I expect?
>
> Thanks in advance for any suggestions...
>
> Kind regards,
> Nicolas

This behaviour is expected and by design. The healthy nodes can't safely
recover until they know what state the lost node is in. The cluster is
not allowed to simply assume that the lost node is dead (no way to tell
"disconnected but working" from "smouldering pile of rubble").
The way I deal with this is a second fence method. I use a pair of
switched PDUs behind each node (one PDU for the first PSU in each node
and the second PDU for the second PSU in each node). This way, if IPMI
fencing fails, the nodes will connect to the PDUs and cut the power to
the lost node, thus ensuring it's off and allowing prompt recovery of
services.
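
As a rough sketch of what a second fence method could look like in
cluster.conf (fragments only, not a complete file): the node name, device
names, addresses, credentials and outlet numbers below are made up for
illustration, and fence_apc_snmp is only an assumption about the PDU model.
The attribute that names the action is written as "action" here; older
agents on RHEL 5 may expect "option" instead, so check the man page of the
agent you actually use.

```xml
<!-- Sketch only: names, addresses, ports and the PDU agent are
     illustrative assumptions, not taken from the original post. -->
<clusternode name="node1.example.com" nodeid="1">
  <fence>
    <!-- First method: tried first. -->
    <method name="ipmi">
      <device name="ipmi_n1" action="reboot"/>
    </method>
    <!-- Second method: tried only if the first one fails. Both outlets
         are switched off before either is switched back on, so the
         node loses power on both PSUs. -->
    <method name="pdu">
      <device name="pdu1" port="1" action="off"/>
      <device name="pdu2" port="1" action="off"/>
      <device name="pdu1" port="1" action="on"/>
      <device name="pdu2" port="1" action="on"/>
    </method>
  </fence>
</clusternode>

<fencedevices>
  <fencedevice name="ipmi_n1" agent="fence_ipmilan" ipaddr="192.0.2.11" login="admin" passwd="secret"/>
  <fencedevice name="pdu1" agent="fence_apc_snmp" ipaddr="192.0.2.21"/>
  <fencedevice name="pdu2" agent="fence_apc_snmp" ipaddr="192.0.2.22"/>
</fencedevices>
```

The methods are tried in the order they are listed, and every device line
within a method has to succeed for that method to count as a successful
fence.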
This might help:
* https://alteeve.ca/w/AN!Cluster_Tutorial_2#Why_Switched_PDUs.3F
* https://alteeve.ca/w/AN!Cluster_Tutorial_2#A_Map.21
* https://alteeve.ca/w/AN!Cluster_Tutorial_2#Using_the_Fence_Devices
Cheers
--
Digimer