Re: Tripp Lite switched PDU fence agent; exists?

"Fabio M. Di Nitto" <fdinitto@xxxxxxxxxx> · Fri, 18 Mar 2011 21:40:55 +0100

On 3/18/2011 9:20 PM, bergman@xxxxxxxxxxxx wrote:
> The pithy ruminations from "Fabio M. Di Nitto" <fdinitto@xxxxxxxxxx> on "Re:  Tripp Lite switched PDU fence agent; exists?" were:
> 
> 
> => 
> => Wouldn't it be possible for the agent to:
> => 
> => 1) issue OFF command
> => 2) either poll for OFF status or wait > $known_random_max_delay
> => 3) issue ON command
> => 4) profit?
> 
> 
> Yes, but here's the problem:
> 
> 	0) there's a condition whereby cluster communication is lost between nodeA and nodeB
> 	1) the agent on nodeA sends OFF command to PDU to shut down nodeB
> 	2) the agent on nodeA polls for OFF status while waiting > $known_random_max_delay
> 	3) the agent on nodeB sends OFF command to PDU to shut down nodeA
> 	4) nodeB shuts down
> 	5) nodeA shuts down
> 
> The PDU responds quickly to network connections (i.e., telnet & commands to shut down a power outlet). The PDU accepts multiple network sessions (i.e., from nodeA and nodeB). The PDU delays executing the commands, potentially leaving enough time for multiple nodes to each send a command to shut down the "other" node.

This is true of virtually all two-node clusters, and it's a very
well-known fencing race condition.

There are several mechanisms to avoid it:

1) fence delay option. One node sleeps N seconds before it is allowed
to fence
2) both cluster heartbeat traffic and fence devices are on the same
network (if node A can't access the net, it also can't access the fence
device)
3) qdiskd + heuristics
4) use a fence device that allows only one connection at a time (one
node gets access, the other is refused)

Note that the race itself is independent of how long the device takes
to fence the node.
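
To make option 1 concrete, here is a toy model of the race in Python.
It is only a sketch under stated assumptions: simulate_fence_race, the
node names and the pdu_latency parameter are made up for illustration
and are not part of any real fence agent. Both nodes are assumed to lose
contact at t=0, each waits its own delay and then sends OFF for its
peer, and the PDU is assumed to cut power pdu_latency seconds after it
accepts a command.

```python
def simulate_fence_race(delay_a, delay_b, pdu_latency):
    """Toy model (hypothetical): return the set of nodes that end up off.

    delay_a / delay_b: seconds each node waits before sending OFF.
    pdu_latency: seconds between the PDU accepting OFF and cutting power.
    """
    # Time at which each node sends the OFF command for its peer.
    issue_time = {"nodeA": delay_a, "nodeB": delay_b}
    peer = {"nodeA": "nodeB", "nodeB": "nodeA"}

    # The node with the shorter delay always gets its command in.
    first = min(issue_time, key=issue_time.get)
    second = peer[first]
    powered_off = {second}

    # The slower node can still fire if it has not lost power yet.
    # A tie at the cutoff is counted as "still able to send".
    cutoff = issue_time[first] + pdu_latency
    if issue_time[second] <= cutoff:
        powered_off.add(first)

    return powered_off


if __name__ == "__main__":
    # No delay on either side: both nodes fence each other.
    print(sorted(simulate_fence_race(0, 0, pdu_latency=5)))
    # nodeB waits longer than the PDU needs: only nodeB is fenced.
    print(sorted(simulate_fence_race(0, 10, pdu_latency=5)))
```

With equal delays both nodes end up off whatever value pdu_latency has,
which is the race described above. In this model the delay only breaks
the tie when it is longer than the time the PDU needs to cut power.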

Fabio

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster