Re: Power based fencing in cluster causes single point of failure that can take down a cluster

Jonathan Biggar <jon@xxxxxxxxxxx> · Tue, 09 Jan 2007 11:22:10 -0800

Josef Whiter wrote:
You can either have redundant fence devices, or look into qdisk.

Thanks for the reply.  Can you explain how qdisk would solve the
problem?  It seems to me that qdisk wouldn't help when the fencing
device fails and takes the cluster member down with it.

Does qdisk have some feedback mechanism that tells the cluster that it's 
ok to restart the failed services on another node without fencing being 
successful?  I can't see how that can work reliably and still prevent 
split brain problems.

On Tue, Jan 09, 2007 at 10:50:53AM -0800, Jonathan Biggar wrote:
If we set up a cluster and use network power switches for fencing, won't 
the failure of the power switch attached to a cluster member cause all 
services that were running on that node to fail to migrate to other 
cluster members?

This is what we see in practice: fencing the offline member fails
because the power switch is unavailable, so rgmanager never migrates
the failed service(s) to another member.
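
As a rough illustration of the behaviour described above, here is a
minimal sketch of recovery that is gated on successful fencing, with
an ordered list of fence devices as the redundancy suggested earlier
in the thread.  This is not rgmanager's or fenced's actual code; the
names fence_node, recover_services, dead_switch and working_switch
are hypothetical and exist only for this example.

```python
from typing import Callable, List, Sequence

# Hypothetical type for this sketch: a fence agent takes a node name and
# returns True only when it has confirmed that the node is powered off.
FenceAgent = Callable[[str], bool]


def fence_node(node: str, agents: Sequence[FenceAgent]) -> bool:
    """Try each fence agent in order; stop at the first confirmed success."""
    for agent in agents:
        try:
            if agent(node):
                return True
        except OSError:
            # The fence device itself is unreachable; fall through to the next.
            continue
    return False


def recover_services(
    node: str, services: Sequence[str], agents: Sequence[FenceAgent]
) -> List[str]:
    """Return the services that may be restarted on another member.

    Nothing is relocated unless fencing succeeded, because an unfenced
    node might still be running the services (split brain).
    """
    if not fence_node(node, agents):
        return []
    return list(services)


def dead_switch(node: str) -> bool:
    # Stand-in for a power switch that failed together with the node.
    raise ConnectionError("power switch unreachable")


def working_switch(node: str) -> bool:
    # Stand-in for a second, independent fence device.
    return True


if __name__ == "__main__":
    print(recover_services("node1", ["httpd"], [dead_switch]))
    print(recover_services("node1", ["httpd"], [dead_switch, working_switch]))
```

With only the failed switch configured, the first call returns an
empty list, so nothing is migrated.  With a second device listed,
fencing succeeds and the service is released for recovery.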

Is there a general solution to this problem that I'm missing?

--
Jon Biggar
Levanta
jon@xxxxxxxxxxx
650-403-7252
