On Tue, 2006-11-21 at 08:59 +0200, Janne Peltonen wrote:
> Hi!
>
> I started wondering what happens if my fence device is broken. The
> scenario:
>
> -a node (running a service) fails
> -another node notices the lost heartbeats and tries to fence the failed
>  node
> -however, the fence device doesn't respond
> -...what now?

Fencing retries forever. You can build redundant fencing if you're
worried about it.

> Not good. If the active node fails, and the fence device fails at the
> same time - for example, if the active node is a Xen guest and the host
> Xen fails, or if the active node loses power because the network power
> switch fails or because the iLO gets confused - the service is lost.
> The Xen scenario doesn't even seem too far-fetched...

[except for VMs; see below]

This is an unrecoverable double failure, because there is no certainty
as to the cause. For example, if your power switch loses power, it
appears exactly the same to the cluster as unplugging the network cable
to both the node and the power switch.

We solve the virtual machine situation by:

(a) requiring that the host nodes where the VM cluster resides be
    members of a cluster and have fencing of their own, and

(b) storing the last-known location of the VM in an AIS checkpoint.

If the VM crashes, we simply ask the host cluster to fence the VM. The
owner of the VM responds and issues the equivalent of 'xm destroy'.

If the physical node has crashed, the physical cluster will notice that
it has crashed and fence it. When a fencing request then comes in for a
VM which was previously running on that node, another node in the
physical cluster can respond that the VM has also been fenced (because
the cluster knows the last-known location of the VM, and that node has
been fenced).

-- Lon

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
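
For illustration only, the VM fencing decision described above could be sketched roughly as follows. This is not the actual cluster code: the HostCluster class, the HostState values, and the plain dictionaries standing in for cluster membership and the AIS checkpoint are all hypothetical names made up for this sketch.

```python
from dataclasses import dataclass, field
from enum import Enum


class HostState(Enum):
    ALIVE = "alive"      # host is a live member of the physical cluster
    FENCED = "fenced"    # host crashed and was fenced successfully
    UNKNOWN = "unknown"  # host is unresponsive but not yet fenced


@dataclass
class HostCluster:
    """Hypothetical stand-in for the physical (host) cluster's view."""
    host_state: dict      # host name -> HostState
    last_location: dict   # VM name -> last-known host (the "checkpoint")
    destroyed: list = field(default_factory=list)

    def fence_vm(self, vm: str) -> bool:
        """Return True once the VM is known dead, False if we must retry."""
        host = self.last_location.get(vm)
        state = self.host_state.get(host, HostState.UNKNOWN)
        if state is HostState.ALIVE:
            # The owner responds and kills the guest (like 'xm destroy').
            self.destroyed.append((host, vm))
            return True
        if state is HostState.FENCED:
            # The host is already dead, so the VM that ran on it is too.
            return True
        # Host fate unknown: nothing can be confirmed, so keep retrying.
        return False


cluster = HostCluster(
    host_state={"host1": HostState.ALIVE, "host2": HostState.FENCED},
    last_location={"vm-a": "host1", "vm-b": "host2", "vm-c": "host3"},
)
print(cluster.fence_vm("vm-a"))  # owner is alive and destroys the guest
print(cluster.fence_vm("vm-b"))  # host already fenced, VM counts as fenced
print(cluster.fence_vm("vm-c"))  # host state unknown, caller must retry
```

The last case is the double failure from the original question: while the host's fate is unknown, the request cannot be answered and fencing simply keeps retrying.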