Hi Matteo,

First off, you are correct: strictly from a "SPF protection / all other failure scenarios are irrelevant" point of view, losing power -> fencing failure is bad. However, I hope I can convince you that this particular view is not the right one to take in this case, though I doubt I will be able to.

On Wed, 2006-03-22 at 17:17 +0100, Matteo Catanese wrote:
> We are always talking about avoiding _single point of failure_, not
> multiple ones.

We recover from several multi-point failures if there is a deterministic way to do so - for example, sustaining the failure of 5 nodes in a 16-node cluster. More so than NSPF, the cluster is designed to minimize uncertainty in any failure case where possible - especially where data integrity is concerned (i.e. fencing).

Given the above design goal, one can still very easily build NSPF two-node clusters, but there are limitations on the hardware you can use. For example:

* With iLO, you need redundant power supplies.

* With IPMI, you need redundant power supplies and an extra NIC.

* With single power supplies, you should use a remote power switch with redundant power rails (where the internal electronics can run off of either) for full NSPF protection. As of this writing, I am unaware of any such thing being available from any of the major IHVs.

* If redundant power supplies are not "redundant enough" in your opinion, then you should probably use a redundant remote power switch as noted above.

> So please at least for fence_ilo allow some parameter to let fence
> spit out a warning and unlock the cluster service

Fencing, put simply, is a deterministic set of steps taken to guarantee that a dead or misbehaving node can not (not "might not" or "probably will not") access shared resources/partitions/storage. It is designed to have exactly two possible outcomes given a correctly configured environment:

- The node has been cut off from shared resources, or
- Fencing the node has failed.

If fencing fails, we retry forever; fencing failures are otherwise unrecoverable. The only way to recover from a particular fencing failure is to provide a different fencing mechanism as a backup (a "cascade").

Ok, on to how one could change the behavior...

From a design perspective, if we were to change the behavior of fencing, I would recommend changing it in fenced, not fence_ilo (e.g. give fenced a max_retries count or something), because once we do it for iLO, we will have to do it for many other agents. For example, most or all of the supported APC switches have only a single (non-redundant) power rail, so fence_apc would have to be changed too.

Here are some things you can do for your configuration:

(a) Add a human layer. Add a manual fencing agent as a cascade to catch this particular problem. This is, in my opinion, the least likely to solve your problem in the way you want, but it may be acceptable if you consider a total power failure of a node fairly unlikely.

(b) Make fencing not fail. Edit /sbin/fence_ilo and make it do what you need (a rough sketch of one way to do this is at the end of this message).

(c) Roll your own fencing agent and add it as a cascade which does specifically what you want when iLO fencing fails - for example, /sbin/fence_dontcare (don't forget to chmod +x it):

    #!/bin/bash
    # Last-resort agent: log loudly, notify a human, and always report
    # success to fenced. (fenced passes agent options on stdin; this
    # agent deliberately ignores them.)
    logger -p daemon.emerg "WARNING - iLO fencing failed; data integrity may be compromised, but continuing anyway."
    echo "Ruh roh!" | mail -s "fencing failed" my@xxxxxxxxxx
    exit 0

Don't forget to add fence references to your cluster.conf (a sketch of this is also at the end of this message).

(d) Buy a redundant external power switch and use it as a cascade (or as the primary fencing method) in case iLO is unreachable. Here is a WTI NPS on eBay for $125:

http://cgi.ebay.com/WTI-NPS-115-Remote-Telnet-Power-Reboot-NIB-Switch_W0QQitemZ9701395350QQcategoryZ11175QQssPageNameZWDVWQQrdZ1QQcmdZViewItem

The NPS has two power rails, and the internal electronics can run off of either. I.e., you can actually build an NSPF configuration with nodes that lack redundant power supplies - without having to weaken any guarantees about data integrity. (Note: the NPS 115 is past its end of life; WTI has a replacement, but it will cost more than $125.)
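If you go the route of (b), here is a minimal sketch of one way to do it without editing the agent's internals: move the stock agent aside and drop in a wrapper. Note that /sbin/fence_ilo.real is my invention for this example - nothing in the package creates it; you would have to move the real agent there yourself:

    #!/bin/bash
    # Hypothetical replacement for /sbin/fence_ilo: run the real agent
    # (moved aside by hand to fence_ilo.real), but never report failure.
    # fenced passes agent options on stdin; the child inherits it.
    /sbin/fence_ilo.real "$@"
    rc=$?
    if [ $rc -ne 0 ]; then
        logger -p daemon.emerg "fence_ilo failed (rc=$rc); reporting success to fenced anyway"
    fi
    exit 0

Needless to say, this throws away the data-integrity guarantee described above, which is exactly why it is not a switch we ship.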
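And here is roughly what the cascade wiring for (c) looks like in cluster.conf: fenced tries a node's methods in order, falling through to method "2" only if method "1" fails. The node names, device names, and fence_ilo attributes below are illustrative only, not copied from a working config - check the fence_ilo man page for the exact attribute names your version expects:

    <cluster name="example" config_version="2">
      <clusternodes>
        <clusternode name="node1" votes="1">
          <fence>
            <method name="1">
              <device name="node1-ilo"/>
            </method>
            <method name="2">
              <device name="dontcare"/>
            </method>
          </fence>
        </clusternode>
        <!-- node2 is configured the same way -->
      </clusternodes>
      <fencedevices>
        <fencedevice name="node1-ilo" agent="fence_ilo"
                     hostname="node1-ilo.example.com" login="Administrator"
                     passwd="secret"/>
        <fencedevice name="dontcare" agent="fence_dontcare"/>
      </fencedevices>
    </cluster>

The same layout works for (d); just point method "2" at your power switch's agent (e.g. fence_wti) instead of fence_dontcare.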
-- Lon