On Tue, 2006-03-07 at 17:04 +0100, Matteo Catanese wrote:
> Result: One node perfectly up but cluster service stalled

Fencing never completes because iLO does not have power.  This is an
architectural limitation of using iLO (or IPMI, for that matter) as the
sole fencing method in a cluster environment.  Compare with RSA, which
can have its own external power supply even though it is an integrated
solution like iLO.

With redundant power supplies, the expectation is that different
circuits (or, preferably, entirely different power sources) are used,
which should make the tested case significantly less likely to occur.

> Switch time: 55 seconds (+ oracle startup time).

Hrm, the backup node should take over the service once the primary node
is confirmed 'dead', i.e. after fencing is complete.  It certainly
should not be waiting around for the other node to come back to life.
What do your fence and service configurations look like, and were there
any obvious log messages which might explain the odd behavior?

> Cluster is stalled
>
> Can you change fence behaviour to be less "radical" ?
>
> If ILO is unreachable means that machine is already off and could not
> be powered on so fence shold spit out a warning and let the failover
> happen

iLO being unreachable means only that iLO is unreachable; assumptions
as to why should probably not be limited to lack of power.  Routing
problems, a bad or disconnected network cable, and the occasional
infinite iLO-DHCP loop will all make iLO unreachable, but in no way
confirm that the node is dead.

More to the point, though, you can get around this particular behavior
(fencing on startup -> hang because fencing fails) by starting fenced
with the clean-start parameter.  In a two-node cluster, this is useful
for starting things up in a controlled way when you know you won't be
able to fence the other node.  I think it's:

   fence_tool join -c

If you (the administrator) are sure that the node is dead and does not
have any services running, this causes fenced to not fence the other
node on startup, thereby avoiding the hang entirely.  However, doing
this automatically is unsafe: if both nodes boot while a network
partition exists between them, the cluster will end up split-brain.

-- Lon

--
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
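
For reference, a minimal sketch of that clean-start sequence on the
surviving node, assuming RHEL 4-style cluster init scripts; the exact
service names, ordering, and the -c flag may differ on your release,
so check fence_tool(8) before relying on it:

   # Only after confirming by hand that the peer node is powered off
   # and is not running any cluster services:
   service ccsd start        # start the cluster configuration daemon
   cman_tool join            # join the cluster manager
   fence_tool join -c        # join the fence domain without fencing
                             # the missing node ("clean start")
   service rgmanager start   # then bring up resource groups/services

Do not automate this; as noted above, skipping startup fencing is only
safe when a human has verified that the other node is really down.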