My question is in reference to node failures using fence_manual.

From 'man fenced':

    Node failure
        When a domain member fails, the actual fencing must be completed
        before GFS recovery can begin. This means any delay in carrying out
        the fencing operation will also delay the completion of GFS file
        system operations; most file system operations will hang during
        this period.

So this is what I'm seeing now when a node fails, i.e. the rest of the nodes notice that the heartbeats of a certain node A have timed out. Node A is fenced by the remaining nodes, and the file system is hung.

My questions are:

1) Can I call fence_ack_manual as soon as I see that node A is fenced, or do I have to wait for node A to reboot, come back, and join the cluster?

2) If I set post_fail_delay to -1, the fence daemon waits indefinitely for the failed node to rejoin the cluster, which it seems to be doing, so is this the default? The man page shows:

    <fence_daemon post_fail_delay="0">

So with my assumption of the delay being 0, I expected the node to be fenced instantly on timeout, recovery to begin and complete, and my file system to be usable by the rest of the nodes in a relatively short time.

I guess if the answer to 1) is that this recovery is done manually with fence_ack_manual, then it all makes sense.

thanks,
dan

--
Linux-cluster@xxxxxxxxxx
http://www.redhat.com/mailman/listinfo/linux-cluster