...answered my own question...or, the helpful message answered my
question. I can reset the node by hand and then acknowledge it using
fence_ack_manual.

  Node blade09 needs to be reset before recovery can procede. Waiting
  for blade09 to rejoin the cluster or for manual acknowledgement that
  it has been reset (i.e. fence_ack_manual -n blade09)

On 12, May, 2005, Dan B. Phung declared:

> My question is in reference to node failures using fence_manual.
>
> From 'man fenced':
>
> Node failure
>   When a domain member fails, the actual fencing must be completed before
>   GFS recovery can begin. This means any delay in carrying out the
>   fencing operation will also delay the completion of GFS file system
>   operations; most file system operations will hang during this period.
>
> So this is what I'm seeing now when a node fails, i.e. the rest of the
> nodes notice that the heartbeats of a certain node A have timed out. Node A
> is fenced by the remaining nodes, and the file system is hung. My
> questions are:
>
> 1) can I call fence_ack_manual right when I see that node A is fenced, or
>    do I have to wait for node A to reboot, come back, and join the cluster?
>
> 2) if I set the post_fail_delay to -1, the fence daemon waits indefinitely
>    for the failed node to rejoin the cluster, which it seems to be doing,
>    so is this the default? The man page shows:
>    <fence_daemon post_fail_delay="0">
>
> So with my assumption of the delay being 0, I expected the node to be
> fenced instantly on timeout, recovery to begin and complete, and my file
> system for the rest of the nodes to be usable in a relatively short time.
> I guess if the answer to 1) is that this recovery is done manually with
> the fence_ack_manual, then it all makes sense.
>
> thanks,
> dan
>
> --

--
Linux-cluster@xxxxxxxxxx
http://www.redhat.com/mailman/listinfo/linux-cluster