How are you fencing? I noticed a condition on certain Brocade switches
where the fence_brocade script effectively kills the entire switch.

> -----Original Message-----
> From: linux-cluster-bounces@xxxxxxxxxx
> [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of David Teigland
> Sent: Wednesday, June 08, 2005 10:04 PM
> To: Dan B. Phung
> Cc: Linux-cluster@xxxxxxxxxx
> Subject: Re: failed node causes all GFS systems to hang
>
> On Wed, Jun 08, 2005 at 05:46:26PM -0400, Dan B. Phung wrote:
>
> > I think I'm doing something terribly wrong here, because if one of my
> > nodes goes down, the rest of the nodes connected to GFS are hung in
> > some wait state. Specifically, only those nodes running fenced are
> > hosed. These machines are not only blocked on the GFS file system,
> > but the local file system stuff is hung as well, which requires me to
> > reboot everybody connected to GFS. I have one node not running fenced
> > to reset the quorum status, so that doesn't seem to be the problem.
> >
> > I updated from the cvs sources -rRHEL4 last Friday, so I have up to
> > date stuff. I'm running kernel 2.6.9 and fence_manual. I remember a
> > couple of weeks back that when a node went down, I simply had to
> > fence_ack_manual the node, but that message never comes up anymore...
>
> The joys of manual fencing, we do debate sometimes whether it's more
> troublesome than helpful for people.
>
> When a node fails, you need to run fence_ack_manual on one of the
> remaining nodes, specifically, whichever remaining node has a
> fence_manual notice in /var/log/messages. So, you need to monitor
> /var/log/messages on the remaining nodes to figure out where you need
> to run fence_ack_manual (it will generally be the remaining node with
> the lowest nodeid, see cman_tool nodes).
>
> If the failed node caused the cluster to lose quorum, then it's a
> different story. In that case you need to get some nodes back into
> your cluster (cman_tool join) to regain quorum before any kind of
> fencing will happen.
>
> GFS is going to be blocked everywhere until you run fence_ack_manual
> for the failed node. If there are no manual fencing notices anywhere
> for the failed node, then maybe you lost quorum (see cman_tool
> status), or something else is wrong. I don't know why your local fs
> would be hung.
>
> Dave
>
> --
> Linux-cluster@xxxxxxxxxx
> http://www.redhat.com/mailman/listinfo/linux-cluster
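
The log-watching step Dave describes (finding which remaining node has a fence_manual notice in /var/log/messages) could be sketched roughly as below. This is only an illustration, not something from the thread: the function names are made up, and it assumes nothing about the notice beyond the fact that the line mentions "fence_manual". The exact wording of the real log message is not assumed.

```python
from typing import Iterable, List

# Substring taken from the advice in the thread. The exact wording of the
# real notice is NOT assumed; any line mentioning fence_manual is reported.
MARKER = "fence_manual"


def find_fence_manual_notices(lines: Iterable[str]) -> List[str]:
    """Return the lines that mention fence_manual, without trailing newlines."""
    return [line.rstrip("\n") for line in lines if MARKER in line]


def scan_log(path: str = "/var/log/messages") -> List[str]:
    """Read a syslog-style file and return its fence_manual lines.

    The default path is the one named in the thread; both function names
    are hypothetical helpers for this sketch.
    """
    with open(path, "r", errors="replace") as handle:
        return find_fence_manual_notices(handle)


if __name__ == "__main__":
    try:
        notices = scan_log()
    except OSError as exc:
        # The log may be missing or unreadable without root privileges.
        print(f"could not read log: {exc}")
    else:
        for notice in notices:
            print(notice)
```

Run on each remaining node, this would print any fence_manual lines from that node's log. A node that prints something is presumably the one where fence_ack_manual needs to be run. If no node prints anything, checking quorum with cman_tool status is the next step, as suggested above.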