We have a GFS cluster with 12 data nodes and 3 lock servers, running Red Hat AS3 U7 with GFS-6.0.2.30-0.

The data nodes all access a SAN disk. The SAN fabric is divided into two independent halves, called Red and Blue, with half the data nodes on each. The data nodes access only one disk, reachable via either SAN. There are other clients, other clusters and other disks sharing the SAN.

Recently a faulty HBA was plugged into a machine, not part of our cluster, and connected to the Red SAN. At this point the Red SAN failed, with two main, more or less immediate, results:

One of the Red SAN nodes became very busy. Presumably it was holding a fairly big GFS lock at the time, but it continued to hold the lock and to send heartbeats. The node gave the appearance of being hung.

The rest of the Red SAN nodes, over a period of a few minutes, all presumably did some IO to the disk and presumably got into a busy-wait state so tight that they stopped sending heartbeats, and got fenced (APC PDUs). On reboot these nodes could see the SAN as normal, except that they could not see their SAN disk. Nor could they see another disk added to the SAN as part of the debugging attempted later. Many attempts were made to make the disk reappear, mostly by rebooting, or by shutting down GFS and rmmod-ing and modprobe-ing qla2300. Everything looked quite normal, except that the Red SAN would not let any of our nodes see our disk.

On the Blue SAN all the machines became very busy, presumably because of the one Red SAN machine holding the lock. These nodes were also thought to be hung, but none of them were rebooted, as it was discovered that they were still exporting an important Web tree that was not on a GFS disk. (They sprang back to life when the one lock-holding Red SAN machine was rebooted, which was well after the Red SAN problem was fixed.)

This state of affairs lasted 12 hours. Fixing it was made difficult because, to anyone looking at the problem, it appeared that the entire SAN and the entire cluster were down. Very little that we saw at the time indicated that only the Red SAN had failed. (Hindsight is wonderful.) This was particularly unfortunate: the justification for installing GFS was resilience in the face of hardware failure (especially no single point of failure).

So finally, here are my questions:

Is it really reasonable for a machine to hang onto a lock for 12 hours?

Would it be possible for a GFS machine to detect that it can no longer do IO to its GFS disk and release any locks it holds, perhaps by fencing itself? (I'm thinking of adding a cron job that forks a subprocess that does an IO to the GFS disk; the parent could shut down the node, leading to a fence, if the child takes more than a minute. A rough sketch is in the P.S. below.)

Have I made any mistakes in my guesses and presumptions?

Keith Lewis
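
P.S. Something like the following is the watchdog I have in mind. It is only a sketch and untested; the probe path (/gfs/.watchdog), the 60-second timeout, and the shutdown command are placeholders to be adjusted for the real cluster. Cron would run it every few minutes on each data node.

#!/usr/bin/env python
# gfs_io_watchdog.py -- run from cron every few minutes on each data node.
#
# Forks a child that does a small synchronous write on the GFS mount.
# If the child has not finished within TIMEOUT seconds, the parent
# assumes this node can no longer do IO to the GFS disk and shuts the
# node down, so that the rest of the cluster fences it and recovers its
# locks instead of waiting hours.

import os
import signal
import sys
import time

PROBE_FILE = "/gfs/.watchdog"   # any path on the GFS filesystem (placeholder)
TIMEOUT = 60                    # seconds to wait for the probe IO to finish

def probe():
    # Child: O_SYNC write so we really touch the disk, not just the cache.
    fd = os.open(PROBE_FILE, os.O_WRONLY | os.O_CREAT | os.O_SYNC)
    os.write(fd, ("ping %d\n" % time.time()).encode())
    os.close(fd)

def main():
    pid = os.fork()
    if pid == 0:                # child does the IO and exits
        probe()
        os._exit(0)

    deadline = time.time() + TIMEOUT
    while time.time() < deadline:
        done, _status = os.waitpid(pid, os.WNOHANG)
        if done == pid:
            sys.exit(0)         # probe completed in time; nothing to do
        time.sleep(1)

    # The probe is stuck: kill it and take the node down so the cluster
    # fences us rather than letting us sit on locks indefinitely.
    os.kill(pid, signal.SIGKILL)
    os.system("/sbin/shutdown -h now")

if __name__ == "__main__":
    main()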