isplist@xxxxxxxxxxxx wrote:
First of all, is there a way I can test whether my Brocade switch is
actually doing any fencing at all? I get the sense it's doing nothing.
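One way to sanity-check this by hand, assuming the stock cman fence tools
are installed and "node2" stands in for one of my real node names from
cluster.conf, would be something like:

  # run from a surviving node; "node2" is a placeholder
  fence_node node2
  # then look at the port state on the Brocade itself (e.g. with
  # "switchshow" in the Fabric OS CLI) to see whether the node's
  # port really got disabled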
I think this because my cluster is terribly unstable. If I reboot a node,
that's fine; the cluster stays up. However, if one of the nodes crashes in
any manner, it takes down everything, to the point of having to shut down
every machine and start them all again one at a time.
If a drive gets moved on my FC storage, the cluster crashes. If the storage
is rebooted, the cluster crashes. If I change pretty much anything on the
storage, the cluster crashes; it's nuts. The way it seems to start is that one
node has a kernel panic, which sets off the rest.
I know this is limited information, but I need somewhere to start. I can't even
begin to think of using this in a production environment; no one would get any
sleep watching over it to make sure it's all up :).
Mike
This almost sounds like the RSCN problem I tried to chase down a while
back. In a nutshell, something changes on the SAN and an RSCN (Registered
State Change Notification) event occurs, which is seen by all nodes on the
SAN. The RSCN event should be completely harmless, but I have seen it kill
all the FC I/O paths, and that would be bad. I would expect the cluster
itself to stay up, but nodes would withdraw from the filesystem as soon as
they lost the I/O path.
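If you want to see whether the I/O paths are really dropping when this
happens, one thing worth looking at (assuming the HBA driver registers with
the kernel's fc_host transport class) is the port state in sysfs, something
like:

  # "Online" means the path is healthy; "Linkdown" or similar
  # right after a SAN change would point at lost I/O paths
  grep . /sys/class/fc_host/host*/port_state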
Are you using Qlogic HBAs? If so, check /var/log/messages for any "SCSI
errors".
What you are seeing could be unrelated, but the symptoms sound roughly
the same.
Ryan
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster