isplist@xxxxxxxxxxxx wrote:
First of all, is there a way I can test whether my Brocade switch is
actually doing any fencing at all? I get the sense it's doing nothing.
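One way to sanity-check this by hand, assuming the stock cman fence tools
are installed and "node2" stands in for one of my real node names from
cluster.conf, would be something like:

  # run from a surviving node; "node2" is a placeholder
  fence_node node2
  # then look at the port state on the Brocade itself (e.g. with
  # "switchshow" in the Fabric OS CLI) to see whether the node's
  # port really got disabled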
I think this because my cluster is terribly unstable. If I reboot a node,
that's fine; the cluster stays up. However, if one of the nodes crashes in
any manner, it takes down everything, to the point of having to shut down
every machine and start them all again one at a time.
If a drive gets moved on my FC storage, the cluster crashes. If the storage
is rebooted, the cluster crashes. If I change pretty much anything on the
storage, the cluster crashes; it's nuts. The way it seems to start is that one
node has a kernel panic, which sets off the rest.
I know this is limited information, but I need somewhere to start. I can't even
begin to think of using this in a production environment; no one would get any
sleep watching over it to make sure it's all up :).
Mike
This almost sounds like the RSCN problem I tried to chase down a while
back. In a nutshell, something changes on the SAN and an RSCN (Registered
State Change Notification) event occurs, which is seen by all nodes on the
SAN. The RSCN event should be completely harmless, but I have seen it kill
all the FC I/O paths, and that would be bad. I would expect the cluster
itself to stay up, but nodes would withdraw from the filesystem as soon as
they lost the I/O path.
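If you want to see whether the I/O paths are really dropping when this
happens, one thing worth looking at (assuming the HBA driver registers with
the kernel's fc_host transport class) is the port state in sysfs, something
like:

  # "Online" means the path is healthy; "Linkdown" or similar
  # right after a SAN change would point at lost I/O paths
  grep . /sys/class/fc_host/host*/port_state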
Are you using Qlogic HBAs? If so, check /var/log/messages for any "SCSI
errors".
What you are seeing could be unrelated, but the symptoms sound roughly
the same.
Ryan
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster