GFS withdraw and/or node I/O errors affect whole cluster?

We're running a 14-node cluster on CentOS 5.2 with GFS1.
 
We've observed a problem in production that caused us to perform an
unplanned cluster restart.  We also reproduced similar behavior in a lab
environment.
 
If one node loses its connection to shared storage, it can no longer
perform any filesystem activity.  The GFS filesystem may decide to
withdraw.  That's expected.
 
The same node that withdraws does not get fenced.  Since the cluster
itself depends on networking and not storage, and cluster services other
than GFS may be active, that's not surprising.
 
When one node withdraws or otherwise fails on a GFS mount without
getting fenced, other nodes freeze when attempting to access the same
filesystem.  That's unexpected.
 
For a high-availability cluster, this can be a bad thing, because it
isn't handled automatically and effectively causes a cluster-wide
outage.  Does this sound right?  How can we mitigate or prevent such
outages?  Are there relevant configuration settings I've missed?
 
Thanks for any insight.
 
Jeff

