On Tue, 2008-09-09 at 21:19 +0200, Gerhard Spiegl wrote:
> Hello Kevin,
>
> thanks for your reply.
>
> Kevin Anderson wrote:
> > On Tue, 2008-09-09 at 17:51 +0200, Gerhard Spiegl wrote:
> >> Hello all!
> >>
> >> We are trying to set up a 20 node cluster and want to use a
> >> "ping-heuristic" and a heuristic that checks the state of
> >> the Fibre Channel ports.
> >
> > What actions do you want to take place based on these heuristics?
>
> The node should get fenced (or fence/reboot itself) if the public
> interface (bond0) loses connection, or if both paths (dm-multipath)
> to the storage are lost.
>
> Without a quorum device:
> We faced the problem that a complete loss of storage connectivity
> causes GFS to withdraw (only when I/O is issued on it; we only use
> GFS for Xen VM definition files), leaving GFS and CLVM locked up and
> never released. Only a manual reboot/halt resolves the situation (in
> addition, the specific node gets fenced after poweroff - a trifle too
> late ;)).

You can avoid the withdraw and force a panic by using the debug mount
option for your GFS filesystems. With debug set, GFS will panic the
system when it hits an I/O error, effectively self-fencing the node.
The reason behind withdraw was to give the operator a chance to
gracefully remove the node from the cluster after a filesystem
failure. This is useful when multiple filesystems are mounted from
multiple storage devices. A withdraw always requires rebooting the
node to recover. However, in your case, the panic action is probably
what you want. We recently opened a bugzilla for a new feature that
will give you better control of the options in this case:

https://bugzilla.redhat.com/show_bug.cgi?id=461065

Anyway, the debug mount option should avoid the situation you are
describing.

> > If you are worried about split sites, just add an additional node
> > to the cluster that is at some other location. The node would only
> > be used for quorum votes.
>
> I am not sure what you mean with split sites (split brain?), but
> that's not the issue. Do you mean an additional node without any
> service or failoverdomain configured?

With split sites and an even number of nodes, you could end up in the
situation that if an entire site goes down, you no longer have cluster
quorum. Having an extra node, and therefore an odd number of nodes, in
the cluster would enable the cluster to continue to operate at the
remaining site. But that is not the problem you were trying to solve
in this case.

Try -o debug on your GFS mount options (a few sketches are appended
below).

Thanks
Kevin
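P.S. A minimal sketch of what the debug option looks like in practice;
the device and mount point names here are only placeholders for your
own:

    # one-off mount with the debug option set
    mount -t gfs -o debug /dev/vg_xen/lv_vmconfigs /xenconfigs

    # or persistently via /etc/fstab
    /dev/vg_xen/lv_vmconfigs  /xenconfigs  gfs  defaults,debug  0 0

With that in place, an I/O error on the filesystem should panic the
node rather than withdraw, as described above.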
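On your original heuristics question: the usual place for a ping
heuristic is a qdiskd <quorumd> stanza in cluster.conf, roughly along
these lines. This assumes you have already created a quorum partition
(e.g. with mkqdisk -c <device> -l qdisk); the gateway address and the
FC port check script are hypothetical and you would supply your own.
See qdisk(5) for the exact semantics of score, interval and tko:

    <quorumd interval="1" tko="10" votes="1" label="qdisk">
        <!-- ping the default gateway on the public (bond0) network -->
        <heuristic program="ping -c1 -w1 192.168.0.1"
                   score="1" interval="2" tko="3"/>
        <!-- hypothetical local script that checks the FC port state -->
        <heuristic program="/usr/local/sbin/check_fc_ports.sh"
                   score="1" interval="2" tko="3"/>
    </quorumd>

A node whose heuristic score drops below the minimum loses its quorum
device vote and, depending on how qdiskd is configured, can end up
rebooted or evicted and fenced by the rest of the cluster.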
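And on the split-sites point: the additional node I mentioned is just
an ordinary cluster member at a third location that carries a vote but
runs no services. Nothing special is needed in cluster.conf beyond one
more <clusternode> entry (name and nodeid below are placeholders, and
its fencing configuration is omitted here):

    <clusternode name="tiebreaker.example.com" nodeid="21" votes="1"/>

Because it is not part of any failoverdomain and has no services
assigned, it only contributes to quorum.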