Hi Gordan,

you will need to configure fencing so that failed nodes are kicked out of the cluster automatically. See [1] and [2]. As long as the cluster has not been informed that the failed node was fenced successfully, it freezes the GFS volumes to prevent data corruption. In the case of a shared root cluster (are you using Open-Sharedroot?) the root filesystem is affected as well, because it is also formatted with GFS. Once a node has been fenced successfully (and the remaining nodes have been informed of it), the cluster resumes its activity. This is a protective measure.

Manual fencing, which should only be used in early test environments (like [3]), requires a manual fence acknowledgement. In the case of Open-Sharedroot you need access to a special shell called fenceackshell in order to resolve a manual fencing operation in progress:

---%----------------------------------------------------------------------------------------------------
[root@axqa02rc_1 ~]# telnet axqa02rc_2 12242
Trying 192.168.25.25...
Connected to axqa02rc_2 (192.168.25.25).
Escape character is '^]'.
Username: root
Password: somepassword
Fenceacksv Version $Revision: 1.7 $
Linux axqa02rc_2<singleserver> 2.6.9-55.0.9.ELsmp 89713 x86_64
FENCEACKSV axqa02rc_2<singleserver>$ shell
SHELL FENCEACKSV axqa02rc_2<singleserver>$ cd /sbin
SHELL FENCEACKSV axqa02rc_2<singleserver>$
SHELL FENCEACKSV axqa02rc_2<singleserver>$ ./fence_ack_manual
Usage: fence_ack_manual [options]

Options:
  -h               usage
  -O               override
  -n <nodename>    Name of node that was manually fenced
  -s <ip>          IP address of machine that was manually fenced (deprecated)
  -V               Version information
---%----------------------------------------------------------------------------------------------------
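To come back to your actual question: automatic fencing is configured in /etc/cluster/cluster.conf, where you define the fence devices and a per-node fence method. The following is only a minimal sketch for illustration, assuming your nodes have IPMI-capable power management (agent fence_ipmilan); the cluster name, device names, IP address and credentials are placeholders, and your hardware may call for a different fence agent (see [2] for the available agents and their parameters). The post_fail_delay attribute of <fence_daemon> gives you the kind of grace period you asked about, here 10 seconds between a node failure and the fence action:

---%----------------------------------------------------------------------------------------------------
[root@axqa02rc_1 ~]# cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster name="axqa02" config_version="2">
    <!-- wait 10 seconds after a node failure before fencing it -->
    <fence_daemon post_fail_delay="10" post_join_delay="3"/>
    <clusternodes>
        <clusternode name="axqa02rc_1" nodeid="1" votes="1">
            <fence>
                <method name="1">
                    <device name="ipmi_rc_1"/>
                </method>
            </fence>
        </clusternode>
        <!-- the other three nodes are configured analogously -->
    </clusternodes>
    <fencedevices>
        <!-- one fence device per node; address and credentials are placeholders -->
        <fencedevice agent="fence_ipmilan" name="ipmi_rc_1"
                     ipaddr="192.168.25.101" login="admin" passwd="secret"/>
    </fencedevices>
</cluster>
[root@axqa02rc_1 ~]# fence_node axqa02rc_2   # verify: should power-cycle the node via its fence agent
---%----------------------------------------------------------------------------------------------------

With working fencing in place, the three remaining quorate nodes will fence the missing node after the delay and GFS resumes on its own; pulling the network cable, as you did, is a perfectly good way to test it.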
[1] http://en.wikipedia.org/wiki/I/O_Fencing
[2] http://www.redhat.com/docs/manuals/csgfs/admin-guide/ch-fence.html
[3] http://www.open-sharedroot.org/documentation/the-opensharedroot-mini-howto

Cheers,
Reiner

On Friday 14 December 2007 04:54:30 pm gordan@xxxxxxxxxx wrote:
> Hi,
>
> I've got most of my cluster pretty much sorted out, apart from kicking
> nodes from the cluster when they fail.
>
> Is there a way to make the node-kicking automated? I have 4 nodes. They
> are sharing 2 GFS file systems, a root FS and a data FS. If I pull the
> network cable from one of them, or just power it off, the rest of the
> cluster nodes just stop. The only way to get them to start responding
> again is to bring the missing node back, even if there are still enough
> nodes to maintain quorum (3 nodes out of 4).
>
> Can anyone suggest a way around this? How can I make the 3 remaining nodes
> just kick the missing node out of the cluster and DLM group (possibly
> after some timeout, e.g. 10 seconds) and resume operation until the node
> rejoins?
>
> This may or may not be related to the fact that I'm running a shared GFS
> root, but any pointers would be welcome.
>
> Thanks.
>
> Gordan
>
> --
> Linux-cluster mailing list
> Linux-cluster@xxxxxxxxxx
> https://www.redhat.com/mailman/listinfo/linux-cluster

--
Gruss / Regards,

Dipl.-Ing. (FH) Reiner Rottmann
Phone: +49-89 452 3538-12

http://www.atix.de/
http://open-sharedroot.org/
https://www.xing.com/profile/Reiner_Rottmann

PGP Key ID: 0xCA67C5A6
PGP Key Fingerprint = BF59FF006360B6E8D48F26B10D9F5A84CA67C5A6

**
ATIX Informationstechnologie und Consulting AG
Einsteinstr. 10
85716 Unterschleissheim
Germany

Phone: +49-89 452 3538-0
Fax: +49-89 990 1766-0

Registration court: Amtsgericht Muenchen
Registration number: HRB 168930
VAT ID: DE209485962
Management board: Marc Grimme, Mark Hlawatschek, Thomas Merz (chairman)
Chairman of the supervisory board: Dr. Martin Buss