Re: Graceful Degradation

Reiner Rottmann <rottmann@xxxxxxx> · Fri, 14 Dec 2007 17:33:57 +0100

Hi Gordan,

You will need to configure fencing so that failed nodes are kicked out
automatically. See [1] and [2].

Until the cluster has been informed that the failed node was kicked out
successfully, it freezes the GFS volumes to prevent data corruption.

In case of a Shared Root Cluster (are you using Open-Sharedroot?) the root 
filesystem is also affected because it is formatted with GFS.

Once the failed node has been fenced successfully (the remaining nodes are
informed of this), the cluster resumes its activity. The freeze in between is
a protective measure.
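
To make the freeze/resume behaviour concrete, here is a toy model in Python.
It is only a sketch of the logic described above, not how the cluster
software is actually implemented; the class and method names (ToyCluster,
node_failed, fence_confirmed) and the node names are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCluster:
    """Toy model: GFS stays frozen while any failed node is not yet fenced."""
    members: set
    pending_fence: set = field(default_factory=set)

    def node_failed(self, node):
        # A member stops responding. It has to be fenced before I/O may go on.
        self.members.discard(node)
        self.pending_fence.add(node)

    def fence_confirmed(self, node):
        # The fence agent (or a manual acknowledgement) reports success.
        self.pending_fence.discard(node)

    @property
    def gfs_frozen(self):
        # Frozen as long as at least one failed node is still unconfirmed.
        return bool(self.pending_fence)


cluster = ToyCluster(members={"node1", "node2", "node3", "node4"})
cluster.node_failed("node4")
print(cluster.gfs_frozen)   # True: waiting for the fence to be confirmed
cluster.fence_confirmed("node4")
print(cluster.gfs_frozen)   # False: the remaining nodes carry on
```

The point is that quorum alone is not enough: the three remaining nodes stay
blocked until the fourth is known to be fenced.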

Manual fencing, which should only be used in early test environments (like
[3]), requires a manual fence acknowledgement. With Open-Sharedroot you will
need access to a special shell called fenceackshell in order to resolve a
manual fencing operation that is in progress:

---%----------------------------------------------------------------------------------------------------
[root@axqa02rc_1 ~]# telnet axqa02rc_2 12242
Trying 192.168.25.25...
Connected to axqa02rc_2 (192.168.25.25).
Escape character is '^]'.
Username: root
Password: somepassword
FenceacksvVersion $Revision: 1.7 $
Linux axqa02rc_2<singleserver> 2.6.9-55.0.9.ELsmp 89713 x86_64
FENCEACKSV axqa02rc_2<singleserver>$ shell
SHELL FENCEACKSV axqa02rc_2<singleserver>$ cd /sbin
SHELL FENCEACKSV axqa02rc_2<singleserver>$ ./fence_ack_manual
Usage:

fence_ack_manual [options]

Options:
  -h               usage
  -O               override
  -n <nodename>    Name of node that was manually fenced
  -s <ip>          IP address of machine that was manually fenced (deprecated)
  -V               Version information
---%----------------------------------------------------------------------------------------------------
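
Going by the usage text above, the acknowledgement itself takes the name of
the fenced node via -n. If you want to script it, a minimal sketch in Python
could look like the following. The helper names and the node name "node4" are
hypothetical, and the sketch defaults to a dry run because the tool may ask
for confirmation and must only be run once the node is really down.

```python
import shutil
import subprocess


def build_ack_command(nodename, binary="fence_ack_manual"):
    """Build the argument list for acknowledging a manually fenced node."""
    # Refuse empty names and anything that would be parsed as an option.
    if not nodename or nodename.startswith("-"):
        raise ValueError("expected a node name, got %r" % (nodename,))
    return [binary, "-n", nodename]


def acknowledge(nodename, dry_run=True):
    """Return the command; run it only if dry_run is off and the tool exists."""
    cmd = build_ack_command(nodename)
    if dry_run or shutil.which(cmd[0]) is None:
        return cmd
    # stdin is inherited, so any confirmation prompt is answered by the admin.
    subprocess.run(cmd, check=True)
    return cmd


print(" ".join(acknowledge("node4")))   # fence_ack_manual -n node4
```

Only acknowledge a fence after you have verified by hand that the node is
powered off or cut off from the shared storage.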

[1] http://en.wikipedia.org/wiki/I/O_Fencing
[2] http://www.redhat.com/docs/manuals/csgfs/admin-guide/ch-fence.html
[3] http://www.open-sharedroot.org/documentation/the-opensharedroot-mini-howto

Cheers,
Reiner

On Friday 14 December 2007 04:54:30 pm gordan@xxxxxxxxxx wrote:
> Hi,
>
> I've got most of my cluster pretty much sorted out, apart from kicking
> nodes from the cluster when they fail.
>
> Is there a way to make the node-kicking automated? I have 4 nodes. They
> are sharing 2 GFS file systems, a root FS and a data FS. If I pull the
> network cable from one of them, or just power it off, the rest of the
> cluster nodes just stop. The only way to get them to start responding
> again is to bring the missing node back, even if there are still enough
> nodes to maintain quorum (3 nodes out of 4).
>
> Can anyone suggest a way around this? How can I make the 3 remaining nodes
> just kick the missing node out of the cluster and DLM group (possibly
> after some timeout, e.g. 10 seconds) and resume operation until the node
> rejoins?
>
> This may or may not be related to the fact that I'm running a shared GFS
> root, but any pointers would be welcome.
>
> Thanks.
>
> Gordan
>
> --
> Linux-cluster mailing list
> Linux-cluster@xxxxxxxxxx
> https://www.redhat.com/mailman/listinfo/linux-cluster
