Re: [Linux-cluster] Configuring rgmanager

Failover will not occur until after CMAN (or gulm) says the node is dead and has been fenced. When using the kernel Service Manager (provided by CMAN), recovery is in the following order:

(1) Fencing
(2) Locking
(3) GFS
(4) User services (e.g. rgmanager)
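(Note: with fence_manual, step (1) blocks until the fence is acknowledged by hand on the surviving node; roughly:

[root@buba ~]# fence_ack_manual -n 200.0.0.102
[root@buba ~]# cat /proc/cluster/services

...where the node address is just the one that shows up in the logs below.)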

How long did you wait? :)



Results from my tests with two nodes (buba and gump), running the latest CVS (updated today):
I tried to set up a basic script as a failover service on the two nodes.
Initialization:
- ccsd, cman_tool join and fence_tool join on both nodes
Then I started rgmanager on both nodes:
the script coucou (echo `uname -n` >> bla.txt) is launched on one of the two nodes.
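(For reference, a script service of this sort is declared in the <rm> section of cluster.conf; a minimal sketch as I understand the rgmanager schema, with an illustrative script path rather than the exact one used here:

<rm>
        <resources>
                <script name="coucou" file="/root/coucou.sh"/>
        </resources>
        <service name="coucou">
                <script ref="coucou"/>
        </service>
</rm>
)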
With clusvcadm I made this script run on gump, then I rebooted gump.
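(The relocation was done with clusvcadm; if I remember the options correctly, the command is along these lines:

[root@buba ~]# clusvcadm -r coucou -m gump
)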
Here is the syslog on buba:
Feb 28 13:20:17 buba kernel: CMAN: removing node gump from the cluster : Missed too many heartbeats
Feb 28 13:20:17 buba fenced[7573]: gump not a cluster member after 0 sec post_fail_delay
Feb 28 13:20:17 buba fenced[7573]: fencing node "gump"
Feb 28 13:20:20 buba fence_manual: Node 200.0.0.102 needs to be reset before recovery can procede. Waiting for 200.0.0.102 to rejoin the cluster or for manual acknowledgement that it has been reset (i.e. fence_ack_manual -n 200.0.0.102)
Feb 28 13:20:29 buba fenced[7573]: fence "gump" success
Feb 28 13:20:32 buba clurgmgrd[7581]: <notice> Taking over resource group coucou from down member (null)
Feb 28 13:20:32 buba clurgmgrd[7581]: <notice> Resource group coucou started


Then gump came back and rejoined the cluster; syslog on buba:
Feb 28 13:23:47 buba kernel: CMAN: node gump rejoining

I moved the script back onto gump (again with clusvcadm):
Feb 28 13:24:50 buba clurgmgrd[7581]: <notice> Stopping resource group coucou
Feb 28 13:24:50 buba clurgmgrd[7581]: <notice> Resource group coucou is stopped
Feb 28 13:24:50 buba clurgmgrd[7581]: <notice> Resource group coucou is now running on member 2


Then I rebooted gump again (:)) and this is where the problems started:
Gump was removed from the cluster:
Feb 28 13:25:57 buba kernel: CMAN: removing node gump from the cluster : Missed too many heartbeats
Feb 28 13:25:57 buba fenced[7573]: gump not a cluster member after 0 sec post_fail_delay
Feb 28 13:25:57 buba fenced[7573]: fencing node "gump"
Feb 28 13:26:03 buba fence_manual: Node 200.0.0.102 needs to be reset before recovery can procede. Waiting for 200.0.0.102 to rejoin the cluster or for manual acknowledgement that it has been reset (i.e. fence_ack_manual -n 200.0.0.102)
Feb 28 13:26:14 buba fenced[7573]: fence "gump" success


And this time, rgmanager did nothing.
When I looked at /proc/cluster/services I had:
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1]

DLM Lock Space:  "Magma"                             3   4 run       -
[1]

User:            "usrm::manager"                     2   3 recover 2 -
[1]

Whereas during the first reboot of gump I had:

Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1]

DLM Lock Space:  "Magma"                             3   4 run       -
[1]

User:            "usrm::manager"                     2   3 run       -
[1]

Then I tried to bring gump back:

And here is what I had on gump:
[root@gump ~]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2]

User: "usrm::manager" 0 3 join S-1,80,2
[]


So at that point nothing worked. I hopelessly tried restarting rgmanager on both nodes, but nothing helped; I ended up with states where, on gump:

[root@gump ~]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2]

DLM Lock Space:  "Magma"                             3   6 run       -
[2]

User:            "usrm::manager"                     4   5 run       -
[2]

and on buba:
[root@buba ~]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2]

(buba does not seem to have any clurgmgrd running, even though I started rgmanager...)
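(A quick way to confirm whether the daemon is actually there is something like:

[root@buba ~]# ps ax | grep clurgmgrd
[root@buba ~]# clustat
)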

I don't know whether it's a bug in rgmanager or whether I'm doing something wrong, but I don't understand why everything worked during the first reboot and nothing worked after that...
