Hello,

I was wondering if anyone else has ever run into this. We have a three-node RHCS cluster:
- Three ProLiant DL380 G6s with 48 GB of memory
- Dual network and power, QLogic HBAs for redundancy
- EMC SAN
- RHEL 5.5, kernel 2.6.18-194.el5
All three nodes are in an RHCS cluster running 12 Oracle database services. The cluster itself runs fine under normal conditions, and all failovers function as expected. There is only one failover domain configured, and all three nodes are members of that domain. Four of the Oracle database services contain GFS2 file systems; the rest are ext3.
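For reference, the failover domain and service layout look roughly like the cluster.conf fragment below. This is a minimal sketch with placeholder names (alldb, node1..node3, oradb01), not our actual config:

    <!-- Sketch only: one unrestricted, unordered failover domain with all three nodes -->
    <rm>
      <failoverdomains>
        <failoverdomain name="alldb" ordered="0" restricted="0">
          <failoverdomainnode name="node1" priority="1"/>
          <failoverdomainnode name="node2" priority="1"/>
          <failoverdomainnode name="node3" priority="1"/>
        </failoverdomain>
      </failoverdomains>
      <!-- 12 services defined along these lines; four use GFS2 resources, the rest ext3 -->
      <service name="oradb01" domain="alldb" autostart="1" recovery="relocate">
        <!-- fs/clusterfs, ip, and Oracle script resources go here -->
      </service>
    </rm>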
The problem arises when we attempt a controlled shutdown of the current master node. We have tested the following situations:
1. Node 1 is the current master and not running any services. Node 2 is also not running any services. Node 3 is running all 12 services. We hard-fail node 1 (by logging into the iLO and clicking "Reset" in power management), and node 2 immediately takes over the master role; the services stay where they are and continue to function. I believe this is the expected behavior.
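As an aside, the same hard fence can be driven from a surviving node's command line instead of the iLO web interface; a one-line sketch, assuming the iLO fence devices are already defined in cluster.conf for node1:

    # Fence node1 using whatever fence agent cluster.conf defines for it
    fence_node node1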
2. Node 1 is the current master and not running any services. Three services are on node 2, and node 3 is running the rest. Again, we hard-fail node 1 as described above; node 2 assumes the master role, and the services stay where they are and continue to function.
3. Repeating the same setup as above: node 1 is the master and not running any services, three services are on node 2, and the rest are on node 3. This time we perform a controlled shutdown of node 1 to "properly" remove it from the cluster (say we're doing a rolling patch of the OS on the nodes), with the following steps on the master node (the full sequence is also sketched as a script after this list):
- Unmount any GFS file systems.
- service rgmanager stop; service gfs2 stop; service gfs stop (clustat shows node1 Online but without rgmanager, as expected)
- fence_tool leave (this removes node 1 from the fence group, in the hope that the other nodes don't try to fence it while it is rebooting)
- service clvmd stop
- cman_tool leave remove
- service qdiskd stop
- shutdown
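Consolidated, the sequence we run is essentially the script below. This is a sketch of the procedure, not a general recipe; it assumes this node's services have already been stopped or relocated and that the only cluster mounts are gfs/gfs2:

    #!/bin/bash
    # Controlled removal of this node from the cluster before a reboot
    umount -a -t gfs2            # unmount any GFS2 file systems
    umount -a -t gfs             # and any GFS file systems
    service rgmanager stop       # clustat now shows this node Online, without rgmanager
    service gfs2 stop
    service gfs stop
    fence_tool leave             # leave the fence group so peers don't fence us mid-reboot
    service clvmd stop
    cman_tool leave remove       # 'remove' tells the others to recalculate expected votes
    service qdiskd stop
    shutdown -r now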
Everything appears normal until we execute 'cman_tool leave remove'. At that point the cluster log on nodes 2 and 3 shows "Lost contact with quorum device" (we expect that), but it also shows "Emergency stop of services" for all 12 services. While access to the quorum device is restored almost immediately (node 2 takes over the master role), rgmanager is temporarily unavailable on nodes 2 and 3 while the cluster basically reconfigures itself, restarting all 12 services. Eventually all 12 services restart properly (not necessarily on the node they were originally on), and when node 1 finishes rebooting, it properly rejoins the cluster. Node 2 remains the master.
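For anyone who wants to reproduce this, the transition is easy to watch from a surviving node; a sketch of the commands we'd use (log path per stock RHEL 5 syslog):

    # On node 2 or node 3, watch votes and quorum state while node 1 leaves
    watch -n1 'cman_tool status | egrep "Membership|votes|Quorum"'

    # Afterwards, confirm what rgmanager did with the services
    clustat
    grep -i "emergency" /var/log/messages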
If I do the same tests as above and reboot a node that is NOT the master, the services remain where they are and the cluster does not reconfigure itself or restart any services.
My questions are: Why does the cluster reconfigure itself and restart ALL services, regardless of which node they are on, when I do a controlled shutdown of the current master node? Do I have to hard-reset the master node in an RHCS cluster so the remaining services don't get restarted? Why does the cluster completely reconfigure itself when the master node is "properly" removed?
Thanks for your help; any suggestions would be appreciated.

Greg Charles
Mid Range Systems