Hello,

I was wondering if anyone else has ever run into this. We have a three-node RHCS cluster:
- Three ProLiant DL380 G6s with 48 GB of memory
- Dual network and power, QLogic HBAs for redundancy
- EMC SAN
- RHEL 5.5, kernel 2.6.18-194.el5
All three nodes are in an RHCS cluster running 12 Oracle database services. The cluster itself runs fine under normal conditions, and all failovers function as expected. There is only one failover domain configured, and all three nodes are members of that domain. Four of the Oracle database services contain GFS2 file systems; the rest are ext3.
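For reference, the failover domain and service layout look roughly like the cluster.conf fragment below. This is a minimal sketch with placeholder names (alldb, node1..node3, oradb01), not our actual config:

    <!-- Sketch only: one unrestricted, unordered failover domain with all three nodes -->
    <rm>
      <failoverdomains>
        <failoverdomain name="alldb" ordered="0" restricted="0">
          <failoverdomainnode name="node1" priority="1"/>
          <failoverdomainnode name="node2" priority="1"/>
          <failoverdomainnode name="node3" priority="1"/>
        </failoverdomain>
      </failoverdomains>
      <!-- 12 services defined along these lines; four use GFS2 resources, the rest ext3 -->
      <service name="oradb01" domain="alldb" autostart="1" recovery="relocate">
        <!-- fs/clusterfs, ip, and Oracle script resources go here -->
      </service>
    </rm>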
The problem arises when we attempt a controlled shutdown of the current master node. We have tested the following situations:
1. Node 1 is the current master and not running any services. Node 2 is also not running any services. Node 3 is running all 12 services. We hard-fail node 1 (by logging into the iLO and clicking "Reset" in power management), and node 2 immediately takes over the master role; the services stay where they are and continue to function. I believe this is the expected behavior.
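As an aside, the same hard fence can be driven from a surviving node's command line instead of the iLO web interface; a one-line sketch, assuming the iLO fence devices are already defined in cluster.conf for node1:

    # Fence node1 using whatever fence agent cluster.conf defines for it
    fence_node node1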
2. Node 1 is the current master and not running any services. Three services are on node 2, and node 3 is running the rest. Again, we hard-fail node 1 as described above; node 2 assumes the master role, and the services stay where they are and continue to function.
3. Repeating the same setup as above: node 1 is the master and not running any services, three services are on node 2, and the rest are on node 3. This time we perform a controlled shutdown of node 1 to "properly" remove it from the cluster (say we're doing a rolling patch of the OS on the nodes), with the following steps on the master node (the full sequence is also sketched as a script after this list):
- Unmount any GFS file systems.
- service rgmanager stop; service gfs2 stop; service gfs stop (clustat shows node1 Online but without rgmanager, as expected)
- fence_tool leave (this removes node 1 from the fence group, in the hope that the other nodes don't try to fence it while it is rebooting)
- service clvmd stop
- cman_tool leave remove
- service qdiskd stop
- shutdown
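Consolidated, the sequence we run is essentially the script below. This is a sketch of the procedure, not a general recipe; it assumes this node's services have already been stopped or relocated and that the only cluster mounts are gfs/gfs2:

    #!/bin/bash
    # Controlled removal of this node from the cluster before a reboot
    umount -a -t gfs2            # unmount any GFS2 file systems
    umount -a -t gfs             # and any GFS file systems
    service rgmanager stop       # clustat now shows this node Online, without rgmanager
    service gfs2 stop
    service gfs stop
    fence_tool leave             # leave the fence group so peers don't fence us mid-reboot
    service clvmd stop
    cman_tool leave remove       # 'remove' tells the others to recalculate expected votes
    service qdiskd stop
    shutdown -r now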
Everything appears normal until we execute 'cman_tool leave remove'. At that point the cluster log on nodes 2 and 3 shows "Lost contact with quorum device" (we expect that), but it also shows "Emergency stop of services" for all 12 services. While access to the quorum device is restored almost immediately (node 2 takes over the master role), rgmanager is temporarily unavailable on nodes 2 and 3 while the cluster basically reconfigures itself, restarting all 12 services. Eventually all 12 services restart properly (not necessarily on the node they were originally on), and when node 1 finishes rebooting, it properly rejoins the cluster. Node 2 remains the master.
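For anyone who wants to reproduce this, the transition is easy to watch from a surviving node; a sketch of the commands we'd use (log path per stock RHEL 5 syslog):

    # On node 2 or node 3, watch votes and quorum state while node 1 leaves
    watch -n1 'cman_tool status | egrep "Membership|votes|Quorum"'

    # Afterwards, confirm what rgmanager did with the services
    clustat
    grep -i "emergency" /var/log/messages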
If I do the same tests as above and reboot a node that is NOT the master, the services remain where they are and the cluster does not reconfigure itself or restart any services.
My questions are: Why does the cluster reconfigure itself and restart ALL services, regardless of which node they are on, when I do a controlled shutdown of the current master node? Do I have to hard-reset the master node in an RHCS cluster so the remaining services don't get restarted? Why does the cluster completely reconfigure itself when the master node is "properly" removed?
Thanks for your help; any suggestions would be appreciated.

Greg Charles
Mid Range Systems