Re: Rebooting the Master Node in an RHCS Cluster

On Tue, Oct 26, 2010 at 2:52 PM,  <gcharles@xxxxxxx> wrote:
> Hello,
>
> Was wondering if anyone else has ever run into this.  We have a three-node
> RHCS cluster:
>
> Three HP ProLiant DL380 G6 servers, 48 GB memory
> Dual network links, power supplies, and QLogic HBAs for redundancy
> EMC SAN
> RHEL 5.5, kernel 2.6.18-194.el5
>
> All three nodes are in an RHCS cluster running 12 Oracle database services.
> The cluster itself runs fine under normal conditions, and all failovers
> function as expected.  There is only one failover domain configured, and all
> three nodes are members of it.  Four of the Oracle database services use GFS2
> file systems; the rest use ext3.
>
> The problem arises when we attempt a controlled shutdown of the current
> master node.  We have tested the following scenarios:
>
> 1.  Node 1 is the current master and not running any services.  Node 2 is
> also not running any services.  Node 3 is running all 12 services.  We
> hard-fail node 1 (by logging into the iLO and clicking on "Reset" in power
> management) and node 2 immediately takes over the master role and the
> services stay where they are and continue to function.  I believe this is
> the expected behavior.
>
> 2.  Node 1 is the current master and not running any services.  Three
> services are on node 2, and node 3 is running the rest.  Again, we hard-fail
> node 1 as described above and node 2 assumes the master role and the
> services stay where they are and continue to function.
>
> 3.  Repeating the same setup as above: node 1 is the master and not running
> any services, three services are on node 2, and the rest are on node 3.  This
> time we perform a controlled shutdown of node 1 to "properly" remove it from
> the cluster (say we're doing a rolling patch of the OS on the nodes), using
> the following steps on the master node (collected into a script sketch below):
>  - Unmount any GFS file systems.
>  - service rgmanager stop; service gfs2 stop; service gfs stop  (clustat
> shows node1 Online but no rgmanager, as expected)
>  - fence_tool leave    (this removes node 1 from the fence group in the
> hopes that the other nodes don't try to fence it as it is rebooting)
>  - service clvmd stop
>  - cman_tool leave remove
>  - service qdiskd stop
>  - shutdown
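>
> Put together, the full sequence on the master node is roughly the script
> below (run as root; the mount points are placeholders for our actual GFS2
> mounts):
>
>   #!/bin/bash
>   # Controlled shutdown of the current master node.
>   umount /mnt/gfs2_vol1 /mnt/gfs2_vol2   # unmount any GFS/GFS2 file systems
>   service rgmanager stop                 # stop the cluster service manager
>   service gfs2 stop                      # stop the GFS2 and GFS init scripts
>   service gfs stop
>   fence_tool leave                       # leave the fence group so the other
>                                          #   nodes don't try to fence us
>   service clvmd stop                     # stop clustered LVM
>   cman_tool leave remove                 # leave the cluster and tell the
>                                          #   others to reduce expected votes
>   service qdiskd stop                    # stop the quorum disk daemon
>   shutdown -r now                        # reboot
>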
> Everything appears normal until we execute 'cman_tool leave remove'.  At
> that point the cluster log on nodes 2 and 3 shows "Lost contact with quorum
> device" (we expect that) but also shows "Emergency stop of services" for all
> 12 services.  While access to the quorum device is restored almost
> immediately (node 2 takes over the master role), rgmanager is temporarily
> unavailable on nodes 2 and 3 while the cluster reconfigures itself and
> restarts all 12 services.  Eventually all 12 services restart properly (not
> necessarily on the node they were originally on), and when node 1 finishes
> rebooting it rejoins the cluster cleanly.  Node 2 remains the master.
>
> If I do the same tests as above and reboot a node that is NOT the master,
> the services remain where they are and the cluster does not reconfigure
> itself or restart any services.
>
> My questions are: Why does the cluster reconfigure itself and restart ALL
> services, regardless of which node they are on, when I do a controlled
> shutdown of the current master node?  Do I have to hard-reset the master node
> in an RHCS cluster so the remaining services don't get restarted?  Why does
> the cluster completely reconfigure itself when the master node is 'properly'
> removed?
>
Hi, could you please show us your cluster.conf?
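
The <rm>, <quorumd> and failover domain sections are the interesting parts.
For reference, a three-node cluster with one unrestricted failover domain and
a quorum disk is usually laid out roughly like the skeleton below (every name,
vote count, address and password here is just a placeholder, not your actual
configuration):

<?xml version="1.0"?>
<cluster name="oracluster" config_version="1">
  <cman expected_votes="5"/>
  <quorumd label="oraqdisk" interval="2" tko="10" votes="2"/>
  <clusternodes>
    <clusternode name="node1" nodeid="1" votes="1">
      <fence><method name="1"><device name="node1-ilo"/></method></fence>
    </clusternode>
    <clusternode name="node2" nodeid="2" votes="1">
      <fence><method name="1"><device name="node2-ilo"/></method></fence>
    </clusternode>
    <clusternode name="node3" nodeid="3" votes="1">
      <fence><method name="1"><device name="node3-ilo"/></method></fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice name="node1-ilo" agent="fence_ilo" ipaddr="10.0.0.1" login="admin" passwd="secret"/>
    <fencedevice name="node2-ilo" agent="fence_ilo" ipaddr="10.0.0.2" login="admin" passwd="secret"/>
    <fencedevice name="node3-ilo" agent="fence_ilo" ipaddr="10.0.0.3" login="admin" passwd="secret"/>
  </fencedevices>
  <rm>
    <failoverdomains>
      <failoverdomain name="oradomain" ordered="0" restricted="0">
        <failoverdomainnode name="node1"/>
        <failoverdomainnode name="node2"/>
        <failoverdomainnode name="node3"/>
      </failoverdomain>
    </failoverdomains>
    <service name="oradb01" domain="oradomain" autostart="1" recovery="relocate">
      <!-- ip, fs/clusterfs and Oracle resources for each service go here -->
    </service>
  </rm>
</cluster>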

Regards,
Juanra
> Thanks for your help, and any suggestions would be appreciated.
>
> Greg Charles
> Mid Range Systems
>
> gcharles@xxxxxxx
>
>
> --
> Linux-cluster mailing list
> Linux-cluster@xxxxxxxxxx
> https://www.redhat.com/mailman/listinfo/linux-cluster
>

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster


