On Tue, Oct 26, 2010 at 2:52 PM, <gcharles@xxxxxxx> wrote:
> Hello,
>
> Was wondering if anyone else has ever run into this. We have a three-node
> RHCS cluster:
>
> Three ProLiant DL380 G6s, 48 GB memory
> Dual network, dual power, and QLogic HBAs for redundancy
> EMC SAN
> RHEL 5.5, kernel 2.6.18-194.el5
>
> All three are in an RHCS cluster running 12 Oracle database services. The
> cluster itself runs fine under normal conditions, and all failovers work
> as expected. There is only one failover domain configured, and all three
> nodes are members of that domain. Four of the Oracle database services use
> GFS2 file systems; the rest use ext3.
>
> The problem arises when we attempt a controlled shutdown of the current
> master node. We have tested the following situations:
>
> 1. Node 1 is the current master and is not running any services. Node 2
> is also not running any services. Node 3 is running all 12 services. We
> hard-fail node 1 (by logging into the iLO and clicking "Reset" in power
> management); node 2 immediately takes over the master role, and the
> services stay where they are and continue to function. I believe this is
> the expected behavior.
>
> 2. Node 1 is the current master and is not running any services. Three
> services are on node 2, and node 3 is running the rest. Again, we
> hard-fail node 1 as described above; node 2 assumes the master role, and
> the services stay where they are and continue to function.
>
> 3. Same layout as above: node 1 is the master and is not running any
> services, three services are on node 2, and the rest are on node 3. This
> time we perform a controlled shutdown of node 1 to "properly" remove it
> from the cluster (say we're doing a rolling patch of the OS on the nodes),
> with the following steps on the master node:
>   - Unmount any GFS file systems.
>   - service rgmanager stop; service gfs2 stop; service gfs stop
>     (clustat shows node 1 Online but without rgmanager, as expected)
>   - fence_tool leave (this removes node 1 from the fence group, in the
>     hope that the other nodes don't try to fence it while it is rebooting)
>   - service clvmd stop
>   - cman_tool leave remove
>   - service qdiskd stop
>   - shutdown
>
> Everything appears normal until we execute 'cman_tool leave remove'. At
> that point the cluster log on nodes 2 and 3 shows "Lost contact with
> quorum device" (we expect that), but it also shows "Emergency stop of
> services" for all 12 services. While access to the quorum device is
> restored almost immediately (node 2 takes over the master role), rgmanager
> is temporarily unavailable on nodes 2 and 3 while the cluster essentially
> reconfigures itself, restarting all 12 services. Eventually all 12
> services restart properly (not necessarily on the node they were on
> originally), and when node 1 finishes rebooting, it rejoins the cluster
> cleanly. Node 2 remains the master.
>
> If I run the same tests as above and reboot a node that is NOT the master,
> the services remain where they are and the cluster does not reconfigure
> itself or restart any services.
>
> My questions are: Why does the cluster reconfigure itself and restart ALL
> services, regardless of which node they are on, when I do a controlled
> shutdown of the current master node? Do I have to hard-reset the master
> node in an RHCS cluster so the remaining services don't get restarted? Why
> does the cluster completely reconfigure itself when the master node is
> "properly" removed?

Hi, could you please show us your cluster.conf?
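
The <quorumd> line and the <rm> section (failover domain and service
definitions) are the parts most relevant here. For reference, a minimal
cluster.conf with a single shared failover domain usually looks roughly
like the sketch below; all names, votes, and fence settings are
placeholders, not anything taken from your cluster:

    <?xml version="1.0"?>
    <!-- illustrative sketch only; names, votes, and fence details are placeholders -->
    <cluster name="examplecluster" config_version="1">
      <cman expected_votes="5"/>
      <quorumd interval="2" tko="10" votes="2" label="exampleqdisk"/>
      <clusternodes>
        <clusternode name="node1" nodeid="1" votes="1">
          <fence>
            <method name="1">
              <device name="node1-ilo"/>
            </method>
          </fence>
        </clusternode>
        <!-- node2 and node3 defined the same way -->
      </clusternodes>
      <fencedevices>
        <fencedevice name="node1-ilo" agent="fence_ilo" ipaddr="..." login="..." passwd="..."/>
      </fencedevices>
      <rm>
        <failoverdomains>
          <failoverdomain name="alldom" ordered="0" restricted="0">
            <failoverdomainnode name="node1" priority="1"/>
            <failoverdomainnode name="node2" priority="1"/>
            <failoverdomainnode name="node3" priority="1"/>
          </failoverdomain>
        </failoverdomains>
        <service name="oradb01" domain="alldom" autostart="1" recovery="relocate">
          <!-- fs/clusterfs, ip, and database resources for each service -->
        </service>
      </rm>
    </cluster>

Yours will obviously be larger, with twelve service blocks and the GFS2
resources, but the quorumd settings and the rm section should be enough to
see how the quorum disk and rgmanager are configured.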
Regards,
Juanra

> Thanks for your help, and any suggestions would be appreciated.
>
> Greg Charles
> Mid Range Systems
> gcharles@xxxxxxx

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster