Hello,

I have a cluster with an Oracle service and RHEL 5.4 nodes. Typically one sets "shutdown abort" of the DB as the default mechanism to stop the service, to prevent stalling and to speed up relocation of the service in case of problems. The same approach is used by the rhcs-provided script, which is what I'm using.

But sometimes we have to do maintenance on the DB, and in that case the strategy is to freeze the service, manually stop the DB, make the modifications, manually start the DB and unfreeze the service. This is useful when all the work is done on the node that is carrying the service at that moment.
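For reference, the same-node maintenance looks roughly like the sequence below. This is only a sketch: the clusvcadm commands and the service name SRV are the ones used later in this message, while running lsnrctl/sqlplus as the oracle user is just an assumption about a typical single-instance setup.

    clusvcadm -Z SRV                 # freeze: rgmanager stops managing/monitoring the service
    su - oracle -c "lsnrctl stop"    # stop the listener by hand
    echo "shutdown immediate" | su - oracle -c "sqlplus -S / as sysdba"
    # ... maintenance activities on the DB ...
    echo "startup" | su - oracle -c "sqlplus -S / as sysdba"
    su - oracle -c "lsnrctl start"   # restart the listener
    clusvcadm -U SRV                 # unfreeze: rgmanager resumes control

The point of the freeze is that rgmanager keeps considering the service "started" but does not touch its resources, so the DBAs can stop and start the instance cleanly underneath it.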
Sometimes we also need activities where we want to relocate the service, and for planned activities the DBAs prefer a clean shutdown of the DB. With the same approach we do something like this, starting from node1 carrying the active service:

- freeze the service:
  clusvcadm -Z SRV

- maintenance activities, with manual stop of the service components (e.g. listener and Oracle instance)

- shutdown of node1:
  shutdown -h now
  The shutdown takes about 2 minutes. It is necessary to shut the node down, because every command I tried returned an error saying that the service was frozen and the command could not be run.

- wait on the surviving node until:

  1) it becomes master for the quorum disk, otherwise it loses quorum.
  Messages in /var/log/qdiskd.log:

Jan 7 17:57:55 oracs1 qdiskd[7043]: <info> Node 2 shutdown
Jan 7 17:57:55 oracs1 qdiskd[7043]: <debug> Making bid for master
Jan 7 17:58:30 oracs1 qdiskd[7043]: <info> Assuming master role

  This takes about 1 minute after the shutdown of the other node.

  2) the cluster registers that the other node has gone.
  Messages in /var/log/qdiskd.log:

Jan 7 18:00:35 oracs1 openais[7014]: [TOTEM] The token was lost in the OPERATIONAL state.
Jan 7 18:00:35 oracs1 openais[7014]: [TOTEM] Receive multicast socket recv buffer size (320000 bytes).
Jan 7 18:00:35 oracs1 openais[7014]: [TOTEM] Transmit multicast socket send buffer size (320000 bytes).
Jan 7 18:00:35 oracs1 openais[7014]: [TOTEM] entering GATHER state from 2.
Jan 7 18:00:40 oracs1 openais[7014]: [TOTEM] entering GATHER state from 0.
Jan 7 18:00:40 oracs1 openais[7014]: [TOTEM] Creating commit token because I am the rep.
Jan 7 18:00:40 oracs1 openais[7014]: [TOTEM] Saving state aru 24 high seq received 24
Jan 7 18:00:40 oracs1 openais[7014]: [TOTEM] Storing new sequence id for ring 4da34
Jan 7 18:00:40 oracs1 openais[7014]: [TOTEM] entering COMMIT state.
Jan 7 18:00:40 oracs1 openais[7014]: [TOTEM] entering RECOVERY state.
Jan 7 18:00:40 oracs1 openais[7014]: [TOTEM] position [0] member 192.168.16.1:
Jan 7 18:00:40 oracs1 openais[7014]: [TOTEM] previous ring seq 318000 rep 192.168.16.1
Jan 7 18:00:40 oracs1 openais[7014]: [TOTEM] aru 24 high delivered 24 received flag 1
Jan 7 18:00:40 oracs1 openais[7014]: [TOTEM] Did not need to originate any messages in recovery.
Jan 7 18:00:40 oracs1 openais[7014]: [TOTEM] Sending initial ORF token
Jan 7 18:00:40 oracs1 openais[7014]: [CLM ] CLM CONFIGURATION CHANGE
Jan 7 18:00:40 oracs1 openais[7014]: [CLM ] New Configuration:
Jan 7 18:00:40 oracs1 openais[7014]: [CLM ] r(0) ip(192.168.16.1)
Jan 7 18:00:40 oracs1 openais[7014]: [CLM ] Members Left:
Jan 7 18:00:40 oracs1 openais[7014]: [CLM ] r(0) ip(192.168.16.8)
Jan 7 18:00:40 oracs1 openais[7014]: [CLM ] Members Joined:
Jan 7 18:00:40 oracs1 openais[7014]: [CLM ] CLM CONFIGURATION CHANGE
Jan 7 18:00:40 oracs1 openais[7014]: [CLM ] New Configuration:
Jan 7 18:00:40 oracs1 openais[7014]: [CLM ] r(0) ip(192.168.16.1)
Jan 7 18:00:41 oracs1 openais[7014]: [CLM ] Members Left:
Jan 7 18:00:41 oracs1 openais[7014]: [CLM ] Members Joined:
Jan 7 18:00:41 oracs1 openais[7014]: [SYNC ] This node is within the primary component and will provide service.
Jan 7 18:00:41 oracs1 openais[7014]: [TOTEM] entering OPERATIONAL state.
Jan 7 18:00:41 oracs1 openais[7014]: [CLM ] got nodejoin message 192.168.16.1
Jan 7 18:00:41 oracs1 openais[7014]: [CPG ] got joinlist message from node 1

  This takes about 2 minutes (also due to the timeout values required by the qdisk, cman and multipath interactions).

In total it is about 5 minutes. After this we can work on node2:

- unfreeze the service:
  clusvcadm -U SRV
  This is not enough for the service to start automatically: clustat still reports the service as "started" on the other node and it stays that way, even though in theory the node knows that the other one has left the cluster... sort of a bug in my opinion.

- disable the service:
  clusvcadm -d SRV

- enable the service:
  clusvcadm -e SRV
  At this point the service starts right away; as there is only one node alive it is not necessary to specify the "-m" switch.
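For completeness, what we actually run on the surviving node is roughly the sketch below: first watching for the quorum-disk master role and for node1 to drop out of the membership, then the unfreeze/disable/enable sequence already described. Only a sketch; the watch interval and the grep pattern are arbitrary, the clusvcadm commands are the ones above.

    # wait until qdiskd has taken the master role and node1 has left the membership
    watch -n 10 'clustat; cman_tool status | grep -i -E "quorum|nodes"'

    # then bring the service up on this node
    clusvcadm -U SRV    # unfreeze (not sufficient by itself, see above)
    clusvcadm -d SRV    # disable
    clusvcadm -e SRV    # enable: the service starts here, no "-m" needed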
After a few minutes we can restart node1, which joins the cluster again without problems. Messages in /var/log/qdiskd.log of node2:

Jan 7 18:12:50 oracs1 openais[7014]: [TOTEM] entering GATHER state from 11.
Jan 7 18:12:50 oracs1 openais[7014]: [TOTEM] Creating commit token because I am the rep.
Jan 7 18:12:50 oracs1 openais[7014]: [TOTEM] Saving state aru 1c high seq received 1c
Jan 7 18:12:50 oracs1 openais[7014]: [TOTEM] Storing new sequence id for ring 4da38
Jan 7 18:12:50 oracs1 openais[7014]: [TOTEM] entering COMMIT state.
Jan 7 18:12:50 oracs1 openais[7014]: [TOTEM] entering RECOVERY state.
Jan 7 18:12:50 oracs1 openais[7014]: [TOTEM] position [0] member 192.168.16.1:
Jan 7 18:12:50 oracs1 openais[7014]: [TOTEM] previous ring seq 318004 rep 192.168.16.1
Jan 7 18:12:50 oracs1 openais[7014]: [TOTEM] aru 1c high delivered 1c received flag 1
Jan 7 18:12:50 oracs1 openais[7014]: [TOTEM] position [1] member 192.168.16.8:
Jan 7 18:12:50 oracs1 openais[7014]: [TOTEM] previous ring seq 318004 rep 192.168.16.8
Jan 7 18:12:50 oracs1 openais[7014]: [TOTEM] aru a high delivered a received flag 1
Jan 7 18:12:50 oracs1 openais[7014]: [TOTEM] Did not need to originate any messages in recovery.
Jan 7 18:12:50 oracs1 openais[7014]: [TOTEM] Sending initial ORF token
Jan 7 18:12:50 oracs1 openais[7014]: [CLM ] CLM CONFIGURATION CHANGE
Jan 7 18:12:50 oracs1 openais[7014]: [CLM ] New Configuration:
Jan 7 18:12:50 oracs1 openais[7014]: [CLM ] r(0) ip(192.168.16.1)
Jan 7 18:12:50 oracs1 openais[7014]: [CLM ] Members Left:
Jan 7 18:12:50 oracs1 openais[7014]: [CLM ] Members Joined:
Jan 7 18:12:50 oracs1 openais[7014]: [CLM ] CLM CONFIGURATION CHANGE
Jan 7 18:12:51 oracs1 openais[7014]: [CLM ] New Configuration:
Jan 7 18:12:51 oracs1 openais[7014]: [CLM ] r(0) ip(192.168.16.1)
Jan 7 18:12:51 oracs1 openais[7014]: [CLM ] r(0) ip(192.168.16.8)
Jan 7 18:12:51 oracs1 openais[7014]: [CLM ] Members Left:
Jan 7 18:12:51 oracs1 openais[7014]: [CLM ] Members Joined:
Jan 7 18:12:51 oracs1 openais[7014]: [CLM ] r(0) ip(192.168.16.8)
Jan 7 18:12:51 oracs1 openais[7014]: [SYNC ] This node is within the primary component and will provide service.
Jan 7 18:12:51 oracs1 openais[7014]: [TOTEM] entering OPERATIONAL state.
Jan 7 18:12:51 oracs1 openais[7014]: [CLM ] got nodejoin message 192.168.16.1
Jan 7 18:12:51 oracs1 openais[7014]: [CLM ] got nodejoin message 192.168.16.8
Jan 7 18:12:51 oracs1 openais[7014]: [CPG ] got joinlist message from node 1
Jan 7 18:13:20 oracs1 qdiskd[7043]: <debug> Node 2 is UP

So the steps above let us switch the DB cleanly, with these limits:

1) it takes about 10-15 minutes to have the whole cluster up again with both nodes active;
2) we have to shut down one node, and in clusters with more than one service this could be a blocker for the whole approach.

Any hints?

Thanks,
Gianluca

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster