Re: node reboots during stop of cluster application (oracle) and unable to relocate cluster application between nodes

On 05/11/2009 10:34 AM, Christopher Chen wrote:
I hope you're planning to expand to at least a 3-node cluster before you
go into production. You know two-node clusters are inherently unstable,
right? I assume you've read the architectural overview of how the cluster
suite achieves quorum.

A cluster requires (n/2)+1 votes to continue to operate. If you restart
or otherwise remove a machine from a two-node cluster, you've lost quorum,
and by definition you've dissolved your cluster while you're in that state.
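For example, two one-vote nodes need floor(2/2)+1 = 2 votes to stay
quorate, so a lone surviving node cannot form a quorum on its own, while
a three-node cluster needs floor(3/2)+1 = 2 votes and can therefore lose
any single node.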


Unless the special case two_node="1" is in use, and it is here:

       <cman expected_votes="1" two_node="1"/>

This allows the cluster to remain quorate when only one vote is present. The fencing is occurring because the interconnect link is dropping; see the annotation in the logs below.
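
As a quick sanity check (a sketch on my part, not something from this
thread), cman_tool on either node will confirm whether the single
expected vote and the two-node mode are actually in effect; the exact
output fields vary a little between releases:

       # show vote counts and quorum state as cman sees them
       cman_tool status
       # list the member nodes and their current status
       cman_tool nodes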

I'm pretty sure the behavior you are describing is proper.

Time flies like an arrow.
Fruit flies like a banana.

On May 11, 2009, at 4:08, "Viral .D. Ahire" <CISPLengineer.hz@xxxxxxx> wrote:

Hi,

I have configured a two-node cluster on Red Hat 5. The problem is that
when I relocate, restart or stop the running cluster service between the
two nodes, the node gets fenced and the server restarts. On the other
side, the server that takes over the cluster service leaves the cluster
and its cluster service (cman) stops automatically, so it is also fenced
by the other server.

I have observed that this problem occurs while stopping the cluster
service (oracle).

Please help me to resolve this problem.

The log messages and cluster.conf file are given below.
-------------------------
/etc/cluster/cluster.conf
-------------------------
<?xml version="1.0"?>
<cluster config_version="59" name="new_cluster">
    <fence_daemon post_fail_delay="0" post_join_delay="3"/>
    <clusternodes>
        <clusternode name="psfhost1" nodeid="1" votes="1">
            <fence>
                <method name="1">
                    <device name="cluster1"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="psfhost2" nodeid="2" votes="1">
            <fence>
                <method name="1">
                    <device name="cluster2"/>
                </method>
            </fence>
        </clusternode>
    </clusternodes>
    <cman expected_votes="1" two_node="1"/>
    <fencedevices>
        <fencedevice agent="fence_ilo" hostname="ilonode1" login="Administrator" name="cluster1" passwd="9M6X9CAU"/>
        <fencedevice agent="fence_ilo" hostname="ilonode2" login="Administrator" name="cluster2" passwd="ST69D87V"/>
    </fencedevices>
    <rm>
        <failoverdomains>
            <failoverdomain name="poy-cluster" ordered="0" restricted="0">
                <failoverdomainnode name="psfhost1" priority="1"/>
                <failoverdomainnode name="psfhost2" priority="1"/>
            </failoverdomain>
        </failoverdomains>
        <resources>
            <ip address="10.2.220.2" monitor_link="1"/>
            <script file="/etc/init.d/httpd" name="httpd"/>
            <fs device="/dev/cciss/c1d0p3" force_fsck="0" force_unmount="0" fsid="52427" fstype="ext3" mountpoint="/app" name="app" options="" self_fence="0"/>
            <fs device="/dev/cciss/c1d0p4" force_fsck="0" force_unmount="0" fsid="39388" fstype="ext3" mountpoint="/opt" name="opt" options="" self_fence="0"/>
            <fs device="/dev/cciss/c1d0p1" force_fsck="0" force_unmount="0" fsid="62307" fstype="ext3" mountpoint="/data" name="data" options="" self_fence="0"/>
            <fs device="/dev/cciss/c1d0p2" force_fsck="0" force_unmount="0" fsid="47234" fstype="ext3" mountpoint="/OPERATION" name="OPERATION" options="" self_fence="0"/>
            <script file="/etc/init.d/orcl" name="Oracle"/>
        </resources>
        <service autostart="0" name="oracle" recovery="relocate">
            <fs ref="app"/>
            <fs ref="opt"/>
            <fs ref="data"/>
            <fs ref="OPERATION"/>
            <ip ref="10.2.220.2"/>
            <script ref="Oracle"/>
        </service>
    </rm>
</cluster>

-----------------------
/var/log/messages
-----------------------
The following logs were captured while relocating the cluster service (oracle) between nodes.

_*Node-1*_

May 2 16:17:58 psfhost2 clurgmgrd[3793]: <notice> Starting stopped service
service:oracle
May 2 16:17:58 psfhost2 kernel: kjournald starting. Commit interval 5
seconds
May 2 16:17:58 psfhost2 kernel: EXT3-fs warning: maximal mount count
reached, running e2fsck is recommended
May 2 16:17:58 psfhost2 kernel: EXT3 FS on cciss/c1d0p3, internal journal
May 2 16:17:58 psfhost2 kernel: EXT3-fs: mounted filesystem with
ordered data mode.
May 2 16:17:58 psfhost2 kernel: kjournald starting. Commit interval 5
seconds
May 2 16:17:58 psfhost2 kernel: EXT3-fs warning: maximal mount count
reached, running e2fsck is recommended
May 2 16:17:58 psfhost2 kernel: EXT3 FS on cciss/c1d0p4, internal journal
May 2 16:17:58 psfhost2 kernel: EXT3-fs: mounted filesystem with
ordered data mode.
May 2 16:17:58 psfhost2 kernel: kjournald starting. Commit interval 5
seconds
May 2 16:17:58 psfhost2 kernel: EXT3-fs warning: maximal mount count
reached, running e2fsck is recommended
May 2 16:17:58 psfhost2 kernel: EXT3 FS on cciss/c1d0p1, internal journal
May 2 16:17:58 psfhost2 kernel: EXT3-fs: mounted filesystem with
ordered data mode.
May 2 16:17:59 psfhost2 kernel: kjournald starting. Commit interval 5
seconds
May 2 16:17:59 psfhost2 kernel: EXT3-fs warning: maximal mount count
reached, running e2fsck is recommended
May 2 16:17:59 psfhost2 kernel: EXT3 FS on cciss/c1d0p2, internal journal
May 2 16:17:59 psfhost2 kernel: EXT3-fs: mounted filesystem with
ordered data mode.
May 2 16:17:59 psfhost2 avahi-daemon[3661]: Registering new address
record for 10.2.220.2 on eth0.
May 2 16:18:00 psfhost2 in.rdiscd[5945]: setsockopt
(IP_ADD_MEMBERSHIP): Address already in use
May 2 16:18:00 psfhost2 in.rdiscd[5945]: Failed joining addresses
May 2 16:18:11 psfhost2 clurgmgrd[3793]: <notice> Service
service:oracle started
May 2 16:19:17 psfhost2 kernel: bnx2: eth1 NIC Link is Down

^^^^^
The cluster interconnect link went down, and thus this node could no longer communicate with the other node.
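
If eth1 is a single, dedicated interconnect NIC, one common mitigation
(a sketch on my part, not from this thread; interface names, addresses
and the choice of active-backup mode are assumptions) is to bond two
NICs so that one link flap does not partition the cluster:

       # /etc/modprobe.conf -- load the bonding driver in active-backup mode
       alias bond0 bonding
       options bond0 mode=1 miimon=100

       # /etc/sysconfig/network-scripts/ifcfg-bond0 (address is a placeholder)
       DEVICE=bond0
       IPADDR=192.168.100.7
       NETMASK=255.255.255.0
       ONBOOT=yes
       BOOTPROTO=none

       # /etc/sysconfig/network-scripts/ifcfg-eth1 -- repeat for the second slave
       DEVICE=eth1
       MASTER=bond0
       SLAVE=yes
       ONBOOT=yes
       BOOTPROTO=none

Either way, it is worth finding out why the bnx2 link keeps flapping in
the first place (cable, switch port, speed/duplex negotiation -- the log
shows it coming back at 1000 Mbps and then again at only 100 Mbps).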


May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] entering GATHER state
from 11.
May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] Saving state aru 1b
high seq received 1b
May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] Storing new sequence id
for ring 90
May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] entering COMMIT state.
May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] entering RECOVERY state.
May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] position [0] member
10.2.220.6:
May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] previous ring seq 140
rep 10.2.220.6
May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] aru 9 high delivered 9
received flag 1
May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] position [1] member
10.2.220.7:
May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] previous ring seq 136
rep 10.2.220.7
May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] aru 1b high delivered
1b received flag 1
May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] Did not need to
originate any messages in recovery.
May 2 16:19:26 psfhost2 openais[3275]: [CLM ] CLM CONFIGURATION CHANGE
May 2 16:19:26 psfhost2 openais[3275]: [CLM ] New Configuration:
May 2 16:19:27 psfhost2 openais[3275]: [CLM ] r(0) ip(10.2.220.7)
May 2 16:19:27 psfhost2 openais[3275]: [CLM ] Members Left:
May 2 16:19:27 psfhost2 openais[3275]: [CLM ] Members Joined:
May 2 16:19:27 psfhost2 openais[3275]: [CLM ] CLM CONFIGURATION CHANGE
May 2 16:19:27 psfhost2 openais[3275]: [CLM ] New Configuration:
May 2 16:19:27 psfhost2 openais[3275]: [CLM ] r(0) ip(10.2.220.6)
May 2 16:19:27 psfhost2 openais[3275]: [CLM ] r(0) ip(10.2.220.7)
May 2 16:19:27 psfhost2 openais[3275]: [CLM ] Members Left:
May 2 16:19:27 psfhost2 openais[3275]: [CLM ] Members Joined:
May 2 16:19:27 psfhost2 openais[3275]: [CLM ] r(0) ip(10.2.220.6)
May 2 16:19:27 psfhost2 openais[3275]: [SYNC ] This node is within the
primary component and will provide service.
May 2 16:19:27 psfhost2 openais[3275]: [TOTEM] entering OPERATIONAL
state.
May 2 16:19:27 psfhost2 openais[3275]: [CLM ] got nodejoin message
10.2.220.6
May 2 16:19:27 psfhost2 openais[3275]: [CLM ] got nodejoin message
10.2.220.7
May 2 16:19:27 psfhost2 openais[3275]: [CPG ] got joinlist message
from node 2
May 2 16:19:29 psfhost2 kernel: bnx2: eth1 NIC Link is Up, 1000 Mbps
full duplex, receive & transmit flow control ON
May 2 16:19:31 psfhost2 kernel: bnx2: eth1 NIC Link is Down
May 2 16:19:35 psfhost2 kernel: bnx2: eth1 NIC Link is Up, 100 Mbps
full duplex, receive & transmit flow control ON
May 2 16:19:42 psfhost2 kernel: dlm: connecting to 1
May 2 16:20:36 psfhost2 ccsd[3265]: Update of cluster.conf complete
(version 57 -> 59).
May 2 16:20:43 psfhost2 clurgmgrd[3793]: <notice> Reconfiguring
May 2 16:21:15 psfhost2 clurgmgrd[3793]: <notice> Stopping service
service:oracle
May 2 16:21:25 psfhost2 avahi-daemon[3661]: Withdrawing address record
for 10.2.220.7 on eth0.
May 2 16:21:25 psfhost2 avahi-daemon[3661]: Leaving mDNS multicast
group on interface eth0.IPv4 with address 10.2.220.7.
May 2 16:21:25 psfhost2 avahi-daemon[3661]: Joining mDNS multicast
group on interface eth0.IPv4 with address 10.2.220.2.
May 2 16:21:25 psfhost2 clurgmgrd: [3793]: <err> Failed to remove
10.2.220.2
May 2 16:21:40 psfhost2 openais[3275]: [TOTEM] entering RECOVERY state.
May 2 16:21:40 psfhost2 openais[3275]: [TOTEM] position [0] member
127.0.0.1:
May 2 16:21:40 psfhost2 openais[3275]: [TOTEM] previous ring seq 144
rep 10.2.220.6
May 2 16:21:40 psfhost2 openais[3275]: [TOTEM] aru 31 high delivered
31 received flag 1
May 2 16:21:40 psfhost2 openais[3275]: [TOTEM] Did not need to
originate any messages in recovery.
May 2 16:21:40 psfhost2 openais[3275]: [TOTEM] Sending initial ORF token
May 2 16:21:40 psfhost2 openais[3275]: [CLM ] CLM CONFIGURATION CHANGE
May 2 16:21:40 psfhost2 openais[3275]: [CLM ] New Configuration:
May 2 16:21:40 psfhost2 openais[3275]: [CLM ] r(0) ip(127.0.0.1)
May 2 16:21:40 psfhost2 openais[3275]: [CLM ] Members Left:
May 2 16:21:40 psfhost2 openais[3275]: [CLM ] r(0) ip(10.2.220.7)
May 2 16:21:40 psfhost2 openais[3275]: [CLM ] Members Joined:
May 2 16:21:40 psfhost2 openais[3275]: [CLM ] CLM CONFIGURATION CHANGE
May 2 16:21:40 psfhost2 openais[3275]: [CLM ] New Configuration:
May 2 16:21:40 psfhost2 openais[3275]: [CLM ] r(0) ip(127.0.0.1)
May 2 16:21:40 psfhost2 openais[3275]: [CLM ] Members Left:
May 2 16:21:40 psfhost2 openais[3275]: [CLM ] Members Joined:
May 2 16:21:40 psfhost2 openais[3275]: [SYNC ] This node is within the
primary component and will provide service.
May 2 16:21:40 psfhost2 openais[3275]: [TOTEM] entering OPERATIONAL state.
May 2 16:21:40 psfhost2 openais[3275]: [MAIN ] Killing node psfhost2
because it has rejoined the cluster without cman_tool join
May 2 16:21:40 psfhost2 openais[3275]: [CMAN ] cman killed by node 2
because we rejoined the cluster without a full restart
May 2 16:21:40 psfhost2 fenced[3291]: cman_get_nodes error -1 104
May 2 16:21:40 psfhost2 kernel: clurgmgrd[3793]: segfault at
0000000000000000 rip 0000000000408c4a rsp 00007fff3c4a9e20 error 4
May 2 16:21:40 psfhost2 fenced[3291]: cluster is down, exiting
May 2 16:21:40 psfhost2 groupd[3283]: cman_get_nodes error -1 104
May 2 16:21:40 psfhost2 dlm_controld[3297]: cluster is down, exiting
May 2 16:21:40 psfhost2 gfs_controld[3303]: cluster is down, exiting
May 2 16:21:40 psfhost2 clurgmgrd[3792]: <crit> Watchdog: Daemon died,
rebooting...
May 2 16:21:40 psfhost2 kernel: dlm: closing connection to node 1
May 2 16:21:40 psfhost2 kernel: dlm: closing connection to node 2
May 2 16:21:40 psfhost2 kernel: md: stopping all md devices.
May 2 16:21:41 psfhost2 kernel: uhci_hcd 0000:01:04.4: HCRESET not
completed yet!
May 2 16:24:55 psfhost2 syslogd 1.4.1: restart.
May 2 16:24:55 psfhost2 kernel: klogd 1.4.1, log source = /proc/kmsg
started.
May 2 16:24:55 psfhost2 kernel: Linux version 2.6.18-53.el5
(brewbuilder@xxxxxxxxxxxxxxxxxxxxxxxxxxx) (gcc version 4.1.2
20070626 (Red Hat 4.1.2-14)) #1 SMP Wed Oct

--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
