On Mon, May 11, 2009 at 9:33 AM, John Ruemker <jruemker@xxxxxxxxxx> wrote:
> On 05/11/2009 10:34 AM, Christopher Chen wrote:
>>
>> I hope you're planning to expand to at least a 3-node cluster before
>> you go into production. You know two-node clusters are inherently
>> unstable, right? I assume you've read the architectural overview of
>> how the cluster suite achieves quorum.
>>
>> A cluster requires (n/2)+1 votes to continue to operate. If you
>> restart or otherwise remove a machine from a two-node cluster, you've
>> lost quorum, and by definition you've dissolved your cluster while
>> you're in that state.
>>
>
> Unless the special case two_node="1" is in use, and it is here:
>
> <cman expected_votes="1" two_node="1"/>
>
> This allows the cluster to maintain quorum when only one vote is
> present. Fencing is occurring because the link is dropping. See below:

I understand that that's an option, but how safe is it? Two-node
clusters scare me.

>
>> I'm pretty sure the behavior you are describing is proper.
>>
>> Time flies like an arrow.
>> Fruit flies like a banana.
>>
>> On May 11, 2009, at 4:08, "Viral .D. Ahire" <CISPLengineer.hz@xxxxxxx> wrote:
>>
>>> Hi,
>>>
>>> I have configured a two-node cluster on Red Hat 5. The problem is
>>> that when I relocate, restart, or stop the running cluster service
>>> between the two nodes, the node gets fenced and the server restarts.
>>> On the other side, the server that takes over the cluster service
>>> leaves the cluster, and its cluster service (cman) stops
>>> automatically, so it is also fenced by the other server.
>>>
>>> I observed that this problem occurs while stopping the cluster
>>> service (oracle).
>>>
>>> Please help me resolve this problem.
>>>
>>> The log messages and cluster.conf file are given below.
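[As an aside, the quorum arithmetic discussed above works out as follows. This is a minimal sketch for illustration, not part of the cluster suite itself:]

```python
# Minimal sketch of CMAN-style quorum arithmetic (illustrative only).

def votes_needed(expected_votes: int) -> int:
    """Quorum requires a strict majority: floor(n/2) + 1 votes."""
    return expected_votes // 2 + 1

def has_quorum(votes_present: int, expected_votes: int,
               two_node: bool = False) -> bool:
    if two_node:
        # With two_node="1", a single vote is enough to stay quorate;
        # fencing then decides which node survives a partition.
        return votes_present >= 1
    return votes_present >= votes_needed(expected_votes)

# A two-node cluster without two_node="1" dissolves when one node leaves:
print(has_quorum(1, 2))                 # False
# ...but stays quorate with the special case enabled:
print(has_quorum(1, 2, two_node=True))  # True
# A three-node cluster tolerates losing one node:
print(has_quorum(2, 3))                 # True
```

This is exactly why two_node="1" must be paired with expected_votes="1", and why it leans so heavily on working fencing.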
>>> -------------------------
>>> /etc/cluster/cluster.conf
>>> -------------------------
>>> <?xml version="1.0"?>
>>> <cluster config_version="59" name="new_cluster">
>>>   <fence_daemon post_fail_delay="0" post_join_delay="3"/>
>>>   <clusternodes>
>>>     <clusternode name="psfhost1" nodeid="1" votes="1">
>>>       <fence>
>>>         <method name="1">
>>>           <device name="cluster1"/>
>>>         </method>
>>>       </fence>
>>>     </clusternode>
>>>     <clusternode name="psfhost2" nodeid="2" votes="1">
>>>       <fence>
>>>         <method name="1">
>>>           <device name="cluster2"/>
>>>         </method>
>>>       </fence>
>>>     </clusternode>
>>>   </clusternodes>
>>>   <cman expected_votes="1" two_node="1"/>
>>>   <fencedevices>
>>>     <fencedevice agent="fence_ilo" hostname="ilonode1" login="Administrator" name="cluster1" passwd="9M6X9CAU"/>
>>>     <fencedevice agent="fence_ilo" hostname="ilonode2" login="Administrator" name="cluster2" passwd="ST69D87V"/>
>>>   </fencedevices>
>>>   <rm>
>>>     <failoverdomains>
>>>       <failoverdomain name="poy-cluster" ordered="0" restricted="0">
>>>         <failoverdomainnode name="psfhost1" priority="1"/>
>>>         <failoverdomainnode name="psfhost2" priority="1"/>
>>>       </failoverdomain>
>>>     </failoverdomains>
>>>     <resources>
>>>       <ip address="10.2.220.2" monitor_link="1"/>
>>>       <script file="/etc/init.d/httpd" name="httpd"/>
>>>       <fs device="/dev/cciss/c1d0p3" force_fsck="0" force_unmount="0" fsid="52427" fstype="ext3" mountpoint="/app" name="app" options="" self_fence="0"/>
>>>       <fs device="/dev/cciss/c1d0p4" force_fsck="0" force_unmount="0" fsid="39388" fstype="ext3" mountpoint="/opt" name="opt" options="" self_fence="0"/>
>>>       <fs device="/dev/cciss/c1d0p1" force_fsck="0" force_unmount="0" fsid="62307" fstype="ext3" mountpoint="/data" name="data" options="" self_fence="0"/>
>>>       <fs device="/dev/cciss/c1d0p2" force_fsck="0" force_unmount="0" fsid="47234" fstype="ext3" mountpoint="/OPERATION" name="OPERATION" options="" self_fence="0"/>
>>>       <script file="/etc/init.d/orcl" name="Oracle"/>
>>>     </resources>
>>>     <service autostart="0" name="oracle" recovery="relocate">
>>>       <fs ref="app"/>
>>>       <fs ref="opt"/>
>>>       <fs ref="data"/>
>>>       <fs ref="OPERATION"/>
>>>       <ip ref="10.2.220.2"/>
>>>       <script ref="Oracle"/>
>>>     </service>
>>>   </rm>
>>> </cluster>
>>>
>>> -----------------------
>>> /var/log/messages
>>> -----------------------
>>> The following logs were captured while relocating the cluster service (oracle) between nodes.
>>>
>>> _*Node-1*_
>>>
>>> May 2 16:17:58 psfhost2 clurgmgrd[3793]: <notice> Starting stopped service service:oracle
>>> May 2 16:17:58 psfhost2 kernel: kjournald starting. Commit interval 5 seconds
>>> May 2 16:17:58 psfhost2 kernel: EXT3-fs warning: maximal mount count reached, running e2fsck is recommended
>>> May 2 16:17:58 psfhost2 kernel: EXT3 FS on cciss/c1d0p3, internal journal
>>> May 2 16:17:58 psfhost2 kernel: EXT3-fs: mounted filesystem with ordered data mode.
>>> May 2 16:17:58 psfhost2 kernel: kjournald starting. Commit interval 5 seconds
>>> May 2 16:17:58 psfhost2 kernel: EXT3-fs warning: maximal mount count reached, running e2fsck is recommended
>>> May 2 16:17:58 psfhost2 kernel: EXT3 FS on cciss/c1d0p4, internal journal
>>> May 2 16:17:58 psfhost2 kernel: EXT3-fs: mounted filesystem with ordered data mode.
>>> May 2 16:17:58 psfhost2 kernel: kjournald starting. Commit interval 5 seconds
>>> May 2 16:17:58 psfhost2 kernel: EXT3-fs warning: maximal mount count reached, running e2fsck is recommended
>>> May 2 16:17:58 psfhost2 kernel: EXT3 FS on cciss/c1d0p1, internal journal
>>> May 2 16:17:58 psfhost2 kernel: EXT3-fs: mounted filesystem with ordered data mode.
>>> May 2 16:17:59 psfhost2 kernel: kjournald starting. Commit interval 5 seconds
>>> May 2 16:17:59 psfhost2 kernel: EXT3-fs warning: maximal mount count reached, running e2fsck is recommended
>>> May 2 16:17:59 psfhost2 kernel: EXT3 FS on cciss/c1d0p2, internal journal
>>> May 2 16:17:59 psfhost2 kernel: EXT3-fs: mounted filesystem with ordered data mode.
>>> May 2 16:17:59 psfhost2 avahi-daemon[3661]: Registering new address record for 10.2.220.2 on eth0.
>>> May 2 16:18:00 psfhost2 in.rdiscd[5945]: setsockopt (IP_ADD_MEMBERSHIP): Address already in use
>>> May 2 16:18:00 psfhost2 in.rdiscd[5945]: Failed joining addresses
>>> May 2 16:18:11 psfhost2 clurgmgrd[3793]: <notice> Service service:oracle started
>>> May 2 16:19:17 psfhost2 kernel: bnx2: eth1 NIC Link is Down
>
> ^^^^^
> The cluster interconnect link went down, and thus this node could no
> longer communicate with the other node.
>
>>> May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] entering GATHER state from 11.
>>> May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] Saving state aru 1b high seq received 1b
>>> May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] Storing new sequence id for ring 90
>>> May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] entering COMMIT state.
>>> May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] entering RECOVERY state.
>>> May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] position [0] member 10.2.220.6:
>>> May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] previous ring seq 140 rep 10.2.220.6
>>> May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] aru 9 high delivered 9 received flag 1
>>> May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] position [1] member 10.2.220.7:
>>> May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] previous ring seq 136 rep 10.2.220.7
>>> May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] aru 1b high delivered 1b received flag 1
>>> May 2 16:19:26 psfhost2 openais[3275]: [TOTEM] Did not need to originate any messages in recovery.
>>> May 2 16:19:26 psfhost2 openais[3275]: [CLM ] CLM CONFIGURATION CHANGE
>>> May 2 16:19:26 psfhost2 openais[3275]: [CLM ] New Configuration:
>>> May 2 16:19:27 psfhost2 openais[3275]: [CLM ]  r(0) ip(10.2.220.7)
>>> May 2 16:19:27 psfhost2 openais[3275]: [CLM ] Members Left:
>>> May 2 16:19:27 psfhost2 openais[3275]: [CLM ] Members Joined:
>>> May 2 16:19:27 psfhost2 openais[3275]: [CLM ] CLM CONFIGURATION CHANGE
>>> May 2 16:19:27 psfhost2 openais[3275]: [CLM ] New Configuration:
>>> May 2 16:19:27 psfhost2 openais[3275]: [CLM ]  r(0) ip(10.2.220.6)
>>> May 2 16:19:27 psfhost2 openais[3275]: [CLM ]  r(0) ip(10.2.220.7)
>>> May 2 16:19:27 psfhost2 openais[3275]: [CLM ] Members Left:
>>> May 2 16:19:27 psfhost2 openais[3275]: [CLM ] Members Joined:
>>> May 2 16:19:27 psfhost2 openais[3275]: [CLM ]  r(0) ip(10.2.220.6)
>>> May 2 16:19:27 psfhost2 openais[3275]: [SYNC ] This node is within the primary component and will provide service.
>>> May 2 16:19:27 psfhost2 openais[3275]: [TOTEM] entering OPERATIONAL state.
>>> May 2 16:19:27 psfhost2 openais[3275]: [CLM ] got nodejoin message 10.2.220.6
>>> May 2 16:19:27 psfhost2 openais[3275]: [CLM ] got nodejoin message 10.2.220.7
>>> May 2 16:19:27 psfhost2 openais[3275]: [CPG ] got joinlist message from node 2
>>> May 2 16:19:29 psfhost2 kernel: bnx2: eth1 NIC Link is Up, 1000 Mbps full duplex, receive & transmit flow control ON
>>> May 2 16:19:31 psfhost2 kernel: bnx2: eth1 NIC Link is Down
>>> May 2 16:19:35 psfhost2 kernel: bnx2: eth1 NIC Link is Up, 100 Mbps full duplex, receive & transmit flow control ON
>>> May 2 16:19:42 psfhost2 kernel: dlm: connecting to 1
>>> May 2 16:20:36 psfhost2 ccsd[3265]: Update of cluster.conf complete (version 57 -> 59).
>>> May 2 16:20:43 psfhost2 clurgmgrd[3793]: <notice> Reconfiguring
>>> May 2 16:21:15 psfhost2 clurgmgrd[3793]: <notice> Stopping service service:oracle
>>> May 2 16:21:25 psfhost2 avahi-daemon[3661]: Withdrawing address record for 10.2.220.7 on eth0.
>>> May 2 16:21:25 psfhost2 avahi-daemon[3661]: Leaving mDNS multicast group on interface eth0.IPv4 with address 10.2.220.7.
>>> May 2 16:21:25 psfhost2 avahi-daemon[3661]: Joining mDNS multicast group on interface eth0.IPv4 with address 10.2.220.2.
>>> May 2 16:21:25 psfhost2 clurgmgrd: [3793]: <err> Failed to remove 10.2.220.2
>>> May 2 16:21:40 psfhost2 openais[3275]: [TOTEM] entering RECOVERY state.
>>> May 2 16:21:40 psfhost2 openais[3275]: [TOTEM] position [0] member 127.0.0.1:
>>> May 2 16:21:40 psfhost2 openais[3275]: [TOTEM] previous ring seq 144 rep 10.2.220.6
>>> May 2 16:21:40 psfhost2 openais[3275]: [TOTEM] aru 31 high delivered 31 received flag 1
>>> May 2 16:21:40 psfhost2 openais[3275]: [TOTEM] Did not need to originate any messages in recovery.
>>> May 2 16:21:40 psfhost2 openais[3275]: [TOTEM] Sending initial ORF token
>>> May 2 16:21:40 psfhost2 openais[3275]: [CLM ] CLM CONFIGURATION CHANGE
>>> May 2 16:21:40 psfhost2 openais[3275]: [CLM ] New Configuration:
>>> May 2 16:21:40 psfhost2 openais[3275]: [CLM ]  r(0) ip(127.0.0.1)
>>> May 2 16:21:40 psfhost2 openais[3275]: [CLM ] Members Left:
>>> May 2 16:21:40 psfhost2 openais[3275]: [CLM ]  r(0) ip(10.2.220.7)
>>> May 2 16:21:40 psfhost2 openais[3275]: [CLM ] Members Joined:
>>> May 2 16:21:40 psfhost2 openais[3275]: [CLM ] CLM CONFIGURATION CHANGE
>>> May 2 16:21:40 psfhost2 openais[3275]: [CLM ] New Configuration:
>>> May 2 16:21:40 psfhost2 openais[3275]: [CLM ]  r(0) ip(127.0.0.1)
>>> May 2 16:21:40 psfhost2 openais[3275]: [CLM ] Members Left:
>>> May 2 16:21:40 psfhost2 openais[3275]: [CLM ] Members Joined:
>>> May 2 16:21:40 psfhost2 openais[3275]: [SYNC ] This node is within the primary component and will provide service.
>>> May 2 16:21:40 psfhost2 openais[3275]: [TOTEM] entering OPERATIONAL state.
>>> May 2 16:21:40 psfhost2 openais[3275]: [MAIN ] Killing node psfhost2 because it has rejoined the cluster without cman_tool join
>>> May 2 16:21:40 psfhost2 openais[3275]: [CMAN ] cman killed by node 2 because we rejoined the cluster without a full restart
>>> May 2 16:21:40 psfhost2 fenced[3291]: cman_get_nodes error -1 104
>>> May 2 16:21:40 psfhost2 kernel: clurgmgrd[3793]: segfault at 0000000000000000 rip 0000000000408c4a rsp 00007fff3c4a9e20 error 4
>>> May 2 16:21:40 psfhost2 fenced[3291]: cluster is down, exiting
>>> May 2 16:21:40 psfhost2 groupd[3283]: cman_get_nodes error -1 104
>>> May 2 16:21:40 psfhost2 dlm_controld[3297]: cluster is down, exiting
>>> May 2 16:21:40 psfhost2 gfs_controld[3303]: cluster is down, exiting
>>> May 2 16:21:40 psfhost2 clurgmgrd[3792]: <crit> Watchdog: Daemon died, rebooting...
>>> May 2 16:21:40 psfhost2 kernel: dlm: closing connection to node 1
>>> May 2 16:21:40 psfhost2 kernel: dlm: closing connection to node 2
>>> May 2 16:21:40 psfhost2 kernel: md: stopping all md devices.
>>> May 2 16:21:41 psfhost2 kernel: uhci_hcd 0000:01:04.4: HCRESET not completed yet!
>>> May 2 16:24:55 psfhost2 syslogd 1.4.1: restart.
>>> May 2 16:24:55 psfhost2 kernel: klogd 1.4.1, log source = /proc/kmsg started.
>>> May 2 16:24:55 psfhost2 kernel: Linux version 2.6.18-53.el5 (brewbuilder@xxxxxxxxxxxxxxxxxxxxxxxxxxx) (gcc version 4.1.2 20070626 (Red Hat 4.1.2-14)) #1 SMP Wed Oct
>>>
>>> --
>>> Linux-cluster mailing list
>>> Linux-cluster@xxxxxxxxxx
>>> https://www.redhat.com/mailman/listinfo/linux-cluster

--
Chris Chen <muffaleta@xxxxxxxxx>
"I want the kind of six pack you can't drink."
-- Micah
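P.S. For anyone debugging a similar setup: it can help to script a quick sanity check of the posted cluster.conf, confirming the two_node/expected_votes pairing and that every node references a defined fence device. A rough sketch follows; the helper and its checks are my own, not an official Red Hat tool:

```python
# Rough sanity check for a RHCS cluster.conf (illustrative, not an official tool).
import xml.etree.ElementTree as ET

def check_cluster_conf(path="/etc/cluster/cluster.conf"):
    """Return a list of human-readable problems found in the config."""
    root = ET.parse(path).getroot()
    problems = []

    # two_node="1" is only valid together with expected_votes="1".
    cman = root.find("cman")
    if cman is not None and cman.get("two_node") == "1":
        if cman.get("expected_votes") != "1":
            problems.append('two_node="1" requires expected_votes="1"')

    # Every <clusternode> should reference a defined <fencedevice>.
    devices = {fd.get("name") for fd in root.iter("fencedevice")}
    for node in root.iter("clusternode"):
        refs = {d.get("name") for d in node.iter("device")}
        if not refs:
            problems.append(f'node {node.get("name")} has no fence method')
        elif not refs <= devices:
            problems.append(
                f'node {node.get("name")} references undefined fence device(s)')

    return problems

if __name__ == "__main__":
    for p in check_cluster_conf():
        print("WARNING:", p)
```

Against the cluster.conf in this thread it should come back clean, which supports John's read that the config is fine and the fencing is being triggered by the flapping eth1 link rather than a quorum misconfiguration.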