Cluster failover troubleshooting

Please help me figure out why this cluster failed over. This has happened several times in the past month, whereas previously it had been quite stable. What can trigger the corosync "[TOTEM ] A processor failed, forming new configuration." message? By all appearances the primary server was functioning properly until it was fenced by the secondary.

I've got cluster 3 running on Debian Lenny with kernel 2.6.30-1-amd64:
ii  openais        1.0.0-3local1  Standards-based cluster framework (daemon an
ii  corosync       1.0.0-4        Standards-based cluster framework (daemon an
ii  rgmanager      3.0.0-1~agx0lo clustered resource group manager
ii  cman           3.0.0-1~agx0lo cluster manager
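
If more data would help, this is roughly what I can run on both nodes to check ring and membership state (a sketch; these are just the stock corosync/cman tools, nothing site-specific assumed):

# Totem ring status as corosync sees it (any ring faults?)
corosync-cfgtool -s

# cman's view of membership, votes and the quorum device
cman_tool status
cman_tool nodes

# Fence domain and DLM lockspace state
fence_tool ls
dlm_tool ls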


The logs show a string of successful status checks on the active server, nicks, leading up to this:

Jan 21 04:28:08 wonder corosync[2856]: [TOTEM ] A processor failed, forming new configuration.
Jan 21 04:28:09 wonder qdiskd[2873]: Writing eviction notice for node 2
Jan 21 04:28:10 wonder qdiskd[2873]: Node 2 evicted
Jan 21 04:28:11 nicks corosync[2991]: [CMAN ] lost contact with quorum device
Jan 21 04:28:12 wonder corosync[2856]: [CLM ] CLM CONFIGURATION CHANGE
Jan 21 04:28:12 wonder corosync[2856]: [CLM ] New Configuration:
Jan 21 04:28:12 wonder corosync[2856]: [CLM ] r(0) ip(192.168.255.20)
Jan 21 04:28:12 wonder corosync[2856]: [CLM ] Members Left:
Jan 21 04:28:12 wonder corosync[2856]: [CLM ] r(0) ip(192.168.255.21)
Jan 21 04:28:12 wonder corosync[2856]: [CLM ] Members Joined:
Jan 21 04:28:12 wonder corosync[2856]: [QUORUM] This node is within the primary component and will provide service.
Jan 21 04:28:12 wonder corosync[2856]: [QUORUM] Members[1]:
Jan 21 04:28:12 wonder corosync[2856]: [QUORUM] 1
Jan 21 04:28:12 wonder corosync[2856]: [CLM ] CLM CONFIGURATION CHANGE
Jan 21 04:28:12 wonder corosync[2856]: [CLM ] New Configuration:
Jan 21 04:28:12 wonder corosync[2856]: [CLM ] r(0) ip(192.168.255.20)
Jan 21 04:28:12 wonder corosync[2856]: [CLM ] Members Left:
Jan 21 04:28:12 wonder corosync[2856]: [CLM ] Members Joined:
Jan 21 04:28:12 wonder corosync[2856]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jan 21 04:28:12 wonder corosync[2856]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 21 04:28:12 wonder rgmanager[3578]: State change: nicks-p DOWN
Jan 21 04:28:13 wonder kernel: [1298595.738213] dlm: closing connection to node 2
Jan 21 04:28:13 wonder fenced[3206]: fencing node nicks-p
Jan 21 04:28:12 nicks corosync[2991]: [QUORUM] This node is within the primary component and will provide service.
Jan 21 04:28:13 nicks corosync[2991]: [QUORUM] Members[2]:
Jan 21 04:28:13 nicks corosync[2991]: [QUORUM] 1
Jan 21 04:28:13 nicks corosync[2991]: [QUORUM] 2
Jan 21 04:28:16 wonder fenced[3206]: fence nicks-p success
Jan 21 04:28:16 wonder fenced[3206]: fence nicks-p success
Jan 21 04:28:17 wonder rgmanager[3578]: Taking over service service:MailHost from down member nicks-p
Jan 21 04:28:17 wonder bash[13236]: Unknown file system type 'ext4' for device /dev/dm-0. Assuming fsck is required.
Jan 21 04:28:17 wonder bash[13259]: Running fsck on /dev/dm-0
Jan 21 04:28:18 wonder bash[13284]: mounting /dev/dm-0 on /home
Jan 21 04:28:18 wonder bash[13306]: mount -t ext4 -o defaults,noatime,nodiratime /dev/dm-0 /home
Jan 21 04:28:19 wonder bash[13335]: quotaon not found in /bin:/sbin:/usr/bin:/usr/sbin
Jan 21 04:28:19 wonder bash[13335]: quotaon not found in /bin:/sbin:/usr/bin:/usr/sbin
Jan 21 04:28:19 wonder bash[13368]: mounting /dev/dm-1 on /var/cluster
Jan 21 04:28:19 wonder bash[13390]: mount -t ext3 -o defaults /dev/dm-1 /var/cluster
Jan 21 04:28:19 wonder bash[13415]: quotaon not found in /bin:/sbin:/usr/bin:/usr/sbin
Jan 21 04:28:19 wonder bash[13415]: quotaon not found in /bin:/sbin:/usr/bin:/usr/sbin
Jan 21 04:28:20 wonder bash[13467]: Link for eth0: Detected
Jan 21 04:28:20 wonder bash[13489]: Adding IPv4 address 172.25.16.58/22 to eth0
Jan 21 04:28:20 wonder bash[13513]: Sending gratuitous ARP: 172.25.16.58 00:30:48:c6:df:ce brd ff:ff:ff:ff:ff:ff
Jan 21 04:28:21 wonder bash[13551]: Executing /etc/cluster/MailHost-misc-early start
Jan 21 04:28:21 wonder bash[13606]: Executing /etc/cluster/saslauthd-cluster start
Jan 21 04:28:21 wonder bash[13679]: Executing /etc/cluster/postfix-cluster start
Jan 21 04:28:22 wonder bash[13788]: Executing /etc/cluster/dovecot-wrapper start
Jan 21 04:28:22 wonder bash[13850]: Executing /etc/cluster/mailman-wrapper start
Jan 21 04:28:23 wonder bash[13901]: Executing /etc/cluster/apache2-mailhost start
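
Since that TOTEM message seems to mean the token wasn't seen within the token timeout, my next step was going to be capturing the heartbeat traffic on the interconnect to see whether it actually stops. A rough sketch of the capture, assuming the interconnect carrying 192.168.255.0/24 is eth1 and the default mcastport of 5405 (both assumptions, since neither appears in the cluster.conf pasted below):

# Capture corosync totem traffic on the cluster interconnect.
# eth1 and ports 5404/5405 are assumptions here; substitute the interface
# that carries 192.168.255.0/24 and the mcastport if one is configured.
tcpdump -i eth1 -n -s0 -w /tmp/totem.pcap udp port 5404 or udp port 5405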


<?xml version="1.0"?>
<cluster name="alpha" config_version="44">

<cman two_node="0" expected_votes="3">
</cman>

<clusternodes>
<clusternode name="wonder-p" votes="1" nodeid="1">
        <fence>
                <method name="single">
                        <device name="pwr01" option="off"/>
                        <device name="pwr02" option="off"/>
                        <device name="pwr01" option="on"/>
                        <device name="pwr02" option="on"/>
                </method>
        </fence>
</clusternode>
<clusternode name="nicks-p" votes="1" nodeid="2">
        <fence>
                <method name="single">
                        <device name="pwr03" option="off"/>
                        <device name="pwr04" option="off"/>
                        <device name="pwr03" option="on"/>
                        <device name="pwr04" option="on"/>
                </method>
        </fence>
</clusternode>
</clusternodes>

<quorumd interval="1" tko="10" votes="1" label="quorumdisk">
        <heuristic program="ping 172.25.19.254 -c1 -t1" score="1" interval="2" tko="3"/>
</quorumd>

<fence_daemon post_join_delay="20">
</fence_daemon>

<fencedevices>
        <fencedevice agent="fence_apc_snmp" ipaddr="pdu-paul-2-2" port="4" name="pwr01" udpport="161" />
        <fencedevice agent="fence_apc_snmp" ipaddr="pdu-paul-2-3" port="4" name="pwr02" udpport="161" />
        <fencedevice agent="fence_apc_snmp" ipaddr="pdu-paul-2-2" port="3" name="pwr03" udpport="161" />
        <fencedevice agent="fence_apc_snmp" ipaddr="pdu-paul-2-3" port="3" name="pwr04" udpport="161" />
</fencedevices>

<rm>

   <failoverdomains>
           <failoverdomain name="mailcluster" restricted="1" ordered="0" >
                <failoverdomainnode name="wonder-p" priority="1"/>
                <failoverdomainnode name="nicks-p" priority="1"/>
           </failoverdomain>
   </failoverdomains>

   <service name="MailHost" autostart="1" domain="mailcluster" >
           <script name="MailHost-early" file="/etc/cluster/MailHost-misc-early" />
           <fs name="mailhome" mountpoint="/home" device="/dev/dm-0" fstype="ext4" force_unmount="1" active_monitor="1" options="defaults,noatime,nodiratime" />
           <fs name="mailcluster" mountpoint="/var/cluster" device="/dev/dm-1" fstype="ext3" force_unmount="1" active_monitor="1" options="defaults" />
           <ip address="172.25.16.58" monitor_link="1" />
           <script name="saslauthd" file="/etc/cluster/saslauthd-cluster" />
           <script name="postfix" file="/etc/cluster/postfix-cluster" />
           <script name="dovecot" file="/etc/cluster/dovecot-wrapper" __independent_subtree="1" />
           <script name="mailman" file="/etc/cluster/mailman-wrapper" __independent_subtree="1" />
           <script name="apache2-mailhost" file="/etc/cluster/apache2-mailhost" __independent_subtree="1" />
           <script name="usermin" file="/etc/init.d/usermin-sb" __independent_subtree="1" />
           <script name="MailHost-late" file="/etc/cluster/MailHost-misc-late" />
   </service>

</rm>
</cluster>
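
Also, in case the defaults turn out to be too tight for this setup, my understanding is that the token timeout and debug logging can be set directly in cluster.conf, something like the fragment below inside the <cluster> element (illustrative sketch only, with an arbitrary example value, and config_version would need bumping; I have not applied this):

<!-- Illustrative sketch only: raise the totem token timeout (in ms) and
     enable debug logging. 30000 is an arbitrary example, not a recommendation. -->
<totem token="30000"/>
<logging debug="on"/>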


Thanks

Chris
