Please help me figure out why this cluster failed over. This has
happened several times in the past month, whereas previously it had
been quite stable. What can trigger the corosync "[TOTEM ] A processor
failed, forming new configuration." message? By all appearances the
primary server was functioning properly until it was fenced by the
secondary.
I've got cluster3 running on debian lenny 2.6.30-1-amd64:

ii  openais    1.0.0-3local1   Standards-based cluster framework (daemon an
ii  corosync   1.0.0-4         Standards-based cluster framework (daemon an
ii  rgmanager  3.0.0-1~agx0lo  clustered resource group manager
ii  cman       3.0.0-1~agx0lo  cluster manager

A bunch of successful status checks on the active server, nicks,
leading up to:

Jan 21 04:28:08 wonder corosync[2856]: [TOTEM ] A processor failed, forming new configuration.
Jan 21 04:28:09 wonder qdiskd[2873]: Writing eviction notice for node 2
Jan 21 04:28:10 wonder qdiskd[2873]: Node 2 evicted
Jan 21 04:28:11 nicks corosync[2991]: [CMAN ] lost contact with quorum device
Jan 21 04:28:12 wonder corosync[2856]: [CLM ] CLM CONFIGURATION CHANGE
Jan 21 04:28:12 wonder corosync[2856]: [CLM ] New Configuration:
Jan 21 04:28:12 wonder corosync[2856]: [CLM ] r(0) ip(192.168.255.20)
Jan 21 04:28:12 wonder corosync[2856]: [CLM ] Members Left:
Jan 21 04:28:12 wonder corosync[2856]: [CLM ] r(0) ip(192.168.255.21)
Jan 21 04:28:12 wonder corosync[2856]: [CLM ] Members Joined:
Jan 21 04:28:12 wonder corosync[2856]: [QUORUM] This node is within the primary component and will provide service.
Jan 21 04:28:12 wonder corosync[2856]: [QUORUM] Members[1]:
Jan 21 04:28:12 wonder corosync[2856]: [QUORUM] 1
Jan 21 04:28:12 wonder corosync[2856]: [CLM ] CLM CONFIGURATION CHANGE
Jan 21 04:28:12 wonder corosync[2856]: [CLM ] New Configuration:
Jan 21 04:28:12 wonder corosync[2856]: [CLM ] r(0) ip(192.168.255.20)
Jan 21 04:28:12 wonder corosync[2856]: [CLM ] Members Left:
Jan 21 04:28:12 wonder corosync[2856]: [CLM ] Members Joined:
Jan 21 04:28:12 wonder corosync[2856]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Jan 21 04:28:12 wonder corosync[2856]: [MAIN ] Completed service synchronization, ready to provide service.
Jan 21 04:28:12 wonder rgmanager[3578]: State change: nicks-p DOWN
Jan 21 04:28:13 wonder kernel: [1298595.738213] dlm: closing connection to node 2
Jan 21 04:28:13 wonder fenced[3206]: fencing node nicks-p
Jan 21 04:28:12 nicks corosync[2991]: [QUORUM] This node is within the primary component and will provide service.
Jan 21 04:28:13 nicks corosync[2991]: [QUORUM] Members[2]:
Jan 21 04:28:13 nicks corosync[2991]: [QUORUM] 1
Jan 21 04:28:13 nicks corosync[2991]: [QUORUM] 2
Jan 21 04:28:16 wonder fenced[3206]: fence nicks-p success
Jan 21 04:28:16 wonder fenced[3206]: fence nicks-p success
Jan 21 04:28:17 wonder rgmanager[3578]: Taking over service service:MailHost from down member nicks-p
Jan 21 04:28:17 wonder bash[13236]: Unknown file system type 'ext4' for device /dev/dm-0. Assuming fsck is required.
Jan 21 04:28:17 wonder bash[13259]: Running fsck on /dev/dm-0
Jan 21 04:28:18 wonder bash[13284]: mounting /dev/dm-0 on /home
Jan 21 04:28:18 wonder bash[13306]: mount -t ext4 -o defaults,noatime,nodiratime /dev/dm-0 /home
Jan 21 04:28:19 wonder bash[13335]: quotaon not found in /bin:/sbin:/usr/bin:/usr/sbin
Jan 21 04:28:19 wonder bash[13335]: quotaon not found in /bin:/sbin:/usr/bin:/usr/sbin
Jan 21 04:28:19 wonder bash[13368]: mounting /dev/dm-1 on /var/cluster
Jan 21 04:28:19 wonder bash[13390]: mount -t ext3 -o defaults /dev/dm-1 /var/cluster
Jan 21 04:28:19 wonder bash[13415]: quotaon not found in /bin:/sbin:/usr/bin:/usr/sbin
Jan 21 04:28:19 wonder bash[13415]: quotaon not found in /bin:/sbin:/usr/bin:/usr/sbin
Jan 21 04:28:20 wonder bash[13467]: Link for eth0: Detected
Jan 21 04:28:20 wonder bash[13489]: Adding IPv4 address 172.25.16.58/22 to eth0
Jan 21 04:28:20 wonder bash[13513]: Sending gratuitous ARP: 172.25.16.58 00:30:48:c6:df:ce brd ff:ff:ff:ff:ff:ff
Jan 21 04:28:21 wonder bash[13551]: Executing /etc/cluster/MailHost-misc-early start
Jan 21 04:28:21 wonder bash[13606]: Executing /etc/cluster/saslauthd-cluster start
Jan 21 04:28:21 wonder bash[13679]: Executing /etc/cluster/postfix-cluster start
Jan 21 04:28:22 wonder bash[13788]: Executing /etc/cluster/dovecot-wrapper start
Jan 21 04:28:22 wonder bash[13850]: Executing /etc/cluster/mailman-wrapper start
Jan 21 04:28:23 wonder bash[13901]: Executing /etc/cluster/apache2-mailhost start

Our cluster.conf:

<?xml version="1.0"?>
<cluster name="alpha" config_version="44">
  <cman two_node="0" expected_votes="3"> </cman>
  <clusternodes>
    <clusternode name="wonder-p" votes="1" nodeid="1">
      <fence>
        <method name="single">
          <device name="pwr01" option="off"/>
          <device name="pwr02" option="off"/>
          <device name="pwr01" option="on"/>
          <device name="pwr02" option="on"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="nicks-p" votes="1" nodeid="2">
      <fence>
        <method name="single">
          <device name="pwr03" option="off"/>
          <device name="pwr04" option="off"/>
          <device name="pwr03" option="on"/>
          <device name="pwr04" option="on"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <quorumd interval="1" tko="10" votes="1" label="quorumdisk">
    <heuristic program="ping 172.25.19.254 -c1 -t1" score="1" interval="2" tko="3"/>
  </quorumd>
  <fence_daemon post_join_delay="20"> </fence_daemon>
  <fencedevices>
    <fencedevice agent="fence_apc_snmp" ipaddr="pdu-paul-2-2" port="4" name="pwr01" udpport="161" />
    <fencedevice agent="fence_apc_snmp" ipaddr="pdu-paul-2-3" port="4" name="pwr02" udpport="161" />
    <fencedevice agent="fence_apc_snmp" ipaddr="pdu-paul-2-2" port="3" name="pwr03" udpport="161" />
    <fencedevice agent="fence_apc_snmp" ipaddr="pdu-paul-2-3" port="3" name="pwr04" udpport="161" />
  </fencedevices>
  <rm>
    <failoverdomains>
      <failoverdomain name="mailcluster" restricted="1" ordered="0" >
        <failoverdomainnode name="wonder-p" priority="1"/>
        <failoverdomainnode name="nicks-p" priority="1"/>
      </failoverdomain>
    </failoverdomains>
    <service name="MailHost" autostart="1" domain="mailcluster" >
      <script name="MailHost-early" file="/etc/cluster/MailHost-misc-early" />
      <fs name="mailhome" mountpoint="/home" device="/dev/dm-0" fstype="ext4" force_unmount="1" active_monitor="1" options="defaults,noatime,nodiratime" />
      <fs name="mailcluster" mountpoint="/var/cluster" device="/dev/dm-1" fstype="ext3" force_unmount="1" active_monitor="1" options="defaults" />
      <ip address="172.25.16.58" monitor_link="1" />
      <script name="saslauthd" file="/etc/cluster/saslauthd-cluster" />
      <script name="postfix" file="/etc/cluster/postfix-cluster" />
      <script name="dovecot" file="/etc/cluster/dovecot-wrapper" __independent_subtree="1" />
      <script name="mailman" file="/etc/cluster/mailman-wrapper" __independent_subtree="1" />
      <script name="apache2-mailhost" file="/etc/cluster/apache2-mailhost" __independent_subtree="1" />
      <script name="usermin" file="/etc/init.d/usermin-sb" __independent_subtree="1" />
      <script name="MailHost-late" file="/etc/cluster/MailHost-misc-late" />
    </service>
  </rm>
</cluster>

Thanks,
Chris