corosync ring failure

"C. Handel" <christoph@xxxxxxxxxxxxxx> · Wed, 23 Jul 2014 17:53:56 +0200

Hi,

I run a cluster with two corosync rings. One of the rings is marked
faulty every forty seconds and recovers immediately, about a second later.
The other ring is stable.

I have no idea how to debug this.

We are running SL 6.5 with pacemaker 1.1.10, cman 3.0.12 and corosync 1.4.1.
The cluster consists of three machines. Ring 1 runs on 10 Gbit
interfaces, ring 0 on 1 Gbit interfaces. Neither ring's traffic leaves
its respective switch.

Corosync communication is udpu; rrp_mode is passive.

cluster.conf:

<cluster config_version="30" name="aslfile">

<cman transport="udpu">
</cman>

<fence_daemon post_join_delay="120" post_fail_delay="30"/>

<fencedevices>
        <fencedevice name="pcmk" agent="fence_pcmk" action="off"/>
</fencedevices>

<quorumd
   cman_label="qdisk"
   device="/dev/mapper/mpath-091quorump1"
   min_score="1"
   votes="2"
   >
</quorumd>

<clusternodes>
<clusternode name="asl430m90" nodeid="430">
        <altname name="asl430"/>
        <fence>
                <method name="pcmk-redirect">
                        <device name="pcmk" port="asl430m90"/>
                </method>
        </fence>
</clusternode>
<clusternode name="asl431m90" nodeid="431">
        <altname name="asl431"/>
        <fence>
                <method name="pcmk-redirect">
                        <device name="pcmk" port="asl431m90"/>
                </method>
        </fence>
</clusternode>
<clusternode name="asl432m90" nodeid="432">
        <altname name="asl432"/>
        <fence>
                <method name="pcmk-redirect">
                        <device name="pcmk" port="asl432m90"/>
                </method>
        </fence>
</clusternode>
</clusternodes>
</cluster>
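In a cman redundant-ring setup, each clusternode `name` is the node's ring 0 address and its `altname` is the ring 1 address. A minimal sketch (Python 3, standard library only) that lists this mapping from the cluster.conf above, trimmed to the elements that define the rings, so every name can be checked against the network it is supposed to be on. `ring_map` is a hypothetical helper, not part of cman or corosync.

```python
import xml.etree.ElementTree as ET

# The cluster.conf from above, trimmed to the elements that define the rings.
CLUSTER_CONF = """<cluster config_version="30" name="aslfile">
<cman transport="udpu">
</cman>
<clusternodes>
<clusternode name="asl430m90" nodeid="430">
        <altname name="asl430"/>
</clusternode>
<clusternode name="asl431m90" nodeid="431">
        <altname name="asl431"/>
</clusternode>
<clusternode name="asl432m90" nodeid="432">
        <altname name="asl432"/>
</clusternode>
</clusternodes>
</cluster>"""


def ring_map(conf_xml):
    """Return (transport, [(nodeid, ring0_name, ring1_name), ...])."""
    root = ET.fromstring(conf_xml)
    cman = root.find("cman")
    # None when cman does not set the transport explicitly
    transport = cman.get("transport") if cman is not None else None
    nodes = []
    for node in root.iter("clusternode"):
        alt = node.find("altname")
        nodes.append((
            int(node.get("nodeid")),
            node.get("name"),                              # ring 0 name
            alt.get("name") if alt is not None else None,  # ring 1 name
        ))
    return transport, nodes


if __name__ == "__main__":
    transport, nodes = ring_map(CLUSTER_CONF)
    print("transport:", transport)
    for nodeid, ring0, ring1 in nodes:
        print(f"node {nodeid}: ring0={ring0} ring1={ring1}")
```

Resolving each printed name on every node (for example with `getent hosts asl431`) should then show the ring 1 names on the 10 Gbit network and the ring 0 names on the 1 Gbit network, as described above.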

syslog (from asl431):

Jul 23 17:48:34 asl431 corosync[3254]:   [TOTEM ] Marking ringid 1 interface 140.181.134.212 FAULTY
Jul 23 17:48:35 asl431 corosync[3254]:   [TOTEM ] Automatically recovered ring 1
Jul 23 17:48:35 asl431 corosync[3254]:   [TOTEM ] Automatically recovered ring 1
Jul 23 17:48:35 asl431 corosync[3254]:   [TOTEM ] Automatically recovered ring 1
Jul 23 17:49:14 asl431 corosync[3254]:   [TOTEM ] Marking ringid 1 interface 140.181.134.212 FAULTY
Jul 23 17:49:15 asl431 corosync[3254]:   [TOTEM ] Automatically recovered ring 1
Jul 23 17:49:15 asl431 corosync[3254]:   [TOTEM ] Automatically recovered ring 1
Jul 23 17:49:15 asl431 corosync[3254]:   [TOTEM ] Automatically recovered ring 1
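To check that the forty-second period and the one-second recovery hold over a longer stretch of log than two events, the ring messages can be extracted and the gaps measured. A minimal sketch, assuming Python 3.6+, log lines in the format shown above (with the wrapped "Marking ringid" lines rejoined) and a C/English locale for month names. Syslog omits the year, so it is assumed to be 2014 from the mail date. `ring_events` and `summarize` are hypothetical helpers, not part of corosync.

```python
import re
from datetime import datetime

# Excerpt from the syslog above, with the wrapped "Marking ringid" lines rejoined.
LOG = """\
Jul 23 17:48:34 asl431 corosync[3254]:   [TOTEM ] Marking ringid 1 interface 140.181.134.212 FAULTY
Jul 23 17:48:35 asl431 corosync[3254]:   [TOTEM ] Automatically recovered ring 1
Jul 23 17:48:35 asl431 corosync[3254]:   [TOTEM ] Automatically recovered ring 1
Jul 23 17:48:35 asl431 corosync[3254]:   [TOTEM ] Automatically recovered ring 1
Jul 23 17:49:14 asl431 corosync[3254]:   [TOTEM ] Marking ringid 1 interface 140.181.134.212 FAULTY
Jul 23 17:49:15 asl431 corosync[3254]:   [TOTEM ] Automatically recovered ring 1
Jul 23 17:49:15 asl431 corosync[3254]:   [TOTEM ] Automatically recovered ring 1
Jul 23 17:49:15 asl431 corosync[3254]:   [TOTEM ] Automatically recovered ring 1
"""

# Assumption: syslog has no year, so take it from the mail date.
# Deltas across a New Year boundary would come out wrong.
YEAR = 2014

TS_RE = re.compile(r"^(\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) ")
FAULTY_RE = re.compile(r"Marking ringid (\d+) interface (\S+) FAULTY")
RECOVERED_RE = re.compile(r"Automatically recovered ring (\d+)")


def ring_events(lines):
    """Return a list of (timestamp, ring, kind) for FAULTY and recovery messages."""
    events = []
    for line in lines:
        ts_match = TS_RE.match(line)
        if not ts_match:
            continue
        ts = datetime.strptime(f"{YEAR} {ts_match.group(1)}", "%Y %b %d %H:%M:%S")
        faulty = FAULTY_RE.search(line)
        recovered = RECOVERED_RE.search(line)
        if faulty:
            events.append((ts, int(faulty.group(1)), "faulty"))
        elif recovered:
            events.append((ts, int(recovered.group(1)), "recovered"))
    return events


def summarize(events, ring):
    """Seconds between FAULTY markings, and from each marking to its first recovery."""
    faulty = [ts for ts, r, kind in events if r == ring and kind == "faulty"]
    recovered = [ts for ts, r, kind in events if r == ring and kind == "recovered"]
    periods = [(b - a).total_seconds() for a, b in zip(faulty, faulty[1:])]
    delays = []
    for marked in faulty:
        later = [ts for ts in recovered if ts >= marked]
        if later:
            delays.append((min(later) - marked).total_seconds())
    return periods, delays


if __name__ == "__main__":
    events = ring_events(LOG.splitlines())
    periods, delays = summarize(events, ring=1)
    print("seconds between FAULTY markings:", periods)  # [40.0] for this excerpt
    print("seconds until recovery:", delays)            # [1.0, 1.0] for this excerpt
```

On a real node, pass the lines of the syslog file instead of `LOG`; a period that stays exactly constant over hours is easier to correlate with other periodic activity on the hosts or the switch than two isolated events.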

Greetings
   Christoph

-- 
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster