Re: Strange corosync fail ...


 



cluster lab wrote:
On Mon, Feb 24, 2014 at 11:23 AM, Fabio M. Di Nitto <fdinitto@xxxxxxxxxx> wrote:
On 2/24/2014 8:47 AM, cluster lab wrote:
On Sun, Feb 23, 2014 at 10:40 PM, Fabio M. Di Nitto <fdinitto@xxxxxxxxxx> wrote:
On 02/23/2014 12:59 PM, cluster lab wrote:



On Sat, Feb 22, 2014 at 2:28 PM, Fabio M. Di Nitto <fdinitto@xxxxxxxxxx> wrote:

     On 02/22/2014 11:10 AM, cluster lab wrote:
     > hi,
     >
     > At the middle of cluster activity i received this messages: (cluster
     > is 3 node with SAN ... GFS2 filesystem)

     OS? version of the packages? cluster.conf


OS: SL (Scientific Linux 6),

Packages:
kernel-2.6.32-71.29.1.el6.x86_64
rgmanager-3.0.12.1-12.el6.x86_64
cman-3.0.12-23.el6.x86_64
corosynclib-1.2.3-21.el6.x86_64
corosync-1.2.3-21.el6.x86_64


^^^^ This really is the corosync shipped with SL 6.0 GOLD. It is unsupported and known to be pretty buggy (if this problem is the only one you have hit, you are a pretty lucky guy).

Please update to something a little less ancient.

Regards,
  Honza
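
For reference, a minimal update path on an EL6-family system, assuming the stock distribution repositories and updating one node at a time (with that node taken out of the cluster first), would look something like this:

    # check what is currently installed
    rpm -q corosync corosynclib cman rgmanager

    # stop the cluster stack on this node, then pull in the updates
    service rgmanager stop
    service cman stop
    yum update corosync corosynclib cman rgmanager

After a reboot (the running kernel here is also very old), rejoin the node and move on to the next one.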

Cluster.conf:

<?xml version="1.0"?>
<cluster config_version="224" name="USBackCluster">
         <fence_daemon clean_start="0" post_fail_delay="10" post_join_delay="3"/>
         <clusternodes>
                 <clusternode name="USBack-prox1" nodeid="1" votes="1">
                         <fence>
                                 <method name="ilo">
                                         <device name="USBack-prox1-ilo"/>
                                 </method>
                         </fence>
                 </clusternode>
                 <clusternode name="USBack-prox2" nodeid="2" votes="1">
                         <fence>
                                 <method name="ilo">
                                         <device name="USBack-prox2-ilo"/>
                                 </method>
                         </fence>
                 </clusternode>
                 <clusternode name="USBack-prox3" nodeid="3" votes="1">
                         <fence>
                                 <method name="ilo">
                                         <device name="USBack-prox3-ilo"/>
                                 </method>
                         </fence>
                 </clusternode>
         </clusternodes>
         <cman/>
         <fencedevices>
                 ... fence config ...
         </fencedevices>
         <rm>
                 <failoverdomains>
                         <failoverdomain name="VMS-Area" nofailback="0" ordered="0" restricted="0">
                                 <failoverdomainnode name="USBack-prox1" priority="1"/>
                                 <failoverdomainnode name="USBack-prox2" priority="1"/>
                                 <failoverdomainnode name="USBack-prox3" priority="1"/>
                         </failoverdomain>
                 </failoverdomains>
                 <resources>
     ....
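
As an aside, whenever cluster.conf is edited it can be sanity-checked and the running cluster told to pick it up with the stock RHEL6/SL6 tools; a quick sketch, assuming the cman utilities are installed:

    # validate /etc/cluster/cluster.conf against the schema
    ccs_config_validate

    # after bumping config_version, have the running cluster re-read it
    # (how the file gets to the other nodes depends on the local setup)
    cman_tool version -r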




     >
     > log messages on USBAck-prox2:
     >
     > Feb 21 13:06:41 USBack-prox2 corosync[3911]: [QUORUM] Members[2]: 2 3
     > Feb 21 13:06:41 USBack-prox2 corosync[3911]: [TOTEM ] A processor
     > joined or left the membership and a new membership was formed.
     > Feb 21 13:06:41 USBack-prox2 rgmanager[4130]: State change:
     USBack-prox1 DOWN
     > Feb 21 13:06:41 USBack-prox2 kernel: dlm: closing connection to node 1
     > Feb 21 13:06:41 USBack-prox2 corosync[3911]: [CPG ] downlist received
     > left_list: 1
     > Feb 21 13:06:41 USBack-prox2 corosync[3911]: [CPG ] downlist received
     > left_list: 1
     > Feb 21 13:06:41 USBack-prox2 corosync[3911]: [CPG ] chosen downlist
     > from node r(0) ip(--.--.--.22)
     > Feb 21 13:06:41 USBack-prox2 corosync[3911]: [MAIN ] Completed service
     > synchronization, ready to provide service.
     > Feb 21 13:06:41 USBack-prox2 kernel: GFS2:
     > fsid=USBackCluster:VMStorage1.0: jid=1: Trying to acquire journal
     > lock...
     > Feb 21 13:06:41 USBack-prox2 kernel: GFS2:
     > fsid=USBackCluster:VMStorage2.0: jid=1: Trying to acquire journal
     > lock...
     > Feb 21 13:06:51 USBack-prox2 fenced[3957]: fencing node USBack-prox1
     > Feb 21 13:06:52 USBack-prox2 fenced[3957]: fence USBack-prox1 dev 0.0
     > agent fence_ipmilan result: error from agent
     > Feb 21 13:06:52 USBack-prox2 fenced[3957]: fence USBack-prox1 failed
     > Feb 21 13:06:54 USBack-prox2 kernel: dlm: connect from non cluster node
     > Feb 21 13:06:54 USBack-prox2 kernel: dlm: connect from non cluster node

     ^^^ good hint here. something is off.


?

It means that something on that network is trying to connect to the
cluster node without being a cluster node.

Fabio
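
If it helps to track that down, the DLM listens on TCP port 21064 by default, so the unexpected peer should be visible on the wire; a quick sketch, assuming that default port and an interface named eth0 (adjust both for the real setup):

    # watch live connection attempts to the dlm port
    tcpdump -nn -i eth0 'tcp port 21064'

    # list sockets currently open on that port
    ss -tnp | grep 21064

The source address of the stray connection identifies the machine that is not (or is no longer seen as) a cluster member.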

There is no node on the cluster network other than the cluster nodes.
I think node #1 is retrying to reconnect its dlm connections and failing.

There are two tries on node #1:
Feb 21 13:06:54 USBack-prox1 kernel: dlm: connecting to 3
Feb 21 13:06:54 USBack-prox1 kernel: dlm: connecting to 2
Feb 21 13:06:54 USBack-prox1 kernel: dlm: connecting to 2

Can you please check that iptables are set correctly and that traffic
between nodes is not behind NAT?

Fabio

iptables is disabled, and traffic between the cluster nodes is flat,
without any NAT.
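
For completeness, that is quick to re-verify on each node, assuming the stock EL6 firewall service:

    service iptables status   # should report the firewall is not running
    iptables -L -n            # chains should be empty with ACCEPT policy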



Logs on Node#1:
Feb 21 13:06:47 USBack-prox1 corosync[3015]:   [TOTEM ] A processor
failed, forming new configuration.
Feb 21 13:06:51 USBack-prox1 kernel: dlm: connecting to 3
Feb 21 13:06:54 USBack-prox1 kernel: dlm: connecting to 2
Feb 21 13:06:54 USBack-prox1 kernel: dlm: connecting to 2
Feb 21 13:06:55 USBack-prox1 corosync[3015]:   [QUORUM] Members[2]: 1 3
Feb 21 13:06:55 USBack-prox1 corosync[3015]:   [CMAN  ] quorum lost,
blocking activity
Feb 21 13:06:55 USBack-prox1 corosync[3015]:   [QUORUM] This node is
within the non-primary component and will NOT provide any services.
Feb 21 13:06:55 USBack-prox1 corosync[3015]:   [QUORUM] Members[1]: 1
Feb 21 13:06:55 USBack-prox1 corosync[3015]:   [TOTEM ] A processor
joined or left the membership and a new membership was formed.
Feb 21 13:06:55 USBack-prox1 corosync[3015]:   [CMAN  ] quorum
regained, resuming activity
Feb 21 13:06:55 USBack-prox1 corosync[3015]:   [QUORUM] This node is
within the primary component and will provide service.
Feb 21 13:06:55 USBack-prox1 corosync[3015]:   [QUORUM] Members[2]: 1 2
Feb 21 13:06:55 USBack-prox1 corosync[3015]:   [QUORUM] Members[2]: 1 2
Feb 21 13:06:55 USBack-prox1 corosync[3015]:   [QUORUM] Members[3]: 1 2 3
Feb 21 13:06:55 USBack-prox1 corosync[3015]:   [QUORUM] Members[3]: 1 2 3
Feb 21 13:06:55 USBack-prox1 corosync[3015]:   [CPG   ] downlist
received left_list: 2
Feb 21 13:06:55 USBack-prox1 corosync[3015]:   [CPG   ] downlist
received left_list: 0
Feb 21 13:06:55 USBack-prox1 corosync[3015]:   [CPG   ] downlist
received left_list: 0
Feb 21 13:06:55 USBack-prox1 corosync[3015]:   [CPG   ] chosen
downlist from node r(0) ip(--.--.--.21)
Feb 21 13:06:55 USBack-prox1 corosync[3015]:   [MAIN  ] Completed
service synchronization, ready to provide service.


Logs on Node#3:
Feb 21 13:06:41 USBack-prox3 corosync[2956]:   [QUORUM] Members[2]: 2 3
Feb 21 13:06:41 USBack-prox3 corosync[2956]:   [TOTEM ] A processor
joined or left the membership and a new membership was formed.
Feb 21 13:06:41 USBack-prox3 rgmanager[3177]: State change: USBack-prox1 DOWN
Feb 21 13:06:41 USBack-prox3 corosync[2956]:   [CPG   ] downlist
received left_list: 1
Feb 21 13:06:41 USBack-prox3 corosync[2956]:   [CPG   ] downlist
received left_list: 1
Feb 21 13:06:41 USBack-prox3 corosync[2956]:   [CPG   ] chosen
downlist from node r(0) ip(--.--.--.22)
Feb 21 13:06:41 USBack-prox3 corosync[2956]:   [MAIN  ] Completed
service synchronization, ready to provide service.
Feb 21 13:06:41 USBack-prox3 kernel: dlm: closing connection to node 1
Feb 21 13:06:41 USBack-prox3 fenced[3008]: fencing deferred to USBack-prox2
Feb 21 13:06:41 USBack-prox3 kernel: GFS2:
fsid=USBackCluster:VMStorage1.2: jid=1: Trying to acquire journal
lock...
Feb 21 13:06:41 USBack-prox3 kernel: GFS2:
fsid=USBackCluster:VMStorage2.2: jid=1: Trying to acquire journal
lock...
Feb 21 13:06:51 USBack-prox3 kernel: dlm: connect from non cluster node
Feb 21 13:06:55 USBack-prox3 corosync[2956]:   [TOTEM ] A processor
joined or left the membership and a new membership was formed.
Feb 21 13:06:55 USBack-prox3 corosync[2956]:   [QUORUM] Members[3]: 1 2 3
Feb 21 13:06:55 USBack-prox3 corosync[2956]:   [QUORUM] Members[3]: 1 2 3
Feb 21 13:06:55 USBack-prox3 rgmanager[3177]: State change: USBack-prox1 UP
Feb 21 13:06:55 USBack-prox3 corosync[2956]:   [CPG   ] downlist
received left_list: 2
Feb 21 13:06:55 USBack-prox3 corosync[2956]:   [CPG   ] downlist
received left_list: 0
Feb 21 13:06:55 USBack-prox3 corosync[2956]:   [CPG   ] downlist
received left_list: 0
Feb 21 13:06:55 USBack-prox3 corosync[2956]:   [CPG   ] chosen
downlist from node r(0) ip(--.--.--.21)
Feb 21 13:06:55 USBack-prox3 corosync[2956]:   [MAIN  ] Completed
service synchronization, ready to provide service.
Feb 21 13:06:55 USBack-prox3 fenced[3008]: cpg_mcast_joined error 12
handle 4e6afb6600000000 protocol
Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined
error 12 handle 3a95f87400000000 protocol
Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined
error 12 handle 1e7ff52100000001 start
Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined
error 12 handle 22221a7000000002 start
Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined
error 12 handle 419ac24100000003 start
Feb 21 13:06:55 USBack-prox3 fenced[3008]: cpg_mcast_joined error 12
handle 440badfc00000001 start
Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined
error 12 handle 3804823e00000004 start
Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined
error 12 handle 2463b9ea00000005 start
Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined
error 12 handle 22221a7000000002 start
Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined
error 12 handle 419ac24100000003 start
Feb 21 13:06:55 USBack-prox3 dlm_controld[3034]: cpg_mcast_joined
error 12 handle 440badfc00000000 protocol
Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined
error 12 handle 3804823e00000004 start
Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined
error 12 handle 2463b9ea00000005 start
Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined
error 12 handle 1e7ff52100000001 start
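
For what it's worth, cpg_mcast_joined() returns corosync's cs_error_t codes, and if I read the corosync 1.x headers right, error 12 maps to CS_ERR_NOT_EXIST, which would suggest the daemons' CPG handles were no longer valid after the membership change. Assuming corosynclib-devel is installed, the mapping can be checked directly:

    # list the cs_error_t values shipped with the installed headers
    grep -n 'CS_ERR' /usr/include/corosync/corotypes.h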





     Fabio

     > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [TOTEM ] A processor
     > joined or left the membership and a new membership was formed.
     > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [QUORUM] Members[3]:
     1 2 3
     > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [QUORUM] Members[3]:
     1 2 3
     > Feb 21 13:06:55 USBack-prox2 rgmanager[4130]: State change:
     USBack-prox1 UP
     > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [CPG ] downlist received
     > left_list: 2
     > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [CPG ] downlist received
     > left_list: 0
     > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [CPG ] downlist received
     > left_list: 0
     > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [CPG ] chosen downlist
     > from node r(0) ip(--.--.--.21)
     > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [MAIN ] Completed service
     > synchronization, ready to provide service.
     > Feb 21 13:06:55 USBack-prox2 gfs_controld[4029]: cpg_mcast_joined
     > error 12 handle 3a95f87400000000 protocol
     > Feb 21 13:06:55 USBack-prox2 gfs_controld[4029]: cpg_mcast_joined
     > error 12 handle 1e7ff52100000001 start
     > Feb 21 13:06:55 USBack-prox2 gfs_controld[4029]: cpg_mcast_joined
     > error 12 handle 22221a7000000002 start
     > Feb 21 13:06:55 USBack-prox2 gfs_controld[4029]: cpg_mcast_joined
     > error 12 handle 419ac24100000003 start
     > Feb 21 13:06:55 USBack-prox2 gfs_controld[4029]: cpg_mcast_joined
     > error 12 handle 3804823e00000004 start
     >
     >
     > -------------------------------------------------
     > Then GFS2 generates error logs (all activity is blocked).
     >
     > Logs from the Cisco switch (time is UTC):
     >
     > Feb 21 09:37:02.375: %LINEPROTO-5-UPDOWN: Line protocol on Interface
     > GigabitEthernet0/11, changed state to down
     > Feb 21 09:37:02.459: %LINEPROTO-5-UPDOWN: Line protocol on Interface
     > GigabitEthernet0/4, changed state to down
     > Feb 21 09:37:03.382: %LINK-3-UPDOWN: Interface GigabitEthernet0/11,
     > changed state to down
     > Feb 21 09:37:03.541: %LINK-3-UPDOWN: Interface GigabitEthernet0/4,
     > changed state to down
     > Feb 21 09:37:07.283: %LINK-3-UPDOWN: Interface GigabitEthernet0/11,
     > changed state to up
     > Feb 21 09:37:07.350: %LINK-3-UPDOWN: Interface GigabitEthernet0/4,
     > changed state to up
     > Feb 21 09:37:08.289: %LINEPROTO-5-UPDOWN: Line protocol on Interface
     > GigabitEthernet0/11, changed state to up
     > Feb 21 09:37:09.472: %LINEPROTO-5-UPDOWN: Line protocol on Interface
     > GigabitEthernet0/4, changed state to up
     > Feb 21 09:40:20.045: %LINEPROTO-5-UPDOWN: Line protocol on Interface
     > GigabitEthernet0/11, changed state to down
     > Feb 21 09:40:21.043: %LINK-3-UPDOWN: Interface GigabitEthernet0/11,
     > changed state to down
     > Feb 21 09:40:23.401: %LINK-3-UPDOWN: Interface GigabitEthernet0/11,
     > changed state to up






_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss



