On 2/24/2014 8:47 AM, cluster lab wrote:
> On Sun, Feb 23, 2014 at 10:40 PM, Fabio M. Di Nitto <fdinitto@xxxxxxxxxx> wrote:
>> On 02/23/2014 12:59 PM, cluster lab wrote:
>>>
>>> On Sat, Feb 22, 2014 at 2:28 PM, Fabio M. Di Nitto <fdinitto@xxxxxxxxxx> wrote:
>>>
>>> On 02/22/2014 11:10 AM, cluster lab wrote:
>>> > Hi,
>>> >
>>> > In the middle of cluster activity I received these messages (the cluster
>>> > is 3 nodes with a SAN ... GFS2 filesystem):
>>>
>>> OS? version of the packages? cluster.conf
>>>
>>> OS: SL (Scientific Linux 6)
>>>
>>> Packages:
>>> kernel-2.6.32-71.29.1.el6.x86_64
>>> rgmanager-3.0.12.1-12.el6.x86_64
>>> cman-3.0.12-23.el6.x86_64
>>> corosynclib-1.2.3-21.el6.x86_64
>>> corosync-1.2.3-21.el6.x86_64
>>>
>>> Cluster.conf:
>>>
>>> <?xml version="1.0"?>
>>> <cluster config_version="224" name="USBackCluster">
>>>     <fence_daemon clean_start="0" post_fail_delay="10" post_join_delay="3"/>
>>>     <clusternodes>
>>>         <clusternode name="USBack-prox1" nodeid="1" votes="1">
>>>             <fence>
>>>                 <method name="ilo">
>>>                     <device name="USBack-prox1-ilo"/>
>>>                 </method>
>>>             </fence>
>>>         </clusternode>
>>>         <clusternode name="USBack-prox2" nodeid="2" votes="1">
>>>             <fence>
>>>                 <method name="ilo">
>>>                     <device name="USBack-prox2-ilo"/>
>>>                 </method>
>>>             </fence>
>>>         </clusternode>
>>>         <clusternode name="USBack-prox3" nodeid="3" votes="1">
>>>             <fence>
>>>                 <method name="ilo">
>>>                     <device name="USBack-prox3-ilo"/>
>>>                 </method>
>>>             </fence>
>>>         </clusternode>
>>>     </clusternodes>
>>>     <cman/>
>>>     <fencedevices>
>>>         ... fence config ...
>>>     </fencedevices>
>>>     <rm>
>>>         <failoverdomains>
>>>             <failoverdomain name="VMS-Area" nofailback="0" ordered="0" restricted="0">
>>>                 <failoverdomainnode name="USBack-prox1" priority="1"/>
>>>                 <failoverdomainnode name="USBack-prox2" priority="1"/>
>>>                 <failoverdomainnode name="USBack-prox3" priority="1"/>
>>>             </failoverdomain>
>>>         </failoverdomains>
>>>         <resources>
>>>         ....
>>>
>>> >
>>> > Log messages on USBack-prox2:
>>> >
>>> > Feb 21 13:06:41 USBack-prox2 corosync[3911]: [QUORUM] Members[2]: 2 3
>>> > Feb 21 13:06:41 USBack-prox2 corosync[3911]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>> > Feb 21 13:06:41 USBack-prox2 rgmanager[4130]: State change: USBack-prox1 DOWN
>>> > Feb 21 13:06:41 USBack-prox2 kernel: dlm: closing connection to node 1
>>> > Feb 21 13:06:41 USBack-prox2 corosync[3911]: [CPG ] downlist received left_list: 1
>>> > Feb 21 13:06:41 USBack-prox2 corosync[3911]: [CPG ] downlist received left_list: 1
>>> > Feb 21 13:06:41 USBack-prox2 corosync[3911]: [CPG ] chosen downlist from node r(0) ip(--.--.--.22)
>>> > Feb 21 13:06:41 USBack-prox2 corosync[3911]: [MAIN ] Completed service synchronization, ready to provide service.
>>> > Feb 21 13:06:41 USBack-prox2 kernel: GFS2: fsid=USBackCluster:VMStorage1.0: jid=1: Trying to acquire journal lock...
>>> > Feb 21 13:06:41 USBack-prox2 kernel: GFS2: fsid=USBackCluster:VMStorage2.0: jid=1: Trying to acquire journal lock...
>>> > Feb 21 13:06:51 USBack-prox2 fenced[3957]: fencing node USBack-prox1
>>> > Feb 21 13:06:52 USBack-prox2 fenced[3957]: fence USBack-prox1 dev 0.0 agent fence_ipmilan result: error from agent
>>> > Feb 21 13:06:52 USBack-prox2 fenced[3957]: fence USBack-prox1 failed
>>> > Feb 21 13:06:54 USBack-prox2 kernel: dlm: connect from non cluster node
>>> > Feb 21 13:06:54 USBack-prox2 kernel: dlm: connect from non cluster node
>>>
>>> ^^^ good hint here. something is off.
>>>
>>> ?
>>
>> It means that there is something in that network that tries to connect
>> to the cluster node, without being a cluster node.
>>
>> Fabio
>
> There is no node in the cluster network other than the cluster nodes;
> I think node #1 retries to reconnect to DLM and can't.
>
> There are two tries on node #1:
> Feb 21 13:06:54 USBack-prox1 kernel: dlm: connecting to 3
> Feb 21 13:06:54 USBack-prox1 kernel: dlm: connecting to 2
> Feb 21 13:06:54 USBack-prox1 kernel: dlm: connecting to 2

Can you please check that iptables are set correctly and that traffic
between the nodes is not behind NAT? There is a rough sketch of the checks
I mean at the very end of this mail, below the quoted logs.

Fabio

>
> Logs on Node#1:
> Feb 21 13:06:47 USBack-prox1 corosync[3015]: [TOTEM ] A processor failed, forming new configuration.
> Feb 21 13:06:51 USBack-prox1 kernel: dlm: connecting to 3
> Feb 21 13:06:54 USBack-prox1 kernel: dlm: connecting to 2
> Feb 21 13:06:54 USBack-prox1 kernel: dlm: connecting to 2
> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [QUORUM] Members[2]: 1 3
> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [CMAN ] quorum lost, blocking activity
> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [QUORUM] Members[1]: 1
> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [CMAN ] quorum regained, resuming activity
> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [QUORUM] This node is within the primary component and will provide service.
> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [QUORUM] Members[2]: 1 2
> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [QUORUM] Members[2]: 1 2
> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [QUORUM] Members[3]: 1 2 3
> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [QUORUM] Members[3]: 1 2 3
> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [CPG ] downlist received left_list: 2
> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [CPG ] downlist received left_list: 0
> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [CPG ] downlist received left_list: 0
> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [CPG ] chosen downlist from node r(0) ip(--.--.--.21)
> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [MAIN ] Completed service synchronization, ready to provide service.
>
> Logs on Node#3:
> Feb 21 13:06:41 USBack-prox3 corosync[2956]: [QUORUM] Members[2]: 2 3
> Feb 21 13:06:41 USBack-prox3 corosync[2956]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Feb 21 13:06:41 USBack-prox3 rgmanager[3177]: State change: USBack-prox1 DOWN
> Feb 21 13:06:41 USBack-prox3 corosync[2956]: [CPG ] downlist received left_list: 1
> Feb 21 13:06:41 USBack-prox3 corosync[2956]: [CPG ] downlist received left_list: 1
> Feb 21 13:06:41 USBack-prox3 corosync[2956]: [CPG ] chosen downlist from node r(0) ip(--.--.--.22)
> Feb 21 13:06:41 USBack-prox3 corosync[2956]: [MAIN ] Completed service synchronization, ready to provide service.
> Feb 21 13:06:41 USBack-prox3 kernel: dlm: closing connection to node 1
> Feb 21 13:06:41 USBack-prox3 fenced[3008]: fencing deferred to USBack-prox2
> Feb 21 13:06:41 USBack-prox3 kernel: GFS2: fsid=USBackCluster:VMStorage1.2: jid=1: Trying to acquire journal lock...
> Feb 21 13:06:41 USBack-prox3 kernel: GFS2: fsid=USBackCluster:VMStorage2.2: jid=1: Trying to acquire journal lock...
> Feb 21 13:06:51 USBack-prox3 kernel: dlm: connect from non cluster node
> Feb 21 13:06:55 USBack-prox3 corosync[2956]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Feb 21 13:06:55 USBack-prox3 corosync[2956]: [QUORUM] Members[3]: 1 2 3
> Feb 21 13:06:55 USBack-prox3 corosync[2956]: [QUORUM] Members[3]: 1 2 3
> Feb 21 13:06:55 USBack-prox3 rgmanager[3177]: State change: USBack-prox1 UP
> Feb 21 13:06:55 USBack-prox3 corosync[2956]: [CPG ] downlist received left_list: 2
> Feb 21 13:06:55 USBack-prox3 corosync[2956]: [CPG ] downlist received left_list: 0
> Feb 21 13:06:55 USBack-prox3 corosync[2956]: [CPG ] downlist received left_list: 0
> Feb 21 13:06:55 USBack-prox3 corosync[2956]: [CPG ] chosen downlist from node r(0) ip(--.--.--.21)
> Feb 21 13:06:55 USBack-prox3 corosync[2956]: [MAIN ] Completed service synchronization, ready to provide service.
> Feb 21 13:06:55 USBack-prox3 fenced[3008]: cpg_mcast_joined error 12 handle 4e6afb6600000000 protocol
> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 3a95f87400000000 protocol
> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 1e7ff52100000001 start
> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 22221a7000000002 start
> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 419ac24100000003 start
> Feb 21 13:06:55 USBack-prox3 fenced[3008]: cpg_mcast_joined error 12 handle 440badfc00000001 start
> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 3804823e00000004 start
> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 2463b9ea00000005 start
> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 22221a7000000002 start
> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 419ac24100000003 start
> Feb 21 13:06:55 USBack-prox3 dlm_controld[3034]: cpg_mcast_joined error 12 handle 440badfc00000000 protocol
> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 3804823e00000004 start
> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 2463b9ea00000005 start
> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 1e7ff52100000001 start
>
>>
>>>
>>> Fabio
>>>
>>> > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>> > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [QUORUM] Members[3]: 1 2 3
>>> > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [QUORUM] Members[3]: 1 2 3
>>> > Feb 21 13:06:55 USBack-prox2 rgmanager[4130]: State change: USBack-prox1 UP
>>> > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [CPG ] downlist received left_list: 2
>>> > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [CPG ] downlist received left_list: 0
>>> > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [CPG ] downlist received left_list: 0
>>> > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [CPG ] chosen downlist from node r(0) ip(--.--.--.21)
>>> > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [MAIN ] Completed service synchronization, ready to provide service.
>>> > Feb 21 13:06:55 USBack-prox2 gfs_controld[4029]: cpg_mcast_joined error 12 handle 3a95f87400000000 protocol
>>> > Feb 21 13:06:55 USBack-prox2 gfs_controld[4029]: cpg_mcast_joined error 12 handle 1e7ff52100000001 start
>>> > Feb 21 13:06:55 USBack-prox2 gfs_controld[4029]: cpg_mcast_joined error 12 handle 22221a7000000002 start
>>> > Feb 21 13:06:55 USBack-prox2 gfs_controld[4029]: cpg_mcast_joined error 12 handle 419ac24100000003 start
>>> > Feb 21 13:06:55 USBack-prox2 gfs_controld[4029]: cpg_mcast_joined error 12 handle 3804823e00000004 start
>>> >
>>> > -------------------------------------------------
>>> > Then GFS2 generates error logs (Activities blocked).
>>> >
>>> > Logs of cisco switch (Time is UTC):
>>> >
>>> > Feb 21 09:37:02.375: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/11, changed state to down
>>> > Feb 21 09:37:02.459: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/4, changed state to down
>>> > Feb 21 09:37:03.382: %LINK-3-UPDOWN: Interface GigabitEthernet0/11, changed state to down
>>> > Feb 21 09:37:03.541: %LINK-3-UPDOWN: Interface GigabitEthernet0/4, changed state to down
>>> > Feb 21 09:37:07.283: %LINK-3-UPDOWN: Interface GigabitEthernet0/11, changed state to up
>>> > Feb 21 09:37:07.350: %LINK-3-UPDOWN: Interface GigabitEthernet0/4, changed state to up
>>> > Feb 21 09:37:08.289: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/11, changed state to up
>>> > Feb 21 09:37:09.472: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/4, changed state to up
>>> > Feb 21 09:40:20.045: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/11, changed state to down
>>> > Feb 21 09:40:21.043: %LINK-3-UPDOWN: Interface GigabitEthernet0/11, changed state to down
>>> > Feb 21 09:40:23.401: %LINK-3-UPDOWN: Interface GigabitEthernet0/11, changed state to up

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss
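
A rough sketch of the iptables/NAT checks mentioned above. This assumes the
documented default ports for a RHEL 6 / SL 6 cluster (corosync/cman on UDP
5404 and 5405, dlm on TCP 21064) and uses eth0 / 10.10.10.0/24 as mere
placeholders for your real cluster interface and subnet; adjust everything
to your own environment before running it.

# 1) On every node, look at the rules that are actually loaded:
iptables -L INPUT -n -v
iptables -S | grep -E '5404|5405|21064'

# 2) If no rule matches the cluster ports, open them on the cluster
#    interface (placeholder interface/subnet -- adjust as needed):
iptables -I INPUT -i eth0 -s 10.10.10.0/24 -p udp -m multiport --dports 5404,5405 -j ACCEPT   # corosync/cman
iptables -I INPUT -i eth0 -s 10.10.10.0/24 -p tcp --dport 21064 -j ACCEPT                      # dlm
service iptables save

# 3) Check which address every node is known by; each node must see the
#    real, non-NATed address of its peers on the cluster network:
cman_tool nodes -a
corosync-cfgtool -s

# 4) From each node, confirm the dlm port of the peers is reachable at all
#    (a plain TCP connect is enough to rule out a firewall drop; close it
#    again right away):
telnet USBack-prox1 21064

If all of that looks sane and the "dlm: connect from non cluster node"
messages still appear, a tcpdump on TCP port 21064 compared against the
addresses reported by cman_tool should show whether something on the path
is rewriting the source addresses.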