On Mon, Feb 24, 2014 at 11:23 AM, Fabio M. Di Nitto <fdinitto@xxxxxxxxxx> wrote:
> On 2/24/2014 8:47 AM, cluster lab wrote:
>> On Sun, Feb 23, 2014 at 10:40 PM, Fabio M. Di Nitto <fdinitto@xxxxxxxxxx> wrote:
>>> On 02/23/2014 12:59 PM, cluster lab wrote:
>>>>
>>>> On Sat, Feb 22, 2014 at 2:28 PM, Fabio M. Di Nitto <fdinitto@xxxxxxxxxx> wrote:
>>>>
>>>>     On 02/22/2014 11:10 AM, cluster lab wrote:
>>>>     > Hi,
>>>>     >
>>>>     > In the middle of cluster activity I received these messages (the
>>>>     > cluster is a 3-node cluster with SAN storage and a GFS2 filesystem):
>>>>
>>>>     OS? Version of the packages? cluster.conf?
>>>>
>>>> OS: SL (Scientific Linux 6)
>>>>
>>>> Packages:
>>>> kernel-2.6.32-71.29.1.el6.x86_64
>>>> rgmanager-3.0.12.1-12.el6.x86_64
>>>> cman-3.0.12-23.el6.x86_64
>>>> corosynclib-1.2.3-21.el6.x86_64
>>>> corosync-1.2.3-21.el6.x86_64
>>>>
>>>> cluster.conf:
>>>>
>>>> <?xml version="1.0"?>
>>>> <cluster config_version="224" name="USBackCluster">
>>>>     <fence_daemon clean_start="0" post_fail_delay="10" post_join_delay="3"/>
>>>>     <clusternodes>
>>>>         <clusternode name="USBack-prox1" nodeid="1" votes="1">
>>>>             <fence>
>>>>                 <method name="ilo">
>>>>                     <device name="USBack-prox1-ilo"/>
>>>>                 </method>
>>>>             </fence>
>>>>         </clusternode>
>>>>         <clusternode name="USBack-prox2" nodeid="2" votes="1">
>>>>             <fence>
>>>>                 <method name="ilo">
>>>>                     <device name="USBack-prox2-ilo"/>
>>>>                 </method>
>>>>             </fence>
>>>>         </clusternode>
>>>>         <clusternode name="USBack-prox3" nodeid="3" votes="1">
>>>>             <fence>
>>>>                 <method name="ilo">
>>>>                     <device name="USBack-prox3-ilo"/>
>>>>                 </method>
>>>>             </fence>
>>>>         </clusternode>
>>>>     </clusternodes>
>>>>     <cman/>
>>>>     <fencedevices>
>>>>         ... fence config ...
>>>>     </fencedevices>
>>>>     <rm>
>>>>         <failoverdomains>
>>>>             <failoverdomain name="VMS-Area" nofailback="0" ordered="0" restricted="0">
>>>>                 <failoverdomainnode name="USBack-prox1" priority="1"/>
>>>>                 <failoverdomainnode name="USBack-prox2" priority="1"/>
>>>>                 <failoverdomainnode name="USBack-prox3" priority="1"/>
>>>>             </failoverdomain>
>>>>         </failoverdomains>
>>>>         <resources>
>>>>         ....
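(A side note on the fence setup above: since fenced reports "error from
agent" for fence_ipmilan further down, it may be worth exercising the iLO
device by hand, outside of fenced. A minimal sketch; the address and
credentials are placeholders for whatever is in the elided fencedevices
section, and newer iLO generations may also need -P for lanplus:

    # Ask node 1's iLO for its power status through the same agent fenced uses.
    fence_ipmilan -a <prox1-ilo-address> -l <login> -p <password> -o status

    # Or drive the configured agent exactly as the cluster would, via
    # cluster.conf -- note that this really fences (reboots) the node.
    fence_node -vv USBack-prox1

If the status query already fails here, the fencing failure below is an
iLO/credential problem rather than a cluster one.)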
>>>>     > Log messages on USBack-prox2:
>>>>     >
>>>>     > Feb 21 13:06:41 USBack-prox2 corosync[3911]: [QUORUM] Members[2]: 2 3
>>>>     > Feb 21 13:06:41 USBack-prox2 corosync[3911]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>>>     > Feb 21 13:06:41 USBack-prox2 rgmanager[4130]: State change: USBack-prox1 DOWN
>>>>     > Feb 21 13:06:41 USBack-prox2 kernel: dlm: closing connection to node 1
>>>>     > Feb 21 13:06:41 USBack-prox2 corosync[3911]: [CPG ] downlist received left_list: 1
>>>>     > Feb 21 13:06:41 USBack-prox2 corosync[3911]: [CPG ] downlist received left_list: 1
>>>>     > Feb 21 13:06:41 USBack-prox2 corosync[3911]: [CPG ] chosen downlist from node r(0) ip(--.--.--.22)
>>>>     > Feb 21 13:06:41 USBack-prox2 corosync[3911]: [MAIN ] Completed service synchronization, ready to provide service.
>>>>     > Feb 21 13:06:41 USBack-prox2 kernel: GFS2: fsid=USBackCluster:VMStorage1.0: jid=1: Trying to acquire journal lock...
>>>>     > Feb 21 13:06:41 USBack-prox2 kernel: GFS2: fsid=USBackCluster:VMStorage2.0: jid=1: Trying to acquire journal lock...
>>>>     > Feb 21 13:06:51 USBack-prox2 fenced[3957]: fencing node USBack-prox1
>>>>     > Feb 21 13:06:52 USBack-prox2 fenced[3957]: fence USBack-prox1 dev 0.0 agent fence_ipmilan result: error from agent
>>>>     > Feb 21 13:06:52 USBack-prox2 fenced[3957]: fence USBack-prox1 failed
>>>>     > Feb 21 13:06:54 USBack-prox2 kernel: dlm: connect from non cluster node
>>>>     > Feb 21 13:06:54 USBack-prox2 kernel: dlm: connect from non cluster node
>>>>
>>>>     ^^^ Good hint here. Something is off.
>>>>
>>>> ?
>>>
>>> It means that there is something in that network that tries to connect
>>> to a cluster node without being a cluster node.
>>>
>>> Fabio
>>
>> There is no node in the cluster network other than the cluster nodes.
>> I think node #1 retries to reconnect to dlm and can't.
>>
>> There are two tries on node #1:
>> Feb 21 13:06:54 USBack-prox1 kernel: dlm: connecting to 3
>> Feb 21 13:06:54 USBack-prox1 kernel: dlm: connecting to 2
>> Feb 21 13:06:54 USBack-prox1 kernel: dlm: connecting to 2
>
> Can you please check that iptables are set correctly and that traffic
> between nodes is not behind NAT?
>
> Fabio

iptables is disabled, and traffic between the cluster nodes is flat,
without any NAT.
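Even so, it may be worth confirming on the wire that the ports the stack
uses are reachable node to node. A quick sketch; the interface name is a
placeholder (DLM defaults to TCP port 21064, and corosync uses the UDP
ports around the configured mcastport, 5404/5405 by default):

    # Confirm the rule set really is empty on every node.
    iptables -L -n

    # Watch DLM inter-node traffic while the problem reproduces.
    tcpdump -ni eth0 tcp port 21064

    # Watch corosync membership traffic.
    tcpdump -ni eth0 udp port 5404 or udp port 5405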
>> Logs on node #1 (USBack-prox1):
>>
>> Feb 21 13:06:47 USBack-prox1 corosync[3015]: [TOTEM ] A processor failed, forming new configuration.
>> Feb 21 13:06:51 USBack-prox1 kernel: dlm: connecting to 3
>> Feb 21 13:06:54 USBack-prox1 kernel: dlm: connecting to 2
>> Feb 21 13:06:54 USBack-prox1 kernel: dlm: connecting to 2
>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [QUORUM] Members[2]: 1 3
>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [CMAN ] quorum lost, blocking activity
>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [QUORUM] Members[1]: 1
>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [CMAN ] quorum regained, resuming activity
>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [QUORUM] This node is within the primary component and will provide service.
>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [QUORUM] Members[2]: 1 2
>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [QUORUM] Members[2]: 1 2
>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [QUORUM] Members[3]: 1 2 3
>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [QUORUM] Members[3]: 1 2 3
>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [CPG ] downlist received left_list: 2
>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [CPG ] downlist received left_list: 0
>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [CPG ] downlist received left_list: 0
>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [CPG ] chosen downlist from node r(0) ip(--.--.--.21)
>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [MAIN ] Completed service synchronization, ready to provide service.
>>
>> Logs on node #3 (USBack-prox3):
>>
>> Feb 21 13:06:41 USBack-prox3 corosync[2956]: [QUORUM] Members[2]: 2 3
>> Feb 21 13:06:41 USBack-prox3 corosync[2956]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>> Feb 21 13:06:41 USBack-prox3 rgmanager[3177]: State change: USBack-prox1 DOWN
>> Feb 21 13:06:41 USBack-prox3 corosync[2956]: [CPG ] downlist received left_list: 1
>> Feb 21 13:06:41 USBack-prox3 corosync[2956]: [CPG ] downlist received left_list: 1
>> Feb 21 13:06:41 USBack-prox3 corosync[2956]: [CPG ] chosen downlist from node r(0) ip(--.--.--.22)
>> Feb 21 13:06:41 USBack-prox3 corosync[2956]: [MAIN ] Completed service synchronization, ready to provide service.
>> Feb 21 13:06:41 USBack-prox3 kernel: dlm: closing connection to node 1
>> Feb 21 13:06:41 USBack-prox3 fenced[3008]: fencing deferred to USBack-prox2
>> Feb 21 13:06:41 USBack-prox3 kernel: GFS2: fsid=USBackCluster:VMStorage1.2: jid=1: Trying to acquire journal lock...
>> Feb 21 13:06:41 USBack-prox3 kernel: GFS2: fsid=USBackCluster:VMStorage2.2: jid=1: Trying to acquire journal lock...
>> Feb 21 13:06:51 USBack-prox3 kernel: dlm: connect from non cluster node
>> Feb 21 13:06:55 USBack-prox3 corosync[2956]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>> Feb 21 13:06:55 USBack-prox3 corosync[2956]: [QUORUM] Members[3]: 1 2 3
>> Feb 21 13:06:55 USBack-prox3 corosync[2956]: [QUORUM] Members[3]: 1 2 3
>> Feb 21 13:06:55 USBack-prox3 rgmanager[3177]: State change: USBack-prox1 UP
>> Feb 21 13:06:55 USBack-prox3 corosync[2956]: [CPG ] downlist received left_list: 2
>> Feb 21 13:06:55 USBack-prox3 corosync[2956]: [CPG ] downlist received left_list: 0
>> Feb 21 13:06:55 USBack-prox3 corosync[2956]: [CPG ] downlist received left_list: 0
>> Feb 21 13:06:55 USBack-prox3 corosync[2956]: [CPG ] chosen downlist from node r(0) ip(--.--.--.21)
>> Feb 21 13:06:55 USBack-prox3 corosync[2956]: [MAIN ] Completed service synchronization, ready to provide service.
>> Feb 21 13:06:55 USBack-prox3 fenced[3008]: cpg_mcast_joined error 12 handle 4e6afb6600000000 protocol
>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 3a95f87400000000 protocol
>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 1e7ff52100000001 start
>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 22221a7000000002 start
>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 419ac24100000003 start
>> Feb 21 13:06:55 USBack-prox3 fenced[3008]: cpg_mcast_joined error 12 handle 440badfc00000001 start
>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 3804823e00000004 start
>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 2463b9ea00000005 start
>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 22221a7000000002 start
>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 419ac24100000003 start
>> Feb 21 13:06:55 USBack-prox3 dlm_controld[3034]: cpg_mcast_joined error 12 handle 440badfc00000000 protocol
>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 3804823e00000004 start
>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 2463b9ea00000005 start
>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 1e7ff52100000001 start
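(The cpg_mcast_joined failures above look like fenced/dlm_controld/
gfs_controld losing their corosync CPG connection during the rejoin; if I
read corosync's cs_error_t right, error 12 corresponds to CS_ERR_NOT_EXIST.
After an event like this it is worth checking whether the daemons re-formed
their groups cleanly; a minimal sketch with the standard cman-era tools:

    # Membership and quorum as cman sees them.
    cman_tool status
    cman_tool nodes

    # State of the fence, dlm and gfs groups; entries stuck in a
    # change/wait state mean the daemons never recovered.
    group_tool ls
    fence_tool ls
    dlm_tool ls
)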
>>>>
>>>>     Fabio
>>>>
>>>>     > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>>>     > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [QUORUM] Members[3]: 1 2 3
>>>>     > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [QUORUM] Members[3]: 1 2 3
>>>>     > Feb 21 13:06:55 USBack-prox2 rgmanager[4130]: State change: USBack-prox1 UP
>>>>     > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [CPG ] downlist received left_list: 2
>>>>     > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [CPG ] downlist received left_list: 0
>>>>     > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [CPG ] downlist received left_list: 0
>>>>     > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [CPG ] chosen downlist from node r(0) ip(--.--.--.21)
>>>>     > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [MAIN ] Completed service synchronization, ready to provide service.
>>>>     > Feb 21 13:06:55 USBack-prox2 gfs_controld[4029]: cpg_mcast_joined error 12 handle 3a95f87400000000 protocol
>>>>     > Feb 21 13:06:55 USBack-prox2 gfs_controld[4029]: cpg_mcast_joined error 12 handle 1e7ff52100000001 start
>>>>     > Feb 21 13:06:55 USBack-prox2 gfs_controld[4029]: cpg_mcast_joined error 12 handle 22221a7000000002 start
>>>>     > Feb 21 13:06:55 USBack-prox2 gfs_controld[4029]: cpg_mcast_joined error 12 handle 419ac24100000003 start
>>>>     > Feb 21 13:06:55 USBack-prox2 gfs_controld[4029]: cpg_mcast_joined error 12 handle 3804823e00000004 start
>>>>     >
>>>>     > -------------------------------------------------
>>>>     > Then GFS2 generates error logs (activity is blocked).
>>>>     >
>>>>     > Logs of the Cisco switch (time is UTC):
>>>>     >
>>>>     > Feb 21 09:37:02.375: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/11, changed state to down
>>>>     > Feb 21 09:37:02.459: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/4, changed state to down
>>>>     > Feb 21 09:37:03.382: %LINK-3-UPDOWN: Interface GigabitEthernet0/11, changed state to down
>>>>     > Feb 21 09:37:03.541: %LINK-3-UPDOWN: Interface GigabitEthernet0/4, changed state to down
>>>>     > Feb 21 09:37:07.283: %LINK-3-UPDOWN: Interface GigabitEthernet0/11, changed state to up
>>>>     > Feb 21 09:37:07.350: %LINK-3-UPDOWN: Interface GigabitEthernet0/4, changed state to up
>>>>     > Feb 21 09:37:08.289: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/11, changed state to up
>>>>     > Feb 21 09:37:09.472: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/4, changed state to up
>>>>     > Feb 21 09:40:20.045: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/11, changed state to down
>>>>     > Feb 21 09:40:21.043: %LINK-3-UPDOWN: Interface GigabitEthernet0/11, changed state to down
>>>>     > Feb 21 09:40:23.401: %LINK-3-UPDOWN: Interface GigabitEthernet0/11, changed state to up

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss