On Mon, Feb 24, 2014 at 11:37 AM, Jan Friesse <jfriesse@xxxxxxxxxx> wrote:
> cluster lab napsal(a):
>
>> On Mon, Feb 24, 2014 at 11:23 AM, Fabio M. Di Nitto <fdinitto@xxxxxxxxxx> wrote:
>>>
>>> On 2/24/2014 8:47 AM, cluster lab wrote:
>>>>
>>>> On Sun, Feb 23, 2014 at 10:40 PM, Fabio M. Di Nitto
>>>> <fdinitto@xxxxxxxxxx> wrote:
>>>>>
>>>>> On 02/23/2014 12:59 PM, cluster lab wrote:
>>>>>>
>>>>>> On Sat, Feb 22, 2014 at 2:28 PM, Fabio M. Di Nitto
>>>>>> <fdinitto@xxxxxxxxxx> wrote:
>>>>>>
>>>>>> On 02/22/2014 11:10 AM, cluster lab wrote:
>>>>>> > Hi,
>>>>>> >
>>>>>> > In the middle of cluster activity I received the messages below
>>>>>> > (the cluster is 3 nodes with SAN ... GFS2 filesystem).
>>>>>>
>>>>>> OS? Version of the packages? cluster.conf?
>>>>>>
>>>>>> OS: SL (Scientific Linux 6)
>>>>>>
>>>>>> Packages:
>>>>>> kernel-2.6.32-71.29.1.el6.x86_64
>>>>>> rgmanager-3.0.12.1-12.el6.x86_64
>>>>>> cman-3.0.12-23.el6.x86_64
>>>>>> corosynclib-1.2.3-21.el6.x86_64
>>>>>> corosync-1.2.3-21.el6.x86_64
>
> ^^^^ This is really, really corosync for SL 6.0 GOLD. It is unsupported
> and known to be pretty buggy (if the problem you hit is the only one
> you hit, you are a pretty lucky guy).
>
> Please update to something a little less ancient.
>
> Regards,
>   Honza

The latest package in the Red Hat repository is 1.4.7. Do you recommend that package?
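For reference, a minimal sketch of what the update path looks like on SL 6.x
(package names are taken from the list above; the exact versions pulled in
depend on which update repositories are enabled, so treat this as an
assumption rather than a tested procedure):

    # pull the whole cman/corosync stack forward in one transaction
    yum update corosync corosynclib cman rgmanager

    # after restarting the stack on a node, confirm the installed version
    corosync -v

The cluster stack would need a clean stop/start on each node, one node at a
time with services relocated, for the new version to take effect.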
>>>>>> Cluster.conf:
>>>>>>
>>>>>> <?xml version="1.0"?>
>>>>>> <cluster config_version="224" name="USBackCluster">
>>>>>>     <fence_daemon clean_start="0" post_fail_delay="10" post_join_delay="3"/>
>>>>>>     <clusternodes>
>>>>>>         <clusternode name="USBack-prox1" nodeid="1" votes="1">
>>>>>>             <fence>
>>>>>>                 <method name="ilo">
>>>>>>                     <device name="USBack-prox1-ilo"/>
>>>>>>                 </method>
>>>>>>             </fence>
>>>>>>         </clusternode>
>>>>>>         <clusternode name="USBack-prox2" nodeid="2" votes="1">
>>>>>>             <fence>
>>>>>>                 <method name="ilo">
>>>>>>                     <device name="USBack-prox2-ilo"/>
>>>>>>                 </method>
>>>>>>             </fence>
>>>>>>         </clusternode>
>>>>>>         <clusternode name="USBack-prox3" nodeid="3" votes="1">
>>>>>>             <fence>
>>>>>>                 <method name="ilo">
>>>>>>                     <device name="USBack-prox3-ilo"/>
>>>>>>                 </method>
>>>>>>             </fence>
>>>>>>         </clusternode>
>>>>>>     </clusternodes>
>>>>>>     <cman/>
>>>>>>     <fencedevices>
>>>>>>         ... fence config ...
>>>>>>     </fencedevices>
>>>>>>     <rm>
>>>>>>         <failoverdomains>
>>>>>>             <failoverdomain name="VMS-Area" nofailback="0" ordered="0" restricted="0">
>>>>>>                 <failoverdomainnode name="USBack-prox1" priority="1"/>
>>>>>>                 <failoverdomainnode name="USBack-prox2" priority="1"/>
>>>>>>                 <failoverdomainnode name="USBack-prox3" priority="1"/>
>>>>>>             </failoverdomain>
>>>>>>         </failoverdomains>
>>>>>>         <resources>
>>>>>>         ....
>>>>>>
>>>>>> > Log messages on USBack-prox2:
>>>>>> >
>>>>>> > Feb 21 13:06:41 USBack-prox2 corosync[3911]:   [QUORUM] Members[2]: 2 3
>>>>>> > Feb 21 13:06:41 USBack-prox2 corosync[3911]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>>>>> > Feb 21 13:06:41 USBack-prox2 rgmanager[4130]: State change: USBack-prox1 DOWN
>>>>>> > Feb 21 13:06:41 USBack-prox2 kernel: dlm: closing connection to node 1
>>>>>> > Feb 21 13:06:41 USBack-prox2 corosync[3911]:   [CPG   ] downlist received left_list: 1
>>>>>> > Feb 21 13:06:41 USBack-prox2 corosync[3911]:   [CPG   ] downlist received left_list: 1
>>>>>> > Feb 21 13:06:41 USBack-prox2 corosync[3911]:   [CPG   ] chosen downlist from node r(0) ip(--.--.--.22)
>>>>>> > Feb 21 13:06:41 USBack-prox2 corosync[3911]:   [MAIN  ] Completed service synchronization, ready to provide service.
>>>>>> > Feb 21 13:06:41 USBack-prox2 kernel: GFS2: fsid=USBackCluster:VMStorage1.0: jid=1: Trying to acquire journal lock...
>>>>>> > Feb 21 13:06:41 USBack-prox2 kernel: GFS2: fsid=USBackCluster:VMStorage2.0: jid=1: Trying to acquire journal lock...
>>>>>> > Feb 21 13:06:51 USBack-prox2 fenced[3957]: fencing node USBack-prox1
>>>>>> > Feb 21 13:06:52 USBack-prox2 fenced[3957]: fence USBack-prox1 dev 0.0 agent fence_ipmilan result: error from agent
>>>>>> > Feb 21 13:06:52 USBack-prox2 fenced[3957]: fence USBack-prox1 failed
>>>>>> > Feb 21 13:06:54 USBack-prox2 kernel: dlm: connect from non cluster node
>>>>>> > Feb 21 13:06:54 USBack-prox2 kernel: dlm: connect from non cluster node
>>>>>>
>>>>>> ^^^ good hint here. Something is off.
>>>>>>
>>>>>> ?
>>>>>
>>>>> It means that there is something in that network that tries to connect
>>>>> to the cluster node without being a cluster node.
>>>>>
>>>>> Fabio
>>>>
>>>> There is no node in the cluster network other than the cluster nodes.
>>>> I think node #1 retries to reconnect DLM and can't.
>>>>
>>>> There are two tries on node #1:
>>>> Feb 21 13:06:54 USBack-prox1 kernel: dlm: connecting to 3
>>>> Feb 21 13:06:54 USBack-prox1 kernel: dlm: connecting to 2
>>>> Feb 21 13:06:54 USBack-prox1 kernel: dlm: connecting to 2
>>>
>>> Can you please check that iptables are set correctly and that traffic
>>> between nodes is not behind NAT?
>>>
>>> Fabio
>>
>> iptables is disabled, and traffic between the cluster nodes is flat,
>> without any NAT.
>>
>>>> Logs on Node#1:
>>>> Feb 21 13:06:47 USBack-prox1 corosync[3015]:   [TOTEM ] A processor failed, forming new configuration.
>>>> Feb 21 13:06:51 USBack-prox1 kernel: dlm: connecting to 3
>>>> Feb 21 13:06:54 USBack-prox1 kernel: dlm: connecting to 2
>>>> Feb 21 13:06:54 USBack-prox1 kernel: dlm: connecting to 2
>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]:   [QUORUM] Members[2]: 1 3
>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]:   [CMAN  ] quorum lost, blocking activity
>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]:   [QUORUM] Members[1]: 1
>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]:   [CMAN  ] quorum regained, resuming activity
>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]:   [QUORUM] This node is within the primary component and will provide service.
>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]:   [QUORUM] Members[2]: 1 2
>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]:   [QUORUM] Members[2]: 1 2
>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]:   [QUORUM] Members[3]: 1 2 3
>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]:   [QUORUM] Members[3]: 1 2 3
>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]:   [CPG   ] downlist received left_list: 2
>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]:   [CPG   ] downlist received left_list: 0
>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]:   [CPG   ] downlist received left_list: 0
>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]:   [CPG   ] chosen downlist from node r(0) ip(--.--.--.21)
>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]:   [MAIN  ] Completed service synchronization, ready to provide service.
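Since iptables is ruled out, it may still be worth confirming on the wire who
is actually hitting the DLM port when the "connect from non cluster node"
messages appear. A sketch using the stock defaults (DLM listens on TCP 21064,
corosync totem uses UDP 5404-5405; adjust if your configuration overrides
these):

    # iptables really empty? all chains should show policy ACCEPT, no rules
    iptables -L -n

    # every established peer on the DLM port should be a cluster node address
    netstat -tn | grep 21064

    # corosync totem sockets (multicast by default)
    netstat -anu | grep -E ':540[45]'

An unexpected source address here (for example a second NIC, or a bonding
slave failing over to a different IP) is one plausible way to get dlm
rejecting a connection as "non cluster node".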
>>>>
>>>> Logs on Node#3:
>>>> Feb 21 13:06:41 USBack-prox3 corosync[2956]:   [QUORUM] Members[2]: 2 3
>>>> Feb 21 13:06:41 USBack-prox3 corosync[2956]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>>> Feb 21 13:06:41 USBack-prox3 rgmanager[3177]: State change: USBack-prox1 DOWN
>>>> Feb 21 13:06:41 USBack-prox3 corosync[2956]:   [CPG   ] downlist received left_list: 1
>>>> Feb 21 13:06:41 USBack-prox3 corosync[2956]:   [CPG   ] downlist received left_list: 1
>>>> Feb 21 13:06:41 USBack-prox3 corosync[2956]:   [CPG   ] chosen downlist from node r(0) ip(--.--.--.22)
>>>> Feb 21 13:06:41 USBack-prox3 corosync[2956]:   [MAIN  ] Completed service synchronization, ready to provide service.
>>>> Feb 21 13:06:41 USBack-prox3 kernel: dlm: closing connection to node 1
>>>> Feb 21 13:06:41 USBack-prox3 fenced[3008]: fencing deferred to USBack-prox2
>>>> Feb 21 13:06:41 USBack-prox3 kernel: GFS2: fsid=USBackCluster:VMStorage1.2: jid=1: Trying to acquire journal lock...
>>>> Feb 21 13:06:41 USBack-prox3 kernel: GFS2: fsid=USBackCluster:VMStorage2.2: jid=1: Trying to acquire journal lock...
>>>> Feb 21 13:06:51 USBack-prox3 kernel: dlm: connect from non cluster node
>>>> Feb 21 13:06:55 USBack-prox3 corosync[2956]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>>> Feb 21 13:06:55 USBack-prox3 corosync[2956]:   [QUORUM] Members[3]: 1 2 3
>>>> Feb 21 13:06:55 USBack-prox3 corosync[2956]:   [QUORUM] Members[3]: 1 2 3
>>>> Feb 21 13:06:55 USBack-prox3 rgmanager[3177]: State change: USBack-prox1 UP
>>>> Feb 21 13:06:55 USBack-prox3 corosync[2956]:   [CPG   ] downlist received left_list: 2
>>>> Feb 21 13:06:55 USBack-prox3 corosync[2956]:   [CPG   ] downlist received left_list: 0
>>>> Feb 21 13:06:55 USBack-prox3 corosync[2956]:   [CPG   ] downlist received left_list: 0
>>>> Feb 21 13:06:55 USBack-prox3 corosync[2956]:   [CPG   ] chosen downlist from node r(0) ip(--.--.--.21)
>>>> Feb 21 13:06:55 USBack-prox3 corosync[2956]:   [MAIN  ] Completed service synchronization, ready to provide service.
>>>> Feb 21 13:06:55 USBack-prox3 fenced[3008]: cpg_mcast_joined error 12 handle 4e6afb6600000000 protocol
>>>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 3a95f87400000000 protocol
>>>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 1e7ff52100000001 start
>>>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 22221a7000000002 start
>>>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 419ac24100000003 start
>>>> Feb 21 13:06:55 USBack-prox3 fenced[3008]: cpg_mcast_joined error 12 handle 440badfc00000001 start
>>>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 3804823e00000004 start
>>>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 2463b9ea00000005 start
>>>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 22221a7000000002 start
>>>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 419ac24100000003 start
>>>> Feb 21 13:06:55 USBack-prox3 dlm_controld[3034]: cpg_mcast_joined error 12 handle 440badfc00000000 protocol
>>>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 3804823e00000004 start
>>>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 2463b9ea00000005 start
>>>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 1e7ff52100000001 start
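One more thing stands out above: "fence USBack-prox1 ... agent fence_ipmilan
result: error from agent", so fencing failed exactly when it was needed and
was then deferred. A sketch for testing the fence path by hand; the address
and credentials are placeholders to substitute with your real iLO values:

    # ask the agent directly for power status over IPMI-over-LAN
    fence_ipmilan -a <ilo-address> -l <user> -p <password> -o status

    # or drive the configured agent end to end through the cluster stack
    # (WARNING: this really power-cycles the node, do it in a maintenance window)
    fence_node USBack-prox1

Note that fence_ipmilan speaks IPMI (UDP port 623) to the management
processor, which on HP iLO typically has to be enabled separately
("IPMI/DCMI over LAN"); fence_ilo against the iLO web interface is the
usual alternative if IPMI is not available.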
>>>>>
>>>>>> Fabio
>>>>>>
>>>>>> > Feb 21 13:06:55 USBack-prox2 corosync[3911]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>>>>> > Feb 21 13:06:55 USBack-prox2 corosync[3911]:   [QUORUM] Members[3]: 1 2 3
>>>>>> > Feb 21 13:06:55 USBack-prox2 corosync[3911]:   [QUORUM] Members[3]: 1 2 3
>>>>>> > Feb 21 13:06:55 USBack-prox2 rgmanager[4130]: State change: USBack-prox1 UP
>>>>>> > Feb 21 13:06:55 USBack-prox2 corosync[3911]:   [CPG   ] downlist received left_list: 2
>>>>>> > Feb 21 13:06:55 USBack-prox2 corosync[3911]:   [CPG   ] downlist received left_list: 0
>>>>>> > Feb 21 13:06:55 USBack-prox2 corosync[3911]:   [CPG   ] downlist received left_list: 0
>>>>>> > Feb 21 13:06:55 USBack-prox2 corosync[3911]:   [CPG   ] chosen downlist from node r(0) ip(--.--.--.21)
>>>>>> > Feb 21 13:06:55 USBack-prox2 corosync[3911]:   [MAIN  ] Completed service synchronization, ready to provide service.
>>>>>> > Feb 21 13:06:55 USBack-prox2 gfs_controld[4029]: cpg_mcast_joined error 12 handle 3a95f87400000000 protocol
>>>>>> > Feb 21 13:06:55 USBack-prox2 gfs_controld[4029]: cpg_mcast_joined error 12 handle 1e7ff52100000001 start
>>>>>> > Feb 21 13:06:55 USBack-prox2 gfs_controld[4029]: cpg_mcast_joined error 12 handle 22221a7000000002 start
>>>>>> > Feb 21 13:06:55 USBack-prox2 gfs_controld[4029]: cpg_mcast_joined error 12 handle 419ac24100000003 start
>>>>>> > Feb 21 13:06:55 USBack-prox2 gfs_controld[4029]: cpg_mcast_joined error 12 handle 3804823e00000004 start
>>>>>> >
>>>>>> > -------------------------------------------------
>>>>>> > Then GFS2 generates error logs (activity blocked).
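For what it's worth, the 13:06:41 to 13:06:55 membership bounce lines up
with the roughly five-second port flaps in the switch logs below (the
timestamps differ by what looks like a fixed timezone offset). If brief
layer-2 outages like this are expected on that switch, one mitigation is
raising the totem token timeout so corosync rides them out. An illustrative
cluster.conf fragment only; the 10000 ms value is an assumption to be tuned,
and config_version must be bumped for the change to propagate:

    <cluster config_version="225" name="USBackCluster">
            <totem token="10000"/>
            ...
    </cluster>

The trade-off is slower detection of genuinely dead nodes, so fencing and
failover start correspondingly later.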
>>>>>> >
>>>>>> > Logs of the Cisco switch (time is UTC):
>>>>>> >
>>>>>> > Feb 21 09:37:02.375: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/11, changed state to down
>>>>>> > Feb 21 09:37:02.459: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/4, changed state to down
>>>>>> > Feb 21 09:37:03.382: %LINK-3-UPDOWN: Interface GigabitEthernet0/11, changed state to down
>>>>>> > Feb 21 09:37:03.541: %LINK-3-UPDOWN: Interface GigabitEthernet0/4, changed state to down
>>>>>> > Feb 21 09:37:07.283: %LINK-3-UPDOWN: Interface GigabitEthernet0/11, changed state to up
>>>>>> > Feb 21 09:37:07.350: %LINK-3-UPDOWN: Interface GigabitEthernet0/4, changed state to up
>>>>>> > Feb 21 09:37:08.289: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/11, changed state to up
>>>>>> > Feb 21 09:37:09.472: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/4, changed state to up
>>>>>> > Feb 21 09:40:20.045: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/11, changed state to down
>>>>>> > Feb 21 09:40:21.043: %LINK-3-UPDOWN: Interface GigabitEthernet0/11, changed state to down
>>>>>> > Feb 21 09:40:23.401: %LINK-3-UPDOWN: Interface GigabitEthernet0/11, changed state to up

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss