On Mon, Feb 24, 2014 at 1:20 PM, cluster lab <cluster.labs@xxxxxxxxx> wrote:
> On Mon, Feb 24, 2014 at 11:37 AM, Jan Friesse <jfriesse@xxxxxxxxxx> wrote:
>> cluster lab napsal(a):
>>
>>> On Mon, Feb 24, 2014 at 11:23 AM, Fabio M. Di Nitto <fdinitto@xxxxxxxxxx> wrote:
>>>>
>>>> On 2/24/2014 8:47 AM, cluster lab wrote:
>>>>>
>>>>> On Sun, Feb 23, 2014 at 10:40 PM, Fabio M. Di Nitto <fdinitto@xxxxxxxxxx> wrote:
>>>>>>
>>>>>> On 02/23/2014 12:59 PM, cluster lab wrote:
>>>>>>>
>>>>>>>
>>>>>>> On Sat, Feb 22, 2014 at 2:28 PM, Fabio M. Di Nitto <fdinitto@xxxxxxxxxx
>>>>>>> <mailto:fdinitto@xxxxxxxxxx>> wrote:
>>>>>>>
>>>>>>> On 02/22/2014 11:10 AM, cluster lab wrote:
>>>>>>> > Hi,
>>>>>>> >
>>>>>>> > In the middle of cluster activity I received these messages (the cluster
>>>>>>> > is 3 nodes with a SAN ... GFS2 filesystem):
>>>>>>>
>>>>>>> OS? Version of the packages? cluster.conf?
>>>>>>>
>>>>>>> OS: SL (Scientific Linux 6)
>>>>>>>
>>>>>>> Packages:
>>>>>>> kernel-2.6.32-71.29.1.el6.x86_64
>>>>>>> rgmanager-3.0.12.1-12.el6.x86_64
>>>>>>> cman-3.0.12-23.el6.x86_64
>>>>>>> corosynclib-1.2.3-21.el6.x86_64
>>>>>>> corosync-1.2.3-21.el6.x86_64
>>
>> ^^^^ This is really the corosync from SL 6.0 GOLD. It is unsupported and
>> known to be pretty buggy (if the problem you hit is the only one you hit,
>> you are a pretty lucky guy).
>>
>> Please update to something a little less ancient.
>>
>> Regards,
>>   Honza
>
> The latest package in the Red Hat repository is 1.4.7. Do you recommend
> this package?
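[As an aside on the version question above: a minimal sketch of the comparison being discussed. The installed version comes from the package list in the thread; on a real node you would read it with `rpm -q --qf '%{VERSION}-%{RELEASE}\n' corosync`. The target version string is the one suggested in the thread; this only does a `sort -V` comparison, not a real RPM EVR comparison.]

```shell
# Compare the installed corosync build against the 1.4.1-7 build from the
# thread. Versions are hardcoded here for illustration; on a node you would
# fill $installed from rpm as noted above.
installed="1.2.3-21.el6"
wanted="1.4.1-7.el6"
# sort -V orders version strings; the one that sorts first is the older one.
lowest=$(printf '%s\n%s\n' "$installed" "$wanted" | sort -V | head -n1)
if [ "$lowest" = "$installed" ] && [ "$installed" != "$wanted" ]; then
    echo "corosync $installed is older than $wanted: upgrade recommended"
fi
```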
> Excuse me: 1.4.1-7
>>
>>>>>>> Cluster.conf:
>>>>>>>
>>>>>>> <?xml version="1.0"?>
>>>>>>> <cluster config_version="224" name="USBackCluster">
>>>>>>>     <fence_daemon clean_start="0" post_fail_delay="10" post_join_delay="3"/>
>>>>>>>     <clusternodes>
>>>>>>>         <clusternode name="USBack-prox1" nodeid="1" votes="1">
>>>>>>>             <fence>
>>>>>>>                 <method name="ilo">
>>>>>>>                     <device name="USBack-prox1-ilo"/>
>>>>>>>                 </method>
>>>>>>>             </fence>
>>>>>>>         </clusternode>
>>>>>>>         <clusternode name="USBack-prox2" nodeid="2" votes="1">
>>>>>>>             <fence>
>>>>>>>                 <method name="ilo">
>>>>>>>                     <device name="USBack-prox2-ilo"/>
>>>>>>>                 </method>
>>>>>>>             </fence>
>>>>>>>         </clusternode>
>>>>>>>         <clusternode name="USBack-prox3" nodeid="3" votes="1">
>>>>>>>             <fence>
>>>>>>>                 <method name="ilo">
>>>>>>>                     <device name="USBack-prox3-ilo"/>
>>>>>>>                 </method>
>>>>>>>             </fence>
>>>>>>>         </clusternode>
>>>>>>>     </clusternodes>
>>>>>>>     <cman/>
>>>>>>>     <fencedevices>
>>>>>>>         ... fence config ...
>>>>>>>     </fencedevices>
>>>>>>>     <rm>
>>>>>>>         <failoverdomains>
>>>>>>>             <failoverdomain name="VMS-Area" nofailback="0" ordered="0" restricted="0">
>>>>>>>                 <failoverdomainnode name="USBack-prox1" priority="1"/>
>>>>>>>                 <failoverdomainnode name="USBack-prox2" priority="1"/>
>>>>>>>                 <failoverdomainnode name="USBack-prox3" priority="1"/>
>>>>>>>             </failoverdomain>
>>>>>>>         </failoverdomains>
>>>>>>>         <resources>
>>>>>>>         ....
>>>>>>>
>>>>>>> > log messages on USBack-prox2:
>>>>>>> >
>>>>>>> > Feb 21 13:06:41 USBack-prox2 corosync[3911]: [QUORUM] Members[2]: 2 3
>>>>>>> > Feb 21 13:06:41 USBack-prox2 corosync[3911]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>>>>>> > Feb 21 13:06:41 USBack-prox2 rgmanager[4130]: State change: USBack-prox1 DOWN
>>>>>>> > Feb 21 13:06:41 USBack-prox2 kernel: dlm: closing connection to node 1
>>>>>>> > Feb 21 13:06:41 USBack-prox2 corosync[3911]: [CPG ] downlist received left_list: 1
>>>>>>> > Feb 21 13:06:41 USBack-prox2 corosync[3911]: [CPG ] downlist received left_list: 1
>>>>>>> > Feb 21 13:06:41 USBack-prox2 corosync[3911]: [CPG ] chosen downlist from node r(0) ip(--.--.--.22)
>>>>>>> > Feb 21 13:06:41 USBack-prox2 corosync[3911]: [MAIN ] Completed service synchronization, ready to provide service.
>>>>>>> > Feb 21 13:06:41 USBack-prox2 kernel: GFS2: fsid=USBackCluster:VMStorage1.0: jid=1: Trying to acquire journal lock...
>>>>>>> > Feb 21 13:06:41 USBack-prox2 kernel: GFS2: fsid=USBackCluster:VMStorage2.0: jid=1: Trying to acquire journal lock...
>>>>>>> > Feb 21 13:06:51 USBack-prox2 fenced[3957]: fencing node USBack-prox1
>>>>>>> > Feb 21 13:06:52 USBack-prox2 fenced[3957]: fence USBack-prox1 dev 0.0 agent fence_ipmilan result: error from agent
>>>>>>> > Feb 21 13:06:52 USBack-prox2 fenced[3957]: fence USBack-prox1 failed
>>>>>>> > Feb 21 13:06:54 USBack-prox2 kernel: dlm: connect from non cluster node
>>>>>>> > Feb 21 13:06:54 USBack-prox2 kernel: dlm: connect from non cluster node
>>>>>>>
>>>>>>> ^^^ good hint here. something is off.
>>>>>>>
>>>>>>> ?
>>>>>>
>>>>>> It means that there is something in that network that tries to connect
>>>>>> to the cluster nodes without being a cluster node.
>>>>>>
>>>>>> Fabio
>>>>>
>>>>> There is no node in the cluster network other than the cluster nodes.
>>>>> I think "node #1" retries to reconnect to the dlm and can't.
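[Editorial note on the "fence_ipmilan result: error from agent" line above: the usual first step is to run the same fence agent by hand against the iLO. A minimal sketch follows; the device address, login, and password are placeholders, not values from the thread — substitute the real settings from the `<fencedevices>` section of cluster.conf. The `echo` is left in so the sketch only prints the command instead of touching a BMC.]

```shell
# Exercise the iLO fence device manually with the same agent fenced uses.
# Placeholders: ilo_addr, ADMIN, PASSWORD must match your fencedevice entry.
agent=fence_ipmilan
ilo_addr="USBack-prox1-ilo"
cmd="$agent -a $ilo_addr -l ADMIN -p PASSWORD -o status"
echo "would run: $cmd"   # drop this echo to actually query the iLO
```

If the status query fails here too, the problem is the fence device (credentials, lanplus mode, network path to the iLO), not fenced itself.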
>>>>>
>>>>> There are two tries on node #1:
>>>>> Feb 21 13:06:54 USBack-prox1 kernel: dlm: connecting to 3
>>>>> Feb 21 13:06:54 USBack-prox1 kernel: dlm: connecting to 2
>>>>> Feb 21 13:06:54 USBack-prox1 kernel: dlm: connecting to 2
>>>>
>>>> Can you please check that iptables are set correctly and that traffic
>>>> between nodes is not behind NAT?
>>>>
>>>> Fabio
>>>
>>> iptables is disabled.
>>> Traffic between cluster nodes is flat, without any NAT.
>>>
>>>>
>>>>>
>>>>> Logs on Node#1:
>>>>> Feb 21 13:06:47 USBack-prox1 corosync[3015]: [TOTEM ] A processor failed, forming new configuration.
>>>>> Feb 21 13:06:51 USBack-prox1 kernel: dlm: connecting to 3
>>>>> Feb 21 13:06:54 USBack-prox1 kernel: dlm: connecting to 2
>>>>> Feb 21 13:06:54 USBack-prox1 kernel: dlm: connecting to 2
>>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [QUORUM] Members[2]: 1 3
>>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [CMAN ] quorum lost, blocking activity
>>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
>>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [QUORUM] Members[1]: 1
>>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [CMAN ] quorum regained, resuming activity
>>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [QUORUM] This node is within the primary component and will provide service.
>>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [QUORUM] Members[2]: 1 2
>>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [QUORUM] Members[2]: 1 2
>>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [QUORUM] Members[3]: 1 2 3
>>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [QUORUM] Members[3]: 1 2 3
>>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [CPG ] downlist received left_list: 2
>>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [CPG ] downlist received left_list: 0
>>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [CPG ] downlist received left_list: 0
>>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [CPG ] chosen downlist from node r(0) ip(--.--.--.21)
>>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [MAIN ] Completed service synchronization, ready to provide service.
>>>>>
>>>>>
>>>>> Logs on Node#3:
>>>>> Feb 21 13:06:41 USBack-prox3 corosync[2956]: [QUORUM] Members[2]: 2 3
>>>>> Feb 21 13:06:41 USBack-prox3 corosync[2956]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>>>> Feb 21 13:06:41 USBack-prox3 rgmanager[3177]: State change: USBack-prox1 DOWN
>>>>> Feb 21 13:06:41 USBack-prox3 corosync[2956]: [CPG ] downlist received left_list: 1
>>>>> Feb 21 13:06:41 USBack-prox3 corosync[2956]: [CPG ] downlist received left_list: 1
>>>>> Feb 21 13:06:41 USBack-prox3 corosync[2956]: [CPG ] chosen downlist from node r(0) ip(--.--.--.22)
>>>>> Feb 21 13:06:41 USBack-prox3 corosync[2956]: [MAIN ] Completed service synchronization, ready to provide service.
>>>>> Feb 21 13:06:41 USBack-prox3 kernel: dlm: closing connection to node 1
>>>>> Feb 21 13:06:41 USBack-prox3 fenced[3008]: fencing deferred to USBack-prox2
>>>>> Feb 21 13:06:41 USBack-prox3 kernel: GFS2: fsid=USBackCluster:VMStorage1.2: jid=1: Trying to acquire journal lock...
>>>>> Feb 21 13:06:41 USBack-prox3 kernel: GFS2: fsid=USBackCluster:VMStorage2.2: jid=1: Trying to acquire journal lock...
>>>>> Feb 21 13:06:51 USBack-prox3 kernel: dlm: connect from non cluster node
>>>>> Feb 21 13:06:55 USBack-prox3 corosync[2956]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>>>> Feb 21 13:06:55 USBack-prox3 corosync[2956]: [QUORUM] Members[3]: 1 2 3
>>>>> Feb 21 13:06:55 USBack-prox3 corosync[2956]: [QUORUM] Members[3]: 1 2 3
>>>>> Feb 21 13:06:55 USBack-prox3 rgmanager[3177]: State change: USBack-prox1 UP
>>>>> Feb 21 13:06:55 USBack-prox3 corosync[2956]: [CPG ] downlist received left_list: 2
>>>>> Feb 21 13:06:55 USBack-prox3 corosync[2956]: [CPG ] downlist received left_list: 0
>>>>> Feb 21 13:06:55 USBack-prox3 corosync[2956]: [CPG ] downlist received left_list: 0
>>>>> Feb 21 13:06:55 USBack-prox3 corosync[2956]: [CPG ] chosen downlist from node r(0) ip(--.--.--.21)
>>>>> Feb 21 13:06:55 USBack-prox3 corosync[2956]: [MAIN ] Completed service synchronization, ready to provide service.
>>>>> Feb 21 13:06:55 USBack-prox3 fenced[3008]: cpg_mcast_joined error 12 handle 4e6afb6600000000 protocol
>>>>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 3a95f87400000000 protocol
>>>>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 1e7ff52100000001 start
>>>>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 22221a7000000002 start
>>>>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 419ac24100000003 start
>>>>> Feb 21 13:06:55 USBack-prox3 fenced[3008]: cpg_mcast_joined error 12 handle 440badfc00000001 start
>>>>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 3804823e00000004 start
>>>>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 2463b9ea00000005 start
>>>>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 22221a7000000002 start
>>>>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 419ac24100000003 start
>>>>> Feb 21 13:06:55 USBack-prox3 dlm_controld[3034]: cpg_mcast_joined error 12 handle 440badfc00000000 protocol
>>>>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 3804823e00000004 start
>>>>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 2463b9ea00000005 start
>>>>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 1e7ff52100000001 start
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Fabio
>>>>>>>
>>>>>>> > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>>>>>> > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [QUORUM] Members[3]: 1 2 3
>>>>>>> > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [QUORUM] Members[3]: 1 2 3
>>>>>>> > Feb 21 13:06:55 USBack-prox2 rgmanager[4130]: State change: USBack-prox1 UP
>>>>>>> > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [CPG ] downlist received left_list: 2
>>>>>>> > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [CPG ] downlist received left_list: 0
>>>>>>> > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [CPG ] downlist received left_list: 0
>>>>>>> > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [CPG ] chosen downlist from node r(0) ip(--.--.--.21)
>>>>>>> > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [MAIN ] Completed service synchronization, ready to provide service.
>>>>>>> > Feb 21 13:06:55 USBack-prox2 gfs_controld[4029]: cpg_mcast_joined error 12 handle 3a95f87400000000 protocol
>>>>>>> > Feb 21 13:06:55 USBack-prox2 gfs_controld[4029]: cpg_mcast_joined error 12 handle 1e7ff52100000001 start
>>>>>>> > Feb 21 13:06:55 USBack-prox2 gfs_controld[4029]: cpg_mcast_joined error 12 handle 22221a7000000002 start
>>>>>>> > Feb 21 13:06:55 USBack-prox2 gfs_controld[4029]: cpg_mcast_joined error 12 handle 419ac24100000003 start
>>>>>>> > Feb 21 13:06:55 USBack-prox2 gfs_controld[4029]: cpg_mcast_joined error 12 handle 3804823e00000004 start
>>>>>>> >
>>>>>>> >
>>>>>>> > -------------------------------------------------
>>>>>>> > Then GFS2 generates error logs (activity blocked).
>>>>>>> >
>>>>>>> > Logs of the Cisco switch (time is UTC):
>>>>>>> >
>>>>>>> > Feb 21 09:37:02.375: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/11, changed state to down
>>>>>>> > Feb 21 09:37:02.459: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/4, changed state to down
>>>>>>> > Feb 21 09:37:03.382: %LINK-3-UPDOWN: Interface GigabitEthernet0/11, changed state to down
>>>>>>> > Feb 21 09:37:03.541: %LINK-3-UPDOWN: Interface GigabitEthernet0/4, changed state to down
>>>>>>> > Feb 21 09:37:07.283: %LINK-3-UPDOWN: Interface GigabitEthernet0/11, changed state to up
>>>>>>> > Feb 21 09:37:07.350: %LINK-3-UPDOWN: Interface GigabitEthernet0/4, changed state to up
>>>>>>> > Feb 21 09:37:08.289: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/11, changed state to up
>>>>>>> > Feb 21 09:37:09.472: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/4, changed state to up
>>>>>>> > Feb 21 09:40:20.045: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/11, changed state to down
>>>>>>> > Feb 21 09:40:21.043: %LINK-3-UPDOWN: Interface GigabitEthernet0/11, changed state to down
>>>>>>> > Feb 21 09:40:23.401: %LINK-3-UPDOWN: Interface GigabitEthernet0/11, changed state to up
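[Editorial note: the switch port flaps above, together with Fabio's earlier question about iptables/NAT and the "dlm: connect from non cluster node" messages, all point at the inter-node network path. A small sketch of the ports the cman/corosync/dlm stack needs open between nodes, assuming the defaults (corosync 1.x mcastport 5405, which also uses mcastport-1 = 5404; dlm on TCP 21064) since cluster.conf above does not override them.]

```shell
# Default ports the cluster stack uses between nodes (assumed defaults,
# not taken from the thread's cluster.conf, which does not override them).
cat <<'EOF'
corosync totem: udp/5404 udp/5405 (multicast must pass between all nodes)
dlm:            tcp/21064
EOF
# On a live node you could then confirm nothing filters them, e.g.:
#   iptables -L -n
#   netstat -ulpn | grep corosync
```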
_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss