On Mon, Feb 24, 2014 at 11:37 AM, Jan Friesse <jfriesse@xxxxxxxxxx> wrote:
> cluster lab napsal(a):
>
>> On Mon, Feb 24, 2014 at 11:23 AM, Fabio M. Di Nitto <fdinitto@xxxxxxxxxx> wrote:
>>>
>>> On 2/24/2014 8:47 AM, cluster lab wrote:
>>>>
>>>> On Sun, Feb 23, 2014 at 10:40 PM, Fabio M. Di Nitto
>>>> <fdinitto@xxxxxxxxxx> wrote:
>>>>>
>>>>> On 02/23/2014 12:59 PM, cluster lab wrote:
>>>>>>
>>>>>> On Sat, Feb 22, 2014 at 2:28 PM, Fabio M. Di Nitto
>>>>>> <fdinitto@xxxxxxxxxx> wrote:
>>>>>>
>>>>>> On 02/22/2014 11:10 AM, cluster lab wrote:
>>>>>> > Hi,
>>>>>> >
>>>>>> > In the middle of cluster activity I received the messages below
>>>>>> > (the cluster is 3 nodes with SAN ... GFS2 filesystem).
>>>>>>
>>>>>> OS? Version of the packages? cluster.conf?
>>>>>>
>>>>>> OS: SL (Scientific Linux 6)
>>>>>>
>>>>>> Packages:
>>>>>> kernel-2.6.32-71.29.1.el6.x86_64
>>>>>> rgmanager-3.0.12.1-12.el6.x86_64
>>>>>> cman-3.0.12-23.el6.x86_64
>>>>>> corosynclib-1.2.3-21.el6.x86_64
>>>>>> corosync-1.2.3-21.el6.x86_64
>
> ^^^^ This is really, really corosync for SL 6.0 GOLD. It is unsupported
> and known to be pretty buggy (if the problem you hit is the only one
> you hit, you are a pretty lucky guy).
>
> Please update to something a little less ancient.
>
> Regards,
>   Honza

The latest package in the Red Hat repository is 1.4.7. Do you recommend that package?
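For reference, a minimal sketch of what the update path looks like on SL 6.x
(package names are taken from the list above; the exact versions pulled in
depend on which update repositories are enabled, so treat this as an
assumption rather than a tested procedure):

    # pull the whole cman/corosync stack forward in one transaction
    yum update corosync corosynclib cman rgmanager

    # after restarting the stack on a node, confirm the installed version
    corosync -v

The cluster stack would need a clean stop/start on each node, one node at a
time with services relocated, for the new version to take effect.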
>>>>>> Cluster.conf:
>>>>>>
>>>>>> <?xml version="1.0"?>
>>>>>> <cluster config_version="224" name="USBackCluster">
>>>>>>     <fence_daemon clean_start="0" post_fail_delay="10" post_join_delay="3"/>
>>>>>>     <clusternodes>
>>>>>>         <clusternode name="USBack-prox1" nodeid="1" votes="1">
>>>>>>             <fence>
>>>>>>                 <method name="ilo">
>>>>>>                     <device name="USBack-prox1-ilo"/>
>>>>>>                 </method>
>>>>>>             </fence>
>>>>>>         </clusternode>
>>>>>>         <clusternode name="USBack-prox2" nodeid="2" votes="1">
>>>>>>             <fence>
>>>>>>                 <method name="ilo">
>>>>>>                     <device name="USBack-prox2-ilo"/>
>>>>>>                 </method>
>>>>>>             </fence>
>>>>>>         </clusternode>
>>>>>>         <clusternode name="USBack-prox3" nodeid="3" votes="1">
>>>>>>             <fence>
>>>>>>                 <method name="ilo">
>>>>>>                     <device name="USBack-prox3-ilo"/>
>>>>>>                 </method>
>>>>>>             </fence>
>>>>>>         </clusternode>
>>>>>>     </clusternodes>
>>>>>>     <cman/>
>>>>>>     <fencedevices>
>>>>>>         ... fence config ...
>>>>>>     </fencedevices>
>>>>>>     <rm>
>>>>>>         <failoverdomains>
>>>>>>             <failoverdomain name="VMS-Area" nofailback="0" ordered="0" restricted="0">
>>>>>>                 <failoverdomainnode name="USBack-prox1" priority="1"/>
>>>>>>                 <failoverdomainnode name="USBack-prox2" priority="1"/>
>>>>>>                 <failoverdomainnode name="USBack-prox3" priority="1"/>
>>>>>>             </failoverdomain>
>>>>>>         </failoverdomains>
>>>>>>         <resources>
>>>>>>         ....
>>>>>>
>>>>>> > Log messages on USBack-prox2:
>>>>>> >
>>>>>> > Feb 21 13:06:41 USBack-prox2 corosync[3911]:   [QUORUM] Members[2]: 2 3
>>>>>> > Feb 21 13:06:41 USBack-prox2 corosync[3911]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>>>>> > Feb 21 13:06:41 USBack-prox2 rgmanager[4130]: State change: USBack-prox1 DOWN
>>>>>> > Feb 21 13:06:41 USBack-prox2 kernel: dlm: closing connection to node 1
>>>>>> > Feb 21 13:06:41 USBack-prox2 corosync[3911]:   [CPG   ] downlist received left_list: 1
>>>>>> > Feb 21 13:06:41 USBack-prox2 corosync[3911]:   [CPG   ] downlist received left_list: 1
>>>>>> > Feb 21 13:06:41 USBack-prox2 corosync[3911]:   [CPG   ] chosen downlist from node r(0) ip(--.--.--.22)
>>>>>> > Feb 21 13:06:41 USBack-prox2 corosync[3911]:   [MAIN  ] Completed service synchronization, ready to provide service.
>>>>>> > Feb 21 13:06:41 USBack-prox2 kernel: GFS2: fsid=USBackCluster:VMStorage1.0: jid=1: Trying to acquire journal lock...
>>>>>> > Feb 21 13:06:41 USBack-prox2 kernel: GFS2: fsid=USBackCluster:VMStorage2.0: jid=1: Trying to acquire journal lock...
>>>>>> > Feb 21 13:06:51 USBack-prox2 fenced[3957]: fencing node USBack-prox1
>>>>>> > Feb 21 13:06:52 USBack-prox2 fenced[3957]: fence USBack-prox1 dev 0.0 agent fence_ipmilan result: error from agent
>>>>>> > Feb 21 13:06:52 USBack-prox2 fenced[3957]: fence USBack-prox1 failed
>>>>>> > Feb 21 13:06:54 USBack-prox2 kernel: dlm: connect from non cluster node
>>>>>> > Feb 21 13:06:54 USBack-prox2 kernel: dlm: connect from non cluster node
>>>>>>
>>>>>> ^^^ good hint here. Something is off.
>>>>>>
>>>>>> ?
>>>>>
>>>>> It means that there is something in that network that tries to connect
>>>>> to the cluster node without being a cluster node.
>>>>>
>>>>> Fabio
>>>>
>>>> There is no node in the cluster network other than the cluster nodes.
>>>> I think node #1 retries to reconnect DLM and can't.
>>>>
>>>> There are two tries on node #1:
>>>> Feb 21 13:06:54 USBack-prox1 kernel: dlm: connecting to 3
>>>> Feb 21 13:06:54 USBack-prox1 kernel: dlm: connecting to 2
>>>> Feb 21 13:06:54 USBack-prox1 kernel: dlm: connecting to 2
>>>
>>> Can you please check that iptables are set correctly and that traffic
>>> between nodes is not behind NAT?
>>>
>>> Fabio
>>
>> iptables is disabled, and traffic between the cluster nodes is flat,
>> without any NAT.
>>
>>>> Logs on Node#1:
>>>> Feb 21 13:06:47 USBack-prox1 corosync[3015]:   [TOTEM ] A processor failed, forming new configuration.
>>>> Feb 21 13:06:51 USBack-prox1 kernel: dlm: connecting to 3
>>>> Feb 21 13:06:54 USBack-prox1 kernel: dlm: connecting to 2
>>>> Feb 21 13:06:54 USBack-prox1 kernel: dlm: connecting to 2
>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]:   [QUORUM] Members[2]: 1 3
>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]:   [CMAN  ] quorum lost, blocking activity
>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]:   [QUORUM] This node is within the non-primary component and will NOT provide any services.
>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]:   [QUORUM] Members[1]: 1
>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]:   [CMAN  ] quorum regained, resuming activity
>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]:   [QUORUM] This node is within the primary component and will provide service.
>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]:   [QUORUM] Members[2]: 1 2
>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]:   [QUORUM] Members[2]: 1 2
>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]:   [QUORUM] Members[3]: 1 2 3
>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]:   [QUORUM] Members[3]: 1 2 3
>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]:   [CPG   ] downlist received left_list: 2
>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]:   [CPG   ] downlist received left_list: 0
>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]:   [CPG   ] downlist received left_list: 0
>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]:   [CPG   ] chosen downlist from node r(0) ip(--.--.--.21)
>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]:   [MAIN  ] Completed service synchronization, ready to provide service.
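Since iptables is ruled out, it may still be worth confirming on the wire who
is actually hitting the DLM port when the "connect from non cluster node"
messages appear. A sketch using the stock defaults (DLM listens on TCP 21064,
corosync totem uses UDP 5404-5405; adjust if your configuration overrides
these):

    # iptables really empty? all chains should show policy ACCEPT, no rules
    iptables -L -n

    # every established peer on the DLM port should be a cluster node address
    netstat -tn | grep 21064

    # corosync totem sockets (multicast by default)
    netstat -anu | grep -E ':540[45]'

An unexpected source address here (for example a second NIC, or a bonding
slave failing over to a different IP) is one plausible way to get dlm
rejecting a connection as "non cluster node".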
>>>>
>>>> Logs on Node#3:
>>>> Feb 21 13:06:41 USBack-prox3 corosync[2956]:   [QUORUM] Members[2]: 2 3
>>>> Feb 21 13:06:41 USBack-prox3 corosync[2956]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>>> Feb 21 13:06:41 USBack-prox3 rgmanager[3177]: State change: USBack-prox1 DOWN
>>>> Feb 21 13:06:41 USBack-prox3 corosync[2956]:   [CPG   ] downlist received left_list: 1
>>>> Feb 21 13:06:41 USBack-prox3 corosync[2956]:   [CPG   ] downlist received left_list: 1
>>>> Feb 21 13:06:41 USBack-prox3 corosync[2956]:   [CPG   ] chosen downlist from node r(0) ip(--.--.--.22)
>>>> Feb 21 13:06:41 USBack-prox3 corosync[2956]:   [MAIN  ] Completed service synchronization, ready to provide service.
>>>> Feb 21 13:06:41 USBack-prox3 kernel: dlm: closing connection to node 1
>>>> Feb 21 13:06:41 USBack-prox3 fenced[3008]: fencing deferred to USBack-prox2
>>>> Feb 21 13:06:41 USBack-prox3 kernel: GFS2: fsid=USBackCluster:VMStorage1.2: jid=1: Trying to acquire journal lock...
>>>> Feb 21 13:06:41 USBack-prox3 kernel: GFS2: fsid=USBackCluster:VMStorage2.2: jid=1: Trying to acquire journal lock...
>>>> Feb 21 13:06:51 USBack-prox3 kernel: dlm: connect from non cluster node
>>>> Feb 21 13:06:55 USBack-prox3 corosync[2956]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>>> Feb 21 13:06:55 USBack-prox3 corosync[2956]:   [QUORUM] Members[3]: 1 2 3
>>>> Feb 21 13:06:55 USBack-prox3 corosync[2956]:   [QUORUM] Members[3]: 1 2 3
>>>> Feb 21 13:06:55 USBack-prox3 rgmanager[3177]: State change: USBack-prox1 UP
>>>> Feb 21 13:06:55 USBack-prox3 corosync[2956]:   [CPG   ] downlist received left_list: 2
>>>> Feb 21 13:06:55 USBack-prox3 corosync[2956]:   [CPG   ] downlist received left_list: 0
>>>> Feb 21 13:06:55 USBack-prox3 corosync[2956]:   [CPG   ] downlist received left_list: 0
>>>> Feb 21 13:06:55 USBack-prox3 corosync[2956]:   [CPG   ] chosen downlist from node r(0) ip(--.--.--.21)
>>>> Feb 21 13:06:55 USBack-prox3 corosync[2956]:   [MAIN  ] Completed service synchronization, ready to provide service.
>>>> Feb 21 13:06:55 USBack-prox3 fenced[3008]: cpg_mcast_joined error 12 handle 4e6afb6600000000 protocol
>>>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 3a95f87400000000 protocol
>>>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 1e7ff52100000001 start
>>>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 22221a7000000002 start
>>>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 419ac24100000003 start
>>>> Feb 21 13:06:55 USBack-prox3 fenced[3008]: cpg_mcast_joined error 12 handle 440badfc00000001 start
>>>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 3804823e00000004 start
>>>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 2463b9ea00000005 start
>>>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 22221a7000000002 start
>>>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 419ac24100000003 start
>>>> Feb 21 13:06:55 USBack-prox3 dlm_controld[3034]: cpg_mcast_joined error 12 handle 440badfc00000000 protocol
>>>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 3804823e00000004 start
>>>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 2463b9ea00000005 start
>>>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 1e7ff52100000001 start
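One more thing stands out above: "fence USBack-prox1 ... agent fence_ipmilan
result: error from agent", so fencing failed exactly when it was needed and
was then deferred. A sketch for testing the fence path by hand; the address
and credentials are placeholders to substitute with your real iLO values:

    # ask the agent directly for power status over IPMI-over-LAN
    fence_ipmilan -a <ilo-address> -l <user> -p <password> -o status

    # or drive the configured agent end to end through the cluster stack
    # (WARNING: this really power-cycles the node, do it in a maintenance window)
    fence_node USBack-prox1

Note that fence_ipmilan speaks IPMI (UDP port 623) to the management
processor, which on HP iLO typically has to be enabled separately
("IPMI/DCMI over LAN"); fence_ilo against the iLO web interface is the
usual alternative if IPMI is not available.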
>>>>>
>>>>>> Fabio
>>>>>>
>>>>>> > Feb 21 13:06:55 USBack-prox2 corosync[3911]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>>>>> > Feb 21 13:06:55 USBack-prox2 corosync[3911]:   [QUORUM] Members[3]: 1 2 3
>>>>>> > Feb 21 13:06:55 USBack-prox2 corosync[3911]:   [QUORUM] Members[3]: 1 2 3
>>>>>> > Feb 21 13:06:55 USBack-prox2 rgmanager[4130]: State change: USBack-prox1 UP
>>>>>> > Feb 21 13:06:55 USBack-prox2 corosync[3911]:   [CPG   ] downlist received left_list: 2
>>>>>> > Feb 21 13:06:55 USBack-prox2 corosync[3911]:   [CPG   ] downlist received left_list: 0
>>>>>> > Feb 21 13:06:55 USBack-prox2 corosync[3911]:   [CPG   ] downlist received left_list: 0
>>>>>> > Feb 21 13:06:55 USBack-prox2 corosync[3911]:   [CPG   ] chosen downlist from node r(0) ip(--.--.--.21)
>>>>>> > Feb 21 13:06:55 USBack-prox2 corosync[3911]:   [MAIN  ] Completed service synchronization, ready to provide service.
>>>>>> > Feb 21 13:06:55 USBack-prox2 gfs_controld[4029]: cpg_mcast_joined error 12 handle 3a95f87400000000 protocol
>>>>>> > Feb 21 13:06:55 USBack-prox2 gfs_controld[4029]: cpg_mcast_joined error 12 handle 1e7ff52100000001 start
>>>>>> > Feb 21 13:06:55 USBack-prox2 gfs_controld[4029]: cpg_mcast_joined error 12 handle 22221a7000000002 start
>>>>>> > Feb 21 13:06:55 USBack-prox2 gfs_controld[4029]: cpg_mcast_joined error 12 handle 419ac24100000003 start
>>>>>> > Feb 21 13:06:55 USBack-prox2 gfs_controld[4029]: cpg_mcast_joined error 12 handle 3804823e00000004 start
>>>>>> >
>>>>>> > -------------------------------------------------
>>>>>> > Then GFS2 generates error logs (activity blocked).
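For what it's worth, the 13:06:41 to 13:06:55 membership bounce lines up
with the roughly five-second port flaps in the switch logs below (the
timestamps differ by what looks like a fixed timezone offset). If brief
layer-2 outages like this are expected on that switch, one mitigation is
raising the totem token timeout so corosync rides them out. An illustrative
cluster.conf fragment only; the 10000 ms value is an assumption to be tuned,
and config_version must be bumped for the change to propagate:

    <cluster config_version="225" name="USBackCluster">
            <totem token="10000"/>
            ...
    </cluster>

The trade-off is slower detection of genuinely dead nodes, so fencing and
failover start correspondingly later.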
>>>>>> >
>>>>>> > Logs of the Cisco switch (time is UTC):
>>>>>> >
>>>>>> > Feb 21 09:37:02.375: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/11, changed state to down
>>>>>> > Feb 21 09:37:02.459: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/4, changed state to down
>>>>>> > Feb 21 09:37:03.382: %LINK-3-UPDOWN: Interface GigabitEthernet0/11, changed state to down
>>>>>> > Feb 21 09:37:03.541: %LINK-3-UPDOWN: Interface GigabitEthernet0/4, changed state to down
>>>>>> > Feb 21 09:37:07.283: %LINK-3-UPDOWN: Interface GigabitEthernet0/11, changed state to up
>>>>>> > Feb 21 09:37:07.350: %LINK-3-UPDOWN: Interface GigabitEthernet0/4, changed state to up
>>>>>> > Feb 21 09:37:08.289: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/11, changed state to up
>>>>>> > Feb 21 09:37:09.472: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/4, changed state to up
>>>>>> > Feb 21 09:40:20.045: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/11, changed state to down
>>>>>> > Feb 21 09:40:21.043: %LINK-3-UPDOWN: Interface GigabitEthernet0/11, changed state to down
>>>>>> > Feb 21 09:40:23.401: %LINK-3-UPDOWN: Interface GigabitEthernet0/11, changed state to up

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss