On Mon, Feb 24, 2014 at 1:20 PM, cluster lab <cluster.labs@xxxxxxxxx> wrote:
> On Mon, Feb 24, 2014 at 11:37 AM, Jan Friesse <jfriesse@xxxxxxxxxx> wrote:
>> cluster lab napsal(a):
>>
>>> On Mon, Feb 24, 2014 at 11:23 AM, Fabio M. Di Nitto <fdinitto@xxxxxxxxxx> wrote:
>>>>
>>>> On 2/24/2014 8:47 AM, cluster lab wrote:
>>>>>
>>>>> On Sun, Feb 23, 2014 at 10:40 PM, Fabio M. Di Nitto <fdinitto@xxxxxxxxxx> wrote:
>>>>>>
>>>>>> On 02/23/2014 12:59 PM, cluster lab wrote:
>>>>>>>
>>>>>>>
>>>>>>> On Sat, Feb 22, 2014 at 2:28 PM, Fabio M. Di Nitto <fdinitto@xxxxxxxxxx
>>>>>>> <mailto:fdinitto@xxxxxxxxxx>> wrote:
>>>>>>>
>>>>>>> On 02/22/2014 11:10 AM, cluster lab wrote:
>>>>>>> > Hi,
>>>>>>> >
>>>>>>> > In the middle of cluster activity I received these messages (the cluster
>>>>>>> > is 3 nodes with a SAN ... GFS2 filesystem):
>>>>>>>
>>>>>>> OS? Version of the packages? cluster.conf?
>>>>>>>
>>>>>>> OS: SL (Scientific Linux 6)
>>>>>>>
>>>>>>> Packages:
>>>>>>> kernel-2.6.32-71.29.1.el6.x86_64
>>>>>>> rgmanager-3.0.12.1-12.el6.x86_64
>>>>>>> cman-3.0.12-23.el6.x86_64
>>>>>>> corosynclib-1.2.3-21.el6.x86_64
>>>>>>> corosync-1.2.3-21.el6.x86_64
>>
>> ^^^^ This is really the corosync from SL 6.0 GOLD. It is unsupported and
>> known to be pretty buggy (if the problem you hit is the only one you hit,
>> you are a pretty lucky guy).
>>
>> Please update to something a little less ancient.
>>
>> Regards,
>>   Honza
>
> The latest package in the Red Hat repository is 1.4.7. Do you recommend
> this package?
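[As an aside on the version question above: a minimal sketch of the comparison being discussed. The installed version comes from the package list in the thread; on a real node you would read it with `rpm -q --qf '%{VERSION}-%{RELEASE}\n' corosync`. The target version string is the one suggested in the thread; this only does a `sort -V` comparison, not a real RPM EVR comparison.]

```shell
# Compare the installed corosync build against the 1.4.1-7 build from the
# thread. Versions are hardcoded here for illustration; on a node you would
# fill $installed from rpm as noted above.
installed="1.2.3-21.el6"
wanted="1.4.1-7.el6"
# sort -V orders version strings; the one that sorts first is the older one.
lowest=$(printf '%s\n%s\n' "$installed" "$wanted" | sort -V | head -n1)
if [ "$lowest" = "$installed" ] && [ "$installed" != "$wanted" ]; then
    echo "corosync $installed is older than $wanted: upgrade recommended"
fi
```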
> Excuse me: 1.4.1-7
>>
>>>>>>> Cluster.conf:
>>>>>>>
>>>>>>> <?xml version="1.0"?>
>>>>>>> <cluster config_version="224" name="USBackCluster">
>>>>>>>     <fence_daemon clean_start="0" post_fail_delay="10" post_join_delay="3"/>
>>>>>>>     <clusternodes>
>>>>>>>         <clusternode name="USBack-prox1" nodeid="1" votes="1">
>>>>>>>             <fence>
>>>>>>>                 <method name="ilo">
>>>>>>>                     <device name="USBack-prox1-ilo"/>
>>>>>>>                 </method>
>>>>>>>             </fence>
>>>>>>>         </clusternode>
>>>>>>>         <clusternode name="USBack-prox2" nodeid="2" votes="1">
>>>>>>>             <fence>
>>>>>>>                 <method name="ilo">
>>>>>>>                     <device name="USBack-prox2-ilo"/>
>>>>>>>                 </method>
>>>>>>>             </fence>
>>>>>>>         </clusternode>
>>>>>>>         <clusternode name="USBack-prox3" nodeid="3" votes="1">
>>>>>>>             <fence>
>>>>>>>                 <method name="ilo">
>>>>>>>                     <device name="USBack-prox3-ilo"/>
>>>>>>>                 </method>
>>>>>>>             </fence>
>>>>>>>         </clusternode>
>>>>>>>     </clusternodes>
>>>>>>>     <cman/>
>>>>>>>     <fencedevices>
>>>>>>>         ... fence config ...
>>>>>>>     </fencedevices>
>>>>>>>     <rm>
>>>>>>>         <failoverdomains>
>>>>>>>             <failoverdomain name="VMS-Area" nofailback="0" ordered="0" restricted="0">
>>>>>>>                 <failoverdomainnode name="USBack-prox1" priority="1"/>
>>>>>>>                 <failoverdomainnode name="USBack-prox2" priority="1"/>
>>>>>>>                 <failoverdomainnode name="USBack-prox3" priority="1"/>
>>>>>>>             </failoverdomain>
>>>>>>>         </failoverdomains>
>>>>>>>         <resources>
>>>>>>>         ....
>>>>>>>
>>>>>>> > log messages on USBack-prox2:
>>>>>>> >
>>>>>>> > Feb 21 13:06:41 USBack-prox2 corosync[3911]: [QUORUM] Members[2]: 2 3
>>>>>>> > Feb 21 13:06:41 USBack-prox2 corosync[3911]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>>>>>> > Feb 21 13:06:41 USBack-prox2 rgmanager[4130]: State change: USBack-prox1 DOWN
>>>>>>> > Feb 21 13:06:41 USBack-prox2 kernel: dlm: closing connection to node 1
>>>>>>> > Feb 21 13:06:41 USBack-prox2 corosync[3911]: [CPG ] downlist received left_list: 1
>>>>>>> > Feb 21 13:06:41 USBack-prox2 corosync[3911]: [CPG ] downlist received left_list: 1
>>>>>>> > Feb 21 13:06:41 USBack-prox2 corosync[3911]: [CPG ] chosen downlist from node r(0) ip(--.--.--.22)
>>>>>>> > Feb 21 13:06:41 USBack-prox2 corosync[3911]: [MAIN ] Completed service synchronization, ready to provide service.
>>>>>>> > Feb 21 13:06:41 USBack-prox2 kernel: GFS2: fsid=USBackCluster:VMStorage1.0: jid=1: Trying to acquire journal lock...
>>>>>>> > Feb 21 13:06:41 USBack-prox2 kernel: GFS2: fsid=USBackCluster:VMStorage2.0: jid=1: Trying to acquire journal lock...
>>>>>>> > Feb 21 13:06:51 USBack-prox2 fenced[3957]: fencing node USBack-prox1
>>>>>>> > Feb 21 13:06:52 USBack-prox2 fenced[3957]: fence USBack-prox1 dev 0.0 agent fence_ipmilan result: error from agent
>>>>>>> > Feb 21 13:06:52 USBack-prox2 fenced[3957]: fence USBack-prox1 failed
>>>>>>> > Feb 21 13:06:54 USBack-prox2 kernel: dlm: connect from non cluster node
>>>>>>> > Feb 21 13:06:54 USBack-prox2 kernel: dlm: connect from non cluster node
>>>>>>>
>>>>>>> ^^^ good hint here. something is off.
>>>>>>>
>>>>>>> ?
>>>>>>
>>>>>> It means that there is something in that network that tries to connect
>>>>>> to the cluster nodes without being a cluster node.
>>>>>>
>>>>>> Fabio
>>>>>
>>>>> There is no node in the cluster network other than the cluster nodes.
>>>>> I think "node #1" retries to reconnect to the dlm and can't.
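[Editorial note on the "fence_ipmilan result: error from agent" line above: the usual first step is to run the same fence agent by hand against the iLO. A minimal sketch follows; the device address, login, and password are placeholders, not values from the thread — substitute the real settings from the `<fencedevices>` section of cluster.conf. The `echo` is left in so the sketch only prints the command instead of touching a BMC.]

```shell
# Exercise the iLO fence device manually with the same agent fenced uses.
# Placeholders: ilo_addr, ADMIN, PASSWORD must match your fencedevice entry.
agent=fence_ipmilan
ilo_addr="USBack-prox1-ilo"
cmd="$agent -a $ilo_addr -l ADMIN -p PASSWORD -o status"
echo "would run: $cmd"   # drop this echo to actually query the iLO
```

If the status query fails here too, the problem is the fence device (credentials, lanplus mode, network path to the iLO), not fenced itself.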
>>>>>
>>>>> There are two tries on node #1:
>>>>> Feb 21 13:06:54 USBack-prox1 kernel: dlm: connecting to 3
>>>>> Feb 21 13:06:54 USBack-prox1 kernel: dlm: connecting to 2
>>>>> Feb 21 13:06:54 USBack-prox1 kernel: dlm: connecting to 2
>>>>
>>>> Can you please check that iptables are set correctly and that traffic
>>>> between nodes is not behind NAT?
>>>>
>>>> Fabio
>>>
>>> iptables is disabled.
>>> Traffic between cluster nodes is flat, without any NAT.
>>>
>>>>
>>>>>
>>>>> Logs on Node#1:
>>>>> Feb 21 13:06:47 USBack-prox1 corosync[3015]: [TOTEM ] A processor failed, forming new configuration.
>>>>> Feb 21 13:06:51 USBack-prox1 kernel: dlm: connecting to 3
>>>>> Feb 21 13:06:54 USBack-prox1 kernel: dlm: connecting to 2
>>>>> Feb 21 13:06:54 USBack-prox1 kernel: dlm: connecting to 2
>>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [QUORUM] Members[2]: 1 3
>>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [CMAN ] quorum lost, blocking activity
>>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
>>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [QUORUM] Members[1]: 1
>>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [CMAN ] quorum regained, resuming activity
>>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [QUORUM] This node is within the primary component and will provide service.
>>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [QUORUM] Members[2]: 1 2
>>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [QUORUM] Members[2]: 1 2
>>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [QUORUM] Members[3]: 1 2 3
>>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [QUORUM] Members[3]: 1 2 3
>>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [CPG ] downlist received left_list: 2
>>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [CPG ] downlist received left_list: 0
>>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [CPG ] downlist received left_list: 0
>>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [CPG ] chosen downlist from node r(0) ip(--.--.--.21)
>>>>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [MAIN ] Completed service synchronization, ready to provide service.
>>>>>
>>>>>
>>>>> Logs on Node#3:
>>>>> Feb 21 13:06:41 USBack-prox3 corosync[2956]: [QUORUM] Members[2]: 2 3
>>>>> Feb 21 13:06:41 USBack-prox3 corosync[2956]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>>>> Feb 21 13:06:41 USBack-prox3 rgmanager[3177]: State change: USBack-prox1 DOWN
>>>>> Feb 21 13:06:41 USBack-prox3 corosync[2956]: [CPG ] downlist received left_list: 1
>>>>> Feb 21 13:06:41 USBack-prox3 corosync[2956]: [CPG ] downlist received left_list: 1
>>>>> Feb 21 13:06:41 USBack-prox3 corosync[2956]: [CPG ] chosen downlist from node r(0) ip(--.--.--.22)
>>>>> Feb 21 13:06:41 USBack-prox3 corosync[2956]: [MAIN ] Completed service synchronization, ready to provide service.
>>>>> Feb 21 13:06:41 USBack-prox3 kernel: dlm: closing connection to node 1
>>>>> Feb 21 13:06:41 USBack-prox3 fenced[3008]: fencing deferred to USBack-prox2
>>>>> Feb 21 13:06:41 USBack-prox3 kernel: GFS2: fsid=USBackCluster:VMStorage1.2: jid=1: Trying to acquire journal lock...
>>>>> Feb 21 13:06:41 USBack-prox3 kernel: GFS2: fsid=USBackCluster:VMStorage2.2: jid=1: Trying to acquire journal lock...
>>>>> Feb 21 13:06:51 USBack-prox3 kernel: dlm: connect from non cluster node
>>>>> Feb 21 13:06:55 USBack-prox3 corosync[2956]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>>>> Feb 21 13:06:55 USBack-prox3 corosync[2956]: [QUORUM] Members[3]: 1 2 3
>>>>> Feb 21 13:06:55 USBack-prox3 corosync[2956]: [QUORUM] Members[3]: 1 2 3
>>>>> Feb 21 13:06:55 USBack-prox3 rgmanager[3177]: State change: USBack-prox1 UP
>>>>> Feb 21 13:06:55 USBack-prox3 corosync[2956]: [CPG ] downlist received left_list: 2
>>>>> Feb 21 13:06:55 USBack-prox3 corosync[2956]: [CPG ] downlist received left_list: 0
>>>>> Feb 21 13:06:55 USBack-prox3 corosync[2956]: [CPG ] downlist received left_list: 0
>>>>> Feb 21 13:06:55 USBack-prox3 corosync[2956]: [CPG ] chosen downlist from node r(0) ip(--.--.--.21)
>>>>> Feb 21 13:06:55 USBack-prox3 corosync[2956]: [MAIN ] Completed service synchronization, ready to provide service.
>>>>> Feb 21 13:06:55 USBack-prox3 fenced[3008]: cpg_mcast_joined error 12 handle 4e6afb6600000000 protocol
>>>>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 3a95f87400000000 protocol
>>>>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 1e7ff52100000001 start
>>>>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 22221a7000000002 start
>>>>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 419ac24100000003 start
>>>>> Feb 21 13:06:55 USBack-prox3 fenced[3008]: cpg_mcast_joined error 12 handle 440badfc00000001 start
>>>>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 3804823e00000004 start
>>>>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 2463b9ea00000005 start
>>>>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 22221a7000000002 start
>>>>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 419ac24100000003 start
>>>>> Feb 21 13:06:55 USBack-prox3 dlm_controld[3034]: cpg_mcast_joined error 12 handle 440badfc00000000 protocol
>>>>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 3804823e00000004 start
>>>>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 2463b9ea00000005 start
>>>>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 1e7ff52100000001 start
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Fabio
>>>>>>>
>>>>>>> > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>>>>>> > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [QUORUM] Members[3]: 1 2 3
>>>>>>> > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [QUORUM] Members[3]: 1 2 3
>>>>>>> > Feb 21 13:06:55 USBack-prox2 rgmanager[4130]: State change: USBack-prox1 UP
>>>>>>> > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [CPG ] downlist received left_list: 2
>>>>>>> > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [CPG ] downlist received left_list: 0
>>>>>>> > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [CPG ] downlist received left_list: 0
>>>>>>> > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [CPG ] chosen downlist from node r(0) ip(--.--.--.21)
>>>>>>> > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [MAIN ] Completed service synchronization, ready to provide service.
>>>>>>> > Feb 21 13:06:55 USBack-prox2 gfs_controld[4029]: cpg_mcast_joined error 12 handle 3a95f87400000000 protocol
>>>>>>> > Feb 21 13:06:55 USBack-prox2 gfs_controld[4029]: cpg_mcast_joined error 12 handle 1e7ff52100000001 start
>>>>>>> > Feb 21 13:06:55 USBack-prox2 gfs_controld[4029]: cpg_mcast_joined error 12 handle 22221a7000000002 start
>>>>>>> > Feb 21 13:06:55 USBack-prox2 gfs_controld[4029]: cpg_mcast_joined error 12 handle 419ac24100000003 start
>>>>>>> > Feb 21 13:06:55 USBack-prox2 gfs_controld[4029]: cpg_mcast_joined error 12 handle 3804823e00000004 start
>>>>>>> >
>>>>>>> >
>>>>>>> > -------------------------------------------------
>>>>>>> > Then GFS2 generates error logs (activity blocked).
>>>>>>> >
>>>>>>> > Logs of the Cisco switch (time is UTC):
>>>>>>> >
>>>>>>> > Feb 21 09:37:02.375: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/11, changed state to down
>>>>>>> > Feb 21 09:37:02.459: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/4, changed state to down
>>>>>>> > Feb 21 09:37:03.382: %LINK-3-UPDOWN: Interface GigabitEthernet0/11, changed state to down
>>>>>>> > Feb 21 09:37:03.541: %LINK-3-UPDOWN: Interface GigabitEthernet0/4, changed state to down
>>>>>>> > Feb 21 09:37:07.283: %LINK-3-UPDOWN: Interface GigabitEthernet0/11, changed state to up
>>>>>>> > Feb 21 09:37:07.350: %LINK-3-UPDOWN: Interface GigabitEthernet0/4, changed state to up
>>>>>>> > Feb 21 09:37:08.289: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/11, changed state to up
>>>>>>> > Feb 21 09:37:09.472: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/4, changed state to up
>>>>>>> > Feb 21 09:40:20.045: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/11, changed state to down
>>>>>>> > Feb 21 09:40:21.043: %LINK-3-UPDOWN: Interface GigabitEthernet0/11, changed state to down
>>>>>>> > Feb 21 09:40:23.401: %LINK-3-UPDOWN: Interface GigabitEthernet0/11, changed state to up
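[Editorial note: the switch port flaps above, together with Fabio's earlier question about iptables/NAT and the "dlm: connect from non cluster node" messages, all point at the inter-node network path. A small sketch of the ports the cman/corosync/dlm stack needs open between nodes, assuming the defaults (corosync 1.x mcastport 5405, which also uses mcastport-1 = 5404; dlm on TCP 21064) since cluster.conf above does not override them.]

```shell
# Default ports the cluster stack uses between nodes (assumed defaults,
# not taken from the thread's cluster.conf, which does not override them).
cat <<'EOF'
corosync totem: udp/5404 udp/5405 (multicast must pass between all nodes)
dlm:            tcp/21064
EOF
# On a live node you could then confirm nothing filters them, e.g.:
#   iptables -L -n
#   netstat -ulpn | grep corosync
```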
_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss