On Mon, Feb 24, 2014 at 11:23 AM, Fabio M. Di Nitto <fdinitto@xxxxxxxxxx> wrote:
> On 2/24/2014 8:47 AM, cluster lab wrote:
>> On Sun, Feb 23, 2014 at 10:40 PM, Fabio M. Di Nitto <fdinitto@xxxxxxxxxx> wrote:
>>> On 02/23/2014 12:59 PM, cluster lab wrote:
>>>>
>>>> On Sat, Feb 22, 2014 at 2:28 PM, Fabio M. Di Nitto <fdinitto@xxxxxxxxxx> wrote:
>>>>
>>>>     On 02/22/2014 11:10 AM, cluster lab wrote:
>>>>     > Hi,
>>>>     >
>>>>     > In the middle of cluster activity I received these messages (the
>>>>     > cluster is a 3-node cluster with SAN storage and a GFS2 filesystem):
>>>>
>>>>     OS? Version of the packages? cluster.conf?
>>>>
>>>> OS: SL (Scientific Linux 6)
>>>>
>>>> Packages:
>>>> kernel-2.6.32-71.29.1.el6.x86_64
>>>> rgmanager-3.0.12.1-12.el6.x86_64
>>>> cman-3.0.12-23.el6.x86_64
>>>> corosynclib-1.2.3-21.el6.x86_64
>>>> corosync-1.2.3-21.el6.x86_64
>>>>
>>>> cluster.conf:
>>>>
>>>> <?xml version="1.0"?>
>>>> <cluster config_version="224" name="USBackCluster">
>>>>     <fence_daemon clean_start="0" post_fail_delay="10" post_join_delay="3"/>
>>>>     <clusternodes>
>>>>         <clusternode name="USBack-prox1" nodeid="1" votes="1">
>>>>             <fence>
>>>>                 <method name="ilo">
>>>>                     <device name="USBack-prox1-ilo"/>
>>>>                 </method>
>>>>             </fence>
>>>>         </clusternode>
>>>>         <clusternode name="USBack-prox2" nodeid="2" votes="1">
>>>>             <fence>
>>>>                 <method name="ilo">
>>>>                     <device name="USBack-prox2-ilo"/>
>>>>                 </method>
>>>>             </fence>
>>>>         </clusternode>
>>>>         <clusternode name="USBack-prox3" nodeid="3" votes="1">
>>>>             <fence>
>>>>                 <method name="ilo">
>>>>                     <device name="USBack-prox3-ilo"/>
>>>>                 </method>
>>>>             </fence>
>>>>         </clusternode>
>>>>     </clusternodes>
>>>>     <cman/>
>>>>     <fencedevices>
>>>>         ... fence config ...
>>>>     </fencedevices>
>>>>     <rm>
>>>>         <failoverdomains>
>>>>             <failoverdomain name="VMS-Area" nofailback="0" ordered="0" restricted="0">
>>>>                 <failoverdomainnode name="USBack-prox1" priority="1"/>
>>>>                 <failoverdomainnode name="USBack-prox2" priority="1"/>
>>>>                 <failoverdomainnode name="USBack-prox3" priority="1"/>
>>>>             </failoverdomain>
>>>>         </failoverdomains>
>>>>         <resources>
>>>>         ....
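(A side note on the fence setup above: since fenced reports "error from
agent" for fence_ipmilan further down, it may be worth exercising the iLO
device by hand, outside of fenced. A minimal sketch; the address and
credentials are placeholders for whatever is in the elided fencedevices
section, and newer iLO generations may also need -P for lanplus:

    # Ask node 1's iLO for its power status through the same agent fenced uses.
    fence_ipmilan -a <prox1-ilo-address> -l <login> -p <password> -o status

    # Or drive the configured agent exactly as the cluster would, via
    # cluster.conf -- note that this really fences (reboots) the node.
    fence_node -vv USBack-prox1

If the status query already fails here, the fencing failure below is an
iLO/credential problem rather than a cluster one.)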
>>>>     > Log messages on USBack-prox2:
>>>>     >
>>>>     > Feb 21 13:06:41 USBack-prox2 corosync[3911]: [QUORUM] Members[2]: 2 3
>>>>     > Feb 21 13:06:41 USBack-prox2 corosync[3911]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>>>     > Feb 21 13:06:41 USBack-prox2 rgmanager[4130]: State change: USBack-prox1 DOWN
>>>>     > Feb 21 13:06:41 USBack-prox2 kernel: dlm: closing connection to node 1
>>>>     > Feb 21 13:06:41 USBack-prox2 corosync[3911]: [CPG ] downlist received left_list: 1
>>>>     > Feb 21 13:06:41 USBack-prox2 corosync[3911]: [CPG ] downlist received left_list: 1
>>>>     > Feb 21 13:06:41 USBack-prox2 corosync[3911]: [CPG ] chosen downlist from node r(0) ip(--.--.--.22)
>>>>     > Feb 21 13:06:41 USBack-prox2 corosync[3911]: [MAIN ] Completed service synchronization, ready to provide service.
>>>>     > Feb 21 13:06:41 USBack-prox2 kernel: GFS2: fsid=USBackCluster:VMStorage1.0: jid=1: Trying to acquire journal lock...
>>>>     > Feb 21 13:06:41 USBack-prox2 kernel: GFS2: fsid=USBackCluster:VMStorage2.0: jid=1: Trying to acquire journal lock...
>>>>     > Feb 21 13:06:51 USBack-prox2 fenced[3957]: fencing node USBack-prox1
>>>>     > Feb 21 13:06:52 USBack-prox2 fenced[3957]: fence USBack-prox1 dev 0.0 agent fence_ipmilan result: error from agent
>>>>     > Feb 21 13:06:52 USBack-prox2 fenced[3957]: fence USBack-prox1 failed
>>>>     > Feb 21 13:06:54 USBack-prox2 kernel: dlm: connect from non cluster node
>>>>     > Feb 21 13:06:54 USBack-prox2 kernel: dlm: connect from non cluster node
>>>>
>>>>     ^^^ Good hint here. Something is off.
>>>>
>>>> ?
>>>
>>> It means that there is something in that network that tries to connect
>>> to a cluster node without being a cluster node.
>>>
>>> Fabio
>>
>> There is no node in the cluster network other than the cluster nodes.
>> I think node #1 retries to reconnect to dlm and can't.
>>
>> There are two tries on node #1:
>> Feb 21 13:06:54 USBack-prox1 kernel: dlm: connecting to 3
>> Feb 21 13:06:54 USBack-prox1 kernel: dlm: connecting to 2
>> Feb 21 13:06:54 USBack-prox1 kernel: dlm: connecting to 2
>
> Can you please check that iptables are set correctly and that traffic
> between nodes is not behind NAT?
>
> Fabio

iptables is disabled, and traffic between the cluster nodes is flat,
without any NAT.
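Even so, it may be worth confirming on the wire that the ports the stack
uses are reachable node to node. A quick sketch; the interface name is a
placeholder (DLM defaults to TCP port 21064, and corosync uses the UDP
ports around the configured mcastport, 5404/5405 by default):

    # Confirm the rule set really is empty on every node.
    iptables -L -n

    # Watch DLM inter-node traffic while the problem reproduces.
    tcpdump -ni eth0 tcp port 21064

    # Watch corosync membership traffic.
    tcpdump -ni eth0 udp port 5404 or udp port 5405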
>> Logs on node #1 (USBack-prox1):
>>
>> Feb 21 13:06:47 USBack-prox1 corosync[3015]: [TOTEM ] A processor failed, forming new configuration.
>> Feb 21 13:06:51 USBack-prox1 kernel: dlm: connecting to 3
>> Feb 21 13:06:54 USBack-prox1 kernel: dlm: connecting to 2
>> Feb 21 13:06:54 USBack-prox1 kernel: dlm: connecting to 2
>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [QUORUM] Members[2]: 1 3
>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [CMAN ] quorum lost, blocking activity
>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [QUORUM] Members[1]: 1
>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [CMAN ] quorum regained, resuming activity
>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [QUORUM] This node is within the primary component and will provide service.
>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [QUORUM] Members[2]: 1 2
>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [QUORUM] Members[2]: 1 2
>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [QUORUM] Members[3]: 1 2 3
>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [QUORUM] Members[3]: 1 2 3
>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [CPG ] downlist received left_list: 2
>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [CPG ] downlist received left_list: 0
>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [CPG ] downlist received left_list: 0
>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [CPG ] chosen downlist from node r(0) ip(--.--.--.21)
>> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [MAIN ] Completed service synchronization, ready to provide service.
>>
>> Logs on node #3 (USBack-prox3):
>>
>> Feb 21 13:06:41 USBack-prox3 corosync[2956]: [QUORUM] Members[2]: 2 3
>> Feb 21 13:06:41 USBack-prox3 corosync[2956]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>> Feb 21 13:06:41 USBack-prox3 rgmanager[3177]: State change: USBack-prox1 DOWN
>> Feb 21 13:06:41 USBack-prox3 corosync[2956]: [CPG ] downlist received left_list: 1
>> Feb 21 13:06:41 USBack-prox3 corosync[2956]: [CPG ] downlist received left_list: 1
>> Feb 21 13:06:41 USBack-prox3 corosync[2956]: [CPG ] chosen downlist from node r(0) ip(--.--.--.22)
>> Feb 21 13:06:41 USBack-prox3 corosync[2956]: [MAIN ] Completed service synchronization, ready to provide service.
>> Feb 21 13:06:41 USBack-prox3 kernel: dlm: closing connection to node 1
>> Feb 21 13:06:41 USBack-prox3 fenced[3008]: fencing deferred to USBack-prox2
>> Feb 21 13:06:41 USBack-prox3 kernel: GFS2: fsid=USBackCluster:VMStorage1.2: jid=1: Trying to acquire journal lock...
>> Feb 21 13:06:41 USBack-prox3 kernel: GFS2: fsid=USBackCluster:VMStorage2.2: jid=1: Trying to acquire journal lock...
>> Feb 21 13:06:51 USBack-prox3 kernel: dlm: connect from non cluster node
>> Feb 21 13:06:55 USBack-prox3 corosync[2956]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>> Feb 21 13:06:55 USBack-prox3 corosync[2956]: [QUORUM] Members[3]: 1 2 3
>> Feb 21 13:06:55 USBack-prox3 corosync[2956]: [QUORUM] Members[3]: 1 2 3
>> Feb 21 13:06:55 USBack-prox3 rgmanager[3177]: State change: USBack-prox1 UP
>> Feb 21 13:06:55 USBack-prox3 corosync[2956]: [CPG ] downlist received left_list: 2
>> Feb 21 13:06:55 USBack-prox3 corosync[2956]: [CPG ] downlist received left_list: 0
>> Feb 21 13:06:55 USBack-prox3 corosync[2956]: [CPG ] downlist received left_list: 0
>> Feb 21 13:06:55 USBack-prox3 corosync[2956]: [CPG ] chosen downlist from node r(0) ip(--.--.--.21)
>> Feb 21 13:06:55 USBack-prox3 corosync[2956]: [MAIN ] Completed service synchronization, ready to provide service.
>> Feb 21 13:06:55 USBack-prox3 fenced[3008]: cpg_mcast_joined error 12 handle 4e6afb6600000000 protocol
>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 3a95f87400000000 protocol
>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 1e7ff52100000001 start
>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 22221a7000000002 start
>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 419ac24100000003 start
>> Feb 21 13:06:55 USBack-prox3 fenced[3008]: cpg_mcast_joined error 12 handle 440badfc00000001 start
>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 3804823e00000004 start
>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 2463b9ea00000005 start
>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 22221a7000000002 start
>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 419ac24100000003 start
>> Feb 21 13:06:55 USBack-prox3 dlm_controld[3034]: cpg_mcast_joined error 12 handle 440badfc00000000 protocol
>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 3804823e00000004 start
>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 2463b9ea00000005 start
>> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 1e7ff52100000001 start
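(The cpg_mcast_joined failures above look like fenced/dlm_controld/
gfs_controld losing their corosync CPG connection during the rejoin; if I
read corosync's cs_error_t right, error 12 corresponds to CS_ERR_NOT_EXIST.
After an event like this it is worth checking whether the daemons re-formed
their groups cleanly; a minimal sketch with the standard cman-era tools:

    # Membership and quorum as cman sees them.
    cman_tool status
    cman_tool nodes

    # State of the fence, dlm and gfs groups; entries stuck in a
    # change/wait state mean the daemons never recovered.
    group_tool ls
    fence_tool ls
    dlm_tool ls
)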
>>>>
>>>>     Fabio
>>>>
>>>>     > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>>>     > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [QUORUM] Members[3]: 1 2 3
>>>>     > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [QUORUM] Members[3]: 1 2 3
>>>>     > Feb 21 13:06:55 USBack-prox2 rgmanager[4130]: State change: USBack-prox1 UP
>>>>     > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [CPG ] downlist received left_list: 2
>>>>     > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [CPG ] downlist received left_list: 0
>>>>     > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [CPG ] downlist received left_list: 0
>>>>     > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [CPG ] chosen downlist from node r(0) ip(--.--.--.21)
>>>>     > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [MAIN ] Completed service synchronization, ready to provide service.
>>>>     > Feb 21 13:06:55 USBack-prox2 gfs_controld[4029]: cpg_mcast_joined error 12 handle 3a95f87400000000 protocol
>>>>     > Feb 21 13:06:55 USBack-prox2 gfs_controld[4029]: cpg_mcast_joined error 12 handle 1e7ff52100000001 start
>>>>     > Feb 21 13:06:55 USBack-prox2 gfs_controld[4029]: cpg_mcast_joined error 12 handle 22221a7000000002 start
>>>>     > Feb 21 13:06:55 USBack-prox2 gfs_controld[4029]: cpg_mcast_joined error 12 handle 419ac24100000003 start
>>>>     > Feb 21 13:06:55 USBack-prox2 gfs_controld[4029]: cpg_mcast_joined error 12 handle 3804823e00000004 start
>>>>     >
>>>>     > -------------------------------------------------
>>>>     > Then GFS2 generates error logs (activity is blocked).
>>>>     >
>>>>     > Logs of the Cisco switch (time is UTC):
>>>>     >
>>>>     > Feb 21 09:37:02.375: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/11, changed state to down
>>>>     > Feb 21 09:37:02.459: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/4, changed state to down
>>>>     > Feb 21 09:37:03.382: %LINK-3-UPDOWN: Interface GigabitEthernet0/11, changed state to down
>>>>     > Feb 21 09:37:03.541: %LINK-3-UPDOWN: Interface GigabitEthernet0/4, changed state to down
>>>>     > Feb 21 09:37:07.283: %LINK-3-UPDOWN: Interface GigabitEthernet0/11, changed state to up
>>>>     > Feb 21 09:37:07.350: %LINK-3-UPDOWN: Interface GigabitEthernet0/4, changed state to up
>>>>     > Feb 21 09:37:08.289: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/11, changed state to up
>>>>     > Feb 21 09:37:09.472: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/4, changed state to up
>>>>     > Feb 21 09:40:20.045: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/11, changed state to down
>>>>     > Feb 21 09:40:21.043: %LINK-3-UPDOWN: Interface GigabitEthernet0/11, changed state to down
>>>>     > Feb 21 09:40:23.401: %LINK-3-UPDOWN: Interface GigabitEthernet0/11, changed state to up

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss