On 2/24/2014 8:47 AM, cluster lab wrote:
> On Sun, Feb 23, 2014 at 10:40 PM, Fabio M. Di Nitto <fdinitto@xxxxxxxxxx> wrote:
>> On 02/23/2014 12:59 PM, cluster lab wrote:
>>>
>>> On Sat, Feb 22, 2014 at 2:28 PM, Fabio M. Di Nitto <fdinitto@xxxxxxxxxx> wrote:
>>>
>>> On 02/22/2014 11:10 AM, cluster lab wrote:
>>> > Hi,
>>> >
>>> > In the middle of cluster activity I received these messages (the cluster
>>> > is 3 nodes with a SAN ... GFS2 filesystem):
>>>
>>> OS? version of the packages? cluster.conf
>>>
>>> OS: SL (Scientific Linux 6)
>>>
>>> Packages:
>>> kernel-2.6.32-71.29.1.el6.x86_64
>>> rgmanager-3.0.12.1-12.el6.x86_64
>>> cman-3.0.12-23.el6.x86_64
>>> corosynclib-1.2.3-21.el6.x86_64
>>> corosync-1.2.3-21.el6.x86_64
>>>
>>> Cluster.conf:
>>>
>>> <?xml version="1.0"?>
>>> <cluster config_version="224" name="USBackCluster">
>>>     <fence_daemon clean_start="0" post_fail_delay="10" post_join_delay="3"/>
>>>     <clusternodes>
>>>         <clusternode name="USBack-prox1" nodeid="1" votes="1">
>>>             <fence>
>>>                 <method name="ilo">
>>>                     <device name="USBack-prox1-ilo"/>
>>>                 </method>
>>>             </fence>
>>>         </clusternode>
>>>         <clusternode name="USBack-prox2" nodeid="2" votes="1">
>>>             <fence>
>>>                 <method name="ilo">
>>>                     <device name="USBack-prox2-ilo"/>
>>>                 </method>
>>>             </fence>
>>>         </clusternode>
>>>         <clusternode name="USBack-prox3" nodeid="3" votes="1">
>>>             <fence>
>>>                 <method name="ilo">
>>>                     <device name="USBack-prox3-ilo"/>
>>>                 </method>
>>>             </fence>
>>>         </clusternode>
>>>     </clusternodes>
>>>     <cman/>
>>>     <fencedevices>
>>>         ... fence config ...
>>>     </fencedevices>
>>>     <rm>
>>>         <failoverdomains>
>>>             <failoverdomain name="VMS-Area" nofailback="0" ordered="0" restricted="0">
>>>                 <failoverdomainnode name="USBack-prox1" priority="1"/>
>>>                 <failoverdomainnode name="USBack-prox2" priority="1"/>
>>>                 <failoverdomainnode name="USBack-prox3" priority="1"/>
>>>             </failoverdomain>
>>>         </failoverdomains>
>>>         <resources>
>>>         ....
>>>
>>> >
>>> > Log messages on USBack-prox2:
>>> >
>>> > Feb 21 13:06:41 USBack-prox2 corosync[3911]: [QUORUM] Members[2]: 2 3
>>> > Feb 21 13:06:41 USBack-prox2 corosync[3911]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>> > Feb 21 13:06:41 USBack-prox2 rgmanager[4130]: State change: USBack-prox1 DOWN
>>> > Feb 21 13:06:41 USBack-prox2 kernel: dlm: closing connection to node 1
>>> > Feb 21 13:06:41 USBack-prox2 corosync[3911]: [CPG ] downlist received left_list: 1
>>> > Feb 21 13:06:41 USBack-prox2 corosync[3911]: [CPG ] downlist received left_list: 1
>>> > Feb 21 13:06:41 USBack-prox2 corosync[3911]: [CPG ] chosen downlist from node r(0) ip(--.--.--.22)
>>> > Feb 21 13:06:41 USBack-prox2 corosync[3911]: [MAIN ] Completed service synchronization, ready to provide service.
>>> > Feb 21 13:06:41 USBack-prox2 kernel: GFS2: fsid=USBackCluster:VMStorage1.0: jid=1: Trying to acquire journal lock...
>>> > Feb 21 13:06:41 USBack-prox2 kernel: GFS2: fsid=USBackCluster:VMStorage2.0: jid=1: Trying to acquire journal lock...
>>> > Feb 21 13:06:51 USBack-prox2 fenced[3957]: fencing node USBack-prox1
>>> > Feb 21 13:06:52 USBack-prox2 fenced[3957]: fence USBack-prox1 dev 0.0 agent fence_ipmilan result: error from agent
>>> > Feb 21 13:06:52 USBack-prox2 fenced[3957]: fence USBack-prox1 failed
>>> > Feb 21 13:06:54 USBack-prox2 kernel: dlm: connect from non cluster node
>>> > Feb 21 13:06:54 USBack-prox2 kernel: dlm: connect from non cluster node
>>>
>>> ^^^ good hint here. something is off.
>>>
>>> ?
>>
>> It means that there is something in that network that tries to connect
>> to the cluster node, without being a cluster node.
>>
>> Fabio
>
> There is no node in the cluster network other than the cluster nodes;
> I think node #1 retries to reconnect to DLM and can't.
>
> There are two tries on node #1:
> Feb 21 13:06:54 USBack-prox1 kernel: dlm: connecting to 3
> Feb 21 13:06:54 USBack-prox1 kernel: dlm: connecting to 2
> Feb 21 13:06:54 USBack-prox1 kernel: dlm: connecting to 2

Can you please check that iptables are set correctly and that traffic
between the nodes is not behind NAT? There is a rough sketch of the checks
I mean at the very end of this mail, below the quoted logs.

Fabio

>
> Logs on Node#1:
> Feb 21 13:06:47 USBack-prox1 corosync[3015]: [TOTEM ] A processor failed, forming new configuration.
> Feb 21 13:06:51 USBack-prox1 kernel: dlm: connecting to 3
> Feb 21 13:06:54 USBack-prox1 kernel: dlm: connecting to 2
> Feb 21 13:06:54 USBack-prox1 kernel: dlm: connecting to 2
> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [QUORUM] Members[2]: 1 3
> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [CMAN ] quorum lost, blocking activity
> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [QUORUM] This node is within the non-primary component and will NOT provide any services.
> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [QUORUM] Members[1]: 1
> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [CMAN ] quorum regained, resuming activity
> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [QUORUM] This node is within the primary component and will provide service.
> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [QUORUM] Members[2]: 1 2
> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [QUORUM] Members[2]: 1 2
> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [QUORUM] Members[3]: 1 2 3
> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [QUORUM] Members[3]: 1 2 3
> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [CPG ] downlist received left_list: 2
> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [CPG ] downlist received left_list: 0
> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [CPG ] downlist received left_list: 0
> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [CPG ] chosen downlist from node r(0) ip(--.--.--.21)
> Feb 21 13:06:55 USBack-prox1 corosync[3015]: [MAIN ] Completed service synchronization, ready to provide service.
>
> Logs on Node#3:
> Feb 21 13:06:41 USBack-prox3 corosync[2956]: [QUORUM] Members[2]: 2 3
> Feb 21 13:06:41 USBack-prox3 corosync[2956]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Feb 21 13:06:41 USBack-prox3 rgmanager[3177]: State change: USBack-prox1 DOWN
> Feb 21 13:06:41 USBack-prox3 corosync[2956]: [CPG ] downlist received left_list: 1
> Feb 21 13:06:41 USBack-prox3 corosync[2956]: [CPG ] downlist received left_list: 1
> Feb 21 13:06:41 USBack-prox3 corosync[2956]: [CPG ] chosen downlist from node r(0) ip(--.--.--.22)
> Feb 21 13:06:41 USBack-prox3 corosync[2956]: [MAIN ] Completed service synchronization, ready to provide service.
> Feb 21 13:06:41 USBack-prox3 kernel: dlm: closing connection to node 1
> Feb 21 13:06:41 USBack-prox3 fenced[3008]: fencing deferred to USBack-prox2
> Feb 21 13:06:41 USBack-prox3 kernel: GFS2: fsid=USBackCluster:VMStorage1.2: jid=1: Trying to acquire journal lock...
> Feb 21 13:06:41 USBack-prox3 kernel: GFS2: fsid=USBackCluster:VMStorage2.2: jid=1: Trying to acquire journal lock...
> Feb 21 13:06:51 USBack-prox3 kernel: dlm: connect from non cluster node
> Feb 21 13:06:55 USBack-prox3 corosync[2956]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Feb 21 13:06:55 USBack-prox3 corosync[2956]: [QUORUM] Members[3]: 1 2 3
> Feb 21 13:06:55 USBack-prox3 corosync[2956]: [QUORUM] Members[3]: 1 2 3
> Feb 21 13:06:55 USBack-prox3 rgmanager[3177]: State change: USBack-prox1 UP
> Feb 21 13:06:55 USBack-prox3 corosync[2956]: [CPG ] downlist received left_list: 2
> Feb 21 13:06:55 USBack-prox3 corosync[2956]: [CPG ] downlist received left_list: 0
> Feb 21 13:06:55 USBack-prox3 corosync[2956]: [CPG ] downlist received left_list: 0
> Feb 21 13:06:55 USBack-prox3 corosync[2956]: [CPG ] chosen downlist from node r(0) ip(--.--.--.21)
> Feb 21 13:06:55 USBack-prox3 corosync[2956]: [MAIN ] Completed service synchronization, ready to provide service.
> Feb 21 13:06:55 USBack-prox3 fenced[3008]: cpg_mcast_joined error 12 handle 4e6afb6600000000 protocol
> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 3a95f87400000000 protocol
> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 1e7ff52100000001 start
> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 22221a7000000002 start
> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 419ac24100000003 start
> Feb 21 13:06:55 USBack-prox3 fenced[3008]: cpg_mcast_joined error 12 handle 440badfc00000001 start
> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 3804823e00000004 start
> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 2463b9ea00000005 start
> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 22221a7000000002 start
> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 419ac24100000003 start
> Feb 21 13:06:55 USBack-prox3 dlm_controld[3034]: cpg_mcast_joined error 12 handle 440badfc00000000 protocol
> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 3804823e00000004 start
> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 2463b9ea00000005 start
> Feb 21 13:06:55 USBack-prox3 gfs_controld[3062]: cpg_mcast_joined error 12 handle 1e7ff52100000001 start
>
>>
>>>
>>> Fabio
>>>
>>> > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>> > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [QUORUM] Members[3]: 1 2 3
>>> > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [QUORUM] Members[3]: 1 2 3
>>> > Feb 21 13:06:55 USBack-prox2 rgmanager[4130]: State change: USBack-prox1 UP
>>> > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [CPG ] downlist received left_list: 2
>>> > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [CPG ] downlist received left_list: 0
>>> > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [CPG ] downlist received left_list: 0
>>> > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [CPG ] chosen downlist from node r(0) ip(--.--.--.21)
>>> > Feb 21 13:06:55 USBack-prox2 corosync[3911]: [MAIN ] Completed service synchronization, ready to provide service.
>>> > Feb 21 13:06:55 USBack-prox2 gfs_controld[4029]: cpg_mcast_joined error 12 handle 3a95f87400000000 protocol
>>> > Feb 21 13:06:55 USBack-prox2 gfs_controld[4029]: cpg_mcast_joined error 12 handle 1e7ff52100000001 start
>>> > Feb 21 13:06:55 USBack-prox2 gfs_controld[4029]: cpg_mcast_joined error 12 handle 22221a7000000002 start
>>> > Feb 21 13:06:55 USBack-prox2 gfs_controld[4029]: cpg_mcast_joined error 12 handle 419ac24100000003 start
>>> > Feb 21 13:06:55 USBack-prox2 gfs_controld[4029]: cpg_mcast_joined error 12 handle 3804823e00000004 start
>>> >
>>> > -------------------------------------------------
>>> > Then GFS2 generates error logs (Activities blocked).
>>> >
>>> > Logs of cisco switch (Time is UTC):
>>> >
>>> > Feb 21 09:37:02.375: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/11, changed state to down
>>> > Feb 21 09:37:02.459: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/4, changed state to down
>>> > Feb 21 09:37:03.382: %LINK-3-UPDOWN: Interface GigabitEthernet0/11, changed state to down
>>> > Feb 21 09:37:03.541: %LINK-3-UPDOWN: Interface GigabitEthernet0/4, changed state to down
>>> > Feb 21 09:37:07.283: %LINK-3-UPDOWN: Interface GigabitEthernet0/11, changed state to up
>>> > Feb 21 09:37:07.350: %LINK-3-UPDOWN: Interface GigabitEthernet0/4, changed state to up
>>> > Feb 21 09:37:08.289: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/11, changed state to up
>>> > Feb 21 09:37:09.472: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/4, changed state to up
>>> > Feb 21 09:40:20.045: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/11, changed state to down
>>> > Feb 21 09:40:21.043: %LINK-3-UPDOWN: Interface GigabitEthernet0/11, changed state to down
>>> > Feb 21 09:40:23.401: %LINK-3-UPDOWN: Interface GigabitEthernet0/11, changed state to up

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss
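
A rough sketch of the iptables/NAT checks mentioned above. This assumes the
documented default ports for a RHEL 6 / SL 6 cluster (corosync/cman on UDP
5404 and 5405, dlm on TCP 21064) and uses eth0 / 10.10.10.0/24 as mere
placeholders for your real cluster interface and subnet; adjust everything
to your own environment before running it.

# 1) On every node, look at the rules that are actually loaded:
iptables -L INPUT -n -v
iptables -S | grep -E '5404|5405|21064'

# 2) If no rule matches the cluster ports, open them on the cluster
#    interface (placeholder interface/subnet -- adjust as needed):
iptables -I INPUT -i eth0 -s 10.10.10.0/24 -p udp -m multiport --dports 5404,5405 -j ACCEPT   # corosync/cman
iptables -I INPUT -i eth0 -s 10.10.10.0/24 -p tcp --dport 21064 -j ACCEPT                      # dlm
service iptables save

# 3) Check which address every node is known by; each node must see the
#    real, non-NATed address of its peers on the cluster network:
cman_tool nodes -a
corosync-cfgtool -s

# 4) From each node, confirm the dlm port of the peers is reachable at all
#    (a plain TCP connect is enough to rule out a firewall drop; close it
#    again right away):
telnet USBack-prox1 21064

If all of that looks sane and the "dlm: connect from non cluster node"
messages still appear, a tcpdump on TCP port 21064 compared against the
addresses reported by cman_tool should show whether something on the path
is rewriting the source addresses.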