> This is just weird. What exact version of corosync are you running? Do you have latest Z stream?

I am running Corosync 1.4.1, and the pacemaker version is 1.1.8-7.el6.

Thanks
Lax

-----Original Message-----
From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Jan Friesse
Sent: Friday, October 31, 2014 9:43 AM
To: linux clustering
Subject: Re: daemon cpg_join error retrying

Lax,

> Thanks Honza. Here is what I was doing,
>
>> usual reasons for this problem:
>> 1. mtu is too high and fragmented packets are not enabled (take a look at the netmtu configuration option)
> I am running with the default mtu setting, which is 1500, and my interface (eth1) on the box has an MTU of 1500 too.
>

Keep in mind that if they are not directly connected, a switch can drop packets because of MTU.

>
>> 2. config files on the nodes are not in sync and one node may contain more node entries than other nodes (this may also be the case if you have two clusters and one cluster contains an entry for a node of the other cluster)
>> 3. firewall is asymmetrically blocked (so node can send but not receive). Also keep in mind that ports 5404 & 5405 may not be enough for udpu, because udpu uses one socket per remote node for sending.
> Verified my config files cluster.conf and cib.xml, and both have the same number of node entries (2).
>
>> I would recommend disabling the firewall completely (for testing); if everything then works, you just need to adjust the firewall.
> I also ran tests with the firewall off on both participating nodes and still see the same issue.
>
> In the corosync log I see a repeated set of these messages; hoping they will give some more pointers.
>
> Oct 29 22:11:02 corosync [SYNC ] Committing synchronization for (corosync cluster closed process group service v1.01)
> Oct 29 22:11:02 corosync [MAIN ] Completed service synchronization, ready to provide service.
> Oct 29 22:11:02 corosync [TOTEM ] waiting_trans_ack changed to 0
> Oct 29 22:11:03 corosync [TOTEM ] entering GATHER state from 11.
> Oct 29 22:11:03 corosync [TOTEM ] entering GATHER state from 10.
> Oct 29 22:11:05 corosync [TOTEM ] entering GATHER state from 0.

This is just weird. What exact version of corosync are you running? Do you have latest Z stream?

Regards,
  Honza

> Oct 29 22:11:05 corosync [TOTEM ] got commit token
> Oct 29 22:11:05 corosync [TOTEM ] Saving state aru 1b high seq received 1b
> Oct 29 22:11:05 corosync [TOTEM ] Storing new sequence id for ring 51708
> Oct 29 22:11:05 corosync [TOTEM ] entering COMMIT state.
> Oct 29 22:11:05 corosync [TOTEM ] got commit token
> Oct 29 22:11:05 corosync [TOTEM ] entering RECOVERY state.
> Oct 29 22:11:05 corosync [TOTEM ] TRANS [0] member 172.28.0.64:
> Oct 29 22:11:05 corosync [TOTEM ] TRANS [1] member 172.28.0.65:
> Oct 29 22:11:05 corosync [TOTEM ] position [0] member 172.28.0.64:
> Oct 29 22:11:05 corosync [TOTEM ] previous ring seq 333572 rep 172.28.0.64
> Oct 29 22:11:05 corosync [TOTEM ] aru 1b high delivered 1b received flag 1
> Oct 29 22:11:05 corosync [TOTEM ] position [1] member 172.28.0.65:
> Oct 29 22:11:05 corosync [TOTEM ] previous ring seq 333572 rep 172.28.0.64
> Oct 29 22:11:05 corosync [TOTEM ] aru 1b high delivered 1b received flag 1
> Oct 29 22:11:05 corosync [TOTEM ] Did not need to originate any messages in recovery.
> Oct 29 22:11:05 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 0, aru ffffffff
> Oct 29 22:11:05 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
> Oct 29 22:11:05 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 1, aru 0
> Oct 29 22:11:05 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
> Oct 29 22:11:05 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 2, aru 0
> Oct 29 22:11:05 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
> Oct 29 22:11:05 corosync [TOTEM ] token retrans flag is 0 my set retrans flag0 retrans queue empty 1 count 3, aru 0
> Oct 29 22:11:05 corosync [TOTEM ] install seq 0 aru 0 high seq received 0
> Oct 29 22:11:05 corosync [TOTEM ] retrans flag count 4 token aru 0 install seq 0 aru 0 0
> Oct 29 22:11:05 corosync [TOTEM ] Resetting old ring state
> Oct 29 22:11:05 corosync [TOTEM ] recovery to regular 1-0
> Oct 29 22:11:05 corosync [CMAN ] ais: confchg_fn called type = 1, seq=333576
> Oct 29 22:11:05 corosync [TOTEM ] waiting_trans_ack changed to 1
> Oct 29 22:11:05 corosync [CMAN ] ais: confchg_fn called type = 0, seq=333576
> Oct 29 22:11:05 corosync [CMAN ] ais: last memb_count = 2, current = 2
> Oct 29 22:11:05 corosync [CMAN ] memb: sending TRANSITION message. cluster_name = vsomcluster
> Oct 29 22:11:05 corosync [CMAN ] ais: comms send message 0x7fff8185ca00 len = 65
> Oct 29 22:11:05 corosync [CMAN ] daemon: sending reply 103 to fd 24
> Oct 29 22:11:05 corosync [CMAN ] daemon: sending reply 103 to fd 34
> Oct 29 22:11:05 corosync [SYNC ] This node is within the primary component and will provide service.
> Oct 29 22:11:05 corosync [TOTEM ] entering OPERATIONAL state.
> Oct 29 22:11:05 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
> Oct 29 22:11:05 corosync [CMAN ] ais: deliver_fn source nodeid = 2, len=81, endian_conv=0
> Oct 29 22:11:05 corosync [CMAN ] memb: Message on port 0 is 5
> Oct 29 22:11:05 corosync [CMAN ] memb: got TRANSITION from node 2
> Oct 29 22:11:05 corosync [CMAN ] memb: Got TRANSITION message. msg->flags=20, node->flags=20, first_trans=0
> Oct 29 22:11:05 corosync [CMAN ] memb: add_ais_node ID=2, incarnation = 333576
> Oct 29 22:11:05 corosync [SYNC ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 2
> Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 0.
> Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 1.
> Oct 29 22:11:05 corosync [CMAN ] ais: deliver_fn source nodeid = 1, len=81, endian_conv=0
> Oct 29 22:11:05 corosync [CMAN ] memb: Message on port 0 is 5
> Oct 29 22:11:05 corosync [CMAN ] memb: got TRANSITION from node 1
> Oct 29 22:11:05 corosync [CMAN ] Completed first transition with nodes on the same config versions
> Oct 29 22:11:05 corosync [CMAN ] memb: Got TRANSITION message. msg->flags=20, node->flags=20, first_trans=0
> Oct 29 22:11:05 corosync [CMAN ] memb: add_ais_node ID=1, incarnation = 333576
> Oct 29 22:11:05 corosync [SYNC ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 1
> Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 1.
> Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 1.
> Oct 29 22:11:05 corosync [SYNC ] Synchronization barrier completed
> Oct 29 22:11:05 corosync [SYNC ] Synchronization actions starting for (dummy CLM service)
> Oct 29 22:11:05 corosync [SYNC ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 1
> Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 1.
> Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 0.
> Oct 29 22:11:05 corosync [SYNC ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 2
> Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 1.
> Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 1.
> Oct 29 22:11:05 corosync [SYNC ] Synchronization barrier completed
> Oct 29 22:11:05 corosync [SYNC ] Committing synchronization for (dummy CLM service)
> Oct 29 22:11:05 corosync [SYNC ] Synchronization actions starting for (dummy AMF service)
> Oct 29 22:11:05 corosync [SYNC ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 2
> Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 0.
> Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 1.
> Oct 29 22:11:05 corosync [SYNC ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 1
> Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 1.
> Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 1.
> Oct 29 22:11:05 corosync [SYNC ] Synchronization barrier completed
> Oct 29 22:11:05 corosync [SYNC ] Committing synchronization for (dummy AMF service)
> Oct 29 22:11:05 corosync [SYNC ] Synchronization actions starting for (openais checkpoint service B.01.01)
> Oct 29 22:11:05 corosync [SYNC ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 1
> Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 1.
> Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 0.
> Oct 29 22:11:05 corosync [SYNC ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 2
> Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 1.
> Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 1.
> Oct 29 22:11:05 corosync [SYNC ] Synchronization barrier completed
> Oct 29 22:11:05 corosync [SYNC ] Committing synchronization for (openais checkpoint service B.01.01)
> Oct 29 22:11:05 corosync [SYNC ] Synchronization actions starting for (dummy EVT service)
> Oct 29 22:11:05 corosync [SYNC ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 2
> Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 0.
> Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 1.
> Oct 29 22:11:05 corosync [SYNC ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 1
> Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 1.
> Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 1.
> Oct 29 22:11:05 corosync [SYNC ] Synchronization barrier completed
> Oct 29 22:11:05 corosync [SYNC ] Committing synchronization for (dummy EVT service)
> Oct 29 22:11:05 corosync [SYNC ] Synchronization actions starting for (corosync cluster closed process group service v1.01)
> Oct 29 22:11:05 corosync [CPG ] got joinlist message from node 1
> Oct 29 22:11:05 corosync [CPG ] comparing: sender r(0) ip(172.28.0.65) ; members(old:2 left:0)
> Oct 29 22:11:05 corosync [CPG ] comparing: sender r(0) ip(172.28.0.64) ; members(old:2 left:0)
> Oct 29 22:11:05 corosync [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0)
> Oct 29 22:11:05 corosync [CPG ] got joinlist message from node 2
> Oct 29 22:11:05 corosync [SYNC ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 1
> Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 1.
> Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 0.
> Oct 29 22:11:05 corosync [SYNC ] confchg entries 2
> Oct 29 22:11:05 corosync [SYNC ] Barrier Start Received From 2
> Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 1 = 1.
> Oct 29 22:11:05 corosync [SYNC ] Barrier completion status for nodeid 2 = 1.
> Oct 29 22:11:05 corosync [SYNC ] Synchronization barrier completed
> Oct 29 22:11:05 corosync [CPG ] joinlist_messages[0] group:crmd\x00, ip:r(0) ip(172.28.0.65) , pid:9198
> Oct 29 22:11:05 corosync [CPG ] joinlist_messages[1] group:attrd\x00, ip:r(0) ip(172.28.0.65) , pid:9196
> Oct 29 22:11:05 corosync [CPG ] joinlist_messages[2] group:stonith-ng\x00, ip:r(0) ip(172.28.0.65) , pid:9194
> Oct 29 22:11:05 corosync [CPG ] joinlist_messages[3] group:cib\x00, ip:r(0) ip(172.28.0.65) , pid:9193
> Oct 29 22:11:05 corosync [CPG ] joinlist_messages[4] group:pcmk\x00, ip:r(0) ip(172.28.0.65) , pid:9187
> Oct 29 22:11:05 corosync [CPG ] joinlist_messages[5] group:gfs:controld\x00, ip:r(0) ip(172.28.0.65) , pid:9111
> Oct 29 22:11:05 corosync [CPG ] joinlist_messages[6] group:dlm:controld\x00, ip:r(0) ip(172.28.0.65) , pid:9057
> Oct 29 22:11:05 corosync [CPG ] joinlist_messages[7] group:fenced:default\x00, ip:r(0) ip(172.28.0.65) , pid:9040
> Oct 29 22:11:05 corosync [CPG ] joinlist_messages[8] group:fenced:daemon\x00, ip:r(0) ip(172.28.0.65) , pid:9040
> Oct 29 22:11:05 corosync [CPG ] joinlist_messages[9] group:crmd\x00, ip:r(0) ip(172.28.0.64) , pid:14530
> Oct 29 22:11:05 corosync [SYNC ] Committing synchronization for (corosync cluster closed process group service v1.01)
> Oct 29 22:11:05 corosync [MAIN ] Completed service synchronization, ready to provide service.
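
The debug log above walks through one full TOTEM cycle (GATHER, COMMIT, RECOVERY, OPERATIONAL). When that cycle repeats, it can help to pull only the state changes out of the raw log and look at the gaps between them. What follows is a minimal Python sketch under stated assumptions, not something posted in the thread: it expects one log entry per line in the format shown above (the unwrapped corosync log file, not a mail-wrapped copy), the names `totem_transitions` and `seconds_between` are hypothetical, and the `year` default is a guess needed only because the log lines carry no year.

```python
import re
from datetime import datetime

# Matches entries such as:
#   Oct 29 22:11:05 corosync [TOTEM ] entering GATHER state from 0.
STATE_RE = re.compile(
    r"(?P<ts>[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}) .*?\[TOTEM *\] "
    r"entering (?P<state>[A-Z]+) state"
)


def totem_transitions(lines, year=2014):
    """Return a list of (datetime, state) for every TOTEM state change."""
    found = []
    for line in lines:
        match = STATE_RE.search(line)
        if match:
            # The log has no year, so the assumed one is prepended here.
            when = datetime.strptime(
                f"{year} {match.group('ts')}", "%Y %b %d %H:%M:%S"
            )
            found.append((when, match.group("state")))
    return found


def seconds_between(transitions, state="GATHER"):
    """Return the gaps, in seconds, between successive entries into `state`."""
    times = [when for when, name in transitions if name == state]
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


# Sample entries copied from the log above.
sample = [
    "Oct 29 22:11:03 corosync [TOTEM ] entering GATHER state from 11.",
    "Oct 29 22:11:03 corosync [TOTEM ] entering GATHER state from 10.",
    "Oct 29 22:11:05 corosync [TOTEM ] entering GATHER state from 0.",
    "Oct 29 22:11:05 corosync [TOTEM ] got commit token",
    "Oct 29 22:11:05 corosync [TOTEM ] entering COMMIT state.",
    "Oct 29 22:11:05 corosync [TOTEM ] entering RECOVERY state.",
    "Oct 29 22:11:05 corosync [TOTEM ] entering OPERATIONAL state.",
]

if __name__ == "__main__":
    for when, state in totem_transitions(sample):
        print(when.time(), state)
    print("gaps between GATHER entries:", seconds_between(totem_transitions(sample)))
```

On a longer log, calling `seconds_between(transitions, state="OPERATIONAL")` gives the interval at which the membership is re-formed; a steady interval, like the 4-second flood described later in the thread, is the pattern to look for.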
>
> Thanks
> Lax
>
>
> -----Original Message-----
> From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Jan Friesse
> Sent: Thursday, October 30, 2014 1:23 AM
> To: linux clustering
> Subject: Re: daemon cpg_join error retrying
>
>>
>>> On 30 Oct 2014, at 9:32 am, Lax Kota (lkota) <lkota@xxxxxxxxx> wrote:
>>>
>>>
>>>>> I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with.
>>>>> How do I check the cluster name of a GFS file system? I have a similar configuration running fine in multiple other setups with no such issue.
>>>
>>>> I don't really recall. Hopefully someone more familiar with GFS2 can chime in.
>>> Ok.
>>>
>>>>>
>>>>> One more issue: in another setup I am seeing a repeated flood of 'A processor joined or left the membership and a new membership was formed' messages every 4 seconds. I am running with default TOTEM settings, with the token timeout at 10 seconds. Even after I increase the token and consensus values, the same message keeps flooding, now at the new consensus interval (e.g. if I increase it to 10 seconds, I see new-membership messages every 10 seconds).
>>>>>
>>>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0)
>>>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service.
>>>>>
>>>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0)
>>>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service.
>>>
>>>> It does not sound like your network is particularly healthy.
>>>> Are you using multicast or udpu? If multicast, it might be worth trying udpu
>>>
>>> I am using udpu, and I also have the firewall opened for ports 5404 & 5405. Tcpdump looks fine too; it does not show any issues. This is a VM environment, and even if I switch to the other node within the same VM I keep getting the same failure.
>>
>> Depending on what the host and VMs are doing, that might be your problem.
>> In any case, I will defer to the corosync guys at this point.
>>
>
> Lax,
> usual reasons for this problem:
> 1. mtu is too high and fragmented packets are not enabled (take a look at the netmtu configuration option)
> 2. config files on the nodes are not in sync and one node may contain more node entries than other nodes (this may also be the case if you have two clusters and one cluster contains an entry for a node of the other cluster)
> 3. firewall is asymmetrically blocked (so node can send but not receive). Also keep in mind that ports 5404 & 5405 may not be enough for udpu, because udpu uses one socket per remote node for sending.
>
> I would recommend disabling the firewall completely (for testing); if everything then works, you just need to adjust the firewall.
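
Point 2 in the list above (config files out of sync between nodes) can be checked by script rather than by eye. This is a minimal Python sketch, assuming the usual cluster.conf layout: a `<cluster>` root element with `name` and `config_version` attributes, and `<clusternode name=... nodeid=...>` children. The helper names `summarize` and `compare`, the node names, and the version numbers in the sample are hypothetical; only the cluster name `vsomcluster` comes from the log in this thread.

```python
import xml.etree.ElementTree as ET


def summarize(cluster_conf_xml):
    """Return (cluster name, config_version, {nodeid: node name})."""
    root = ET.fromstring(cluster_conf_xml)
    nodes = {n.get("nodeid"): n.get("name") for n in root.iter("clusternode")}
    return root.get("name"), root.get("config_version"), nodes


def compare(conf_a, conf_b):
    """List human-readable differences between two cluster.conf documents."""
    name_a, version_a, nodes_a = summarize(conf_a)
    name_b, version_b, nodes_b = summarize(conf_b)
    problems = []
    if name_a != name_b:
        problems.append(f"cluster name differs: {name_a!r} vs {name_b!r}")
    if version_a != version_b:
        problems.append(f"config_version differs: {version_a} vs {version_b}")
    # Walk the union of node ids so an entry present on only one side shows up.
    for nodeid in sorted(set(nodes_a) | set(nodes_b), key=str):
        if nodes_a.get(nodeid) != nodes_b.get(nodeid):
            problems.append(
                f"nodeid {nodeid}: {nodes_a.get(nodeid)!r} vs {nodes_b.get(nodeid)!r}"
            )
    return problems


# Hypothetical sample: the second copy carries one extra node entry.
CONF_A = """<cluster name="vsomcluster" config_version="3">
  <clusternodes>
    <clusternode name="node-a" nodeid="1"/>
    <clusternode name="node-b" nodeid="2"/>
  </clusternodes>
</cluster>"""

CONF_B = """<cluster name="vsomcluster" config_version="4">
  <clusternodes>
    <clusternode name="node-a" nodeid="1"/>
    <clusternode name="node-b" nodeid="2"/>
    <clusternode name="node-c" nodeid="3"/>
  </clusternodes>
</cluster>"""

if __name__ == "__main__":
    for problem in compare(CONF_A, CONF_B):
        print(problem)
```

In practice each node's copy of the file would be read from disk and passed in as a string. An empty result only says the node lists and versions match; it does not rule out the MTU or firewall causes.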
>
> Regards,
> Honza
>
>
>
>>>
>>> Thanks
>>> Lax
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Andrew Beekhof
>>> Sent: Wednesday, October 29, 2014 3:17 PM
>>> To: linux clustering
>>> Subject: Re: daemon cpg_join error retrying
>>>
>>>
>>>> On 30 Oct 2014, at 9:06 am, Lax Kota (lkota) <lkota@xxxxxxxxx> wrote:
>>>>
>>>>> I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with.
>>>> How do I check the cluster name of a GFS file system? I have a similar configuration running fine in multiple other setups with no such issue.
>>>
>>> I don't really recall. Hopefully someone more familiar with GFS2 can chime in.
>>>
>>>>
>>>> One more issue: in another setup I am seeing a repeated flood of 'A processor joined or left the membership and a new membership was formed' messages every 4 seconds. I am running with default TOTEM settings, with the token timeout at 10 seconds. Even after I increase the token and consensus values, the same message keeps flooding, now at the new consensus interval (e.g. if I increase it to 10 seconds, I see new-membership messages every 10 seconds).
>>>>
>>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0)
>>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service.
>>>>
>>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [CPG ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0)
>>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]: [MAIN ] Completed service synchronization, ready to provide service.
>>>
>>> It does not sound like your network is particularly healthy.
>>> Are you using multicast or udpu? If multicast, it might be worth trying udpu
>>>
>>>>
>>>> Thanks
>>>> Lax
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Andrew Beekhof
>>>> Sent: Wednesday, October 29, 2014 2:42 PM
>>>> To: linux clustering
>>>> Subject: Re: daemon cpg_join error retrying
>>>>
>>>>
>>>>> On 30 Oct 2014, at 8:38 am, Lax Kota (lkota) <lkota@xxxxxxxxx> wrote:
>>>>>
>>>>> Hi All,
>>>>>
>>>>> In one of my setups, I keep getting 'gfs_controld[10744]: daemon cpg_join error retrying'. I have a 2-node setup with pacemaker and corosync.
>>>>
>>>> I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with.
>>>>
>>>>>
>>>>> Even after I force-kill the pacemaker processes, reboot the server, and bring pacemaker back up, it keeps giving the cpg_join error. Is there any way to fix this issue?
>>>>>
>>>>>
>>>>> Thanks
>>>>> Lax
>>>>>
>>>>> --
>>>>> Linux-cluster mailing list
>>>>> Linux-cluster@xxxxxxxxxx
>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster