Re: daemon cpg_join error retrying

> 
>> On 30 Oct 2014, at 9:32 am, Lax Kota (lkota) <lkota@xxxxxxxxx> wrote:
>>
>>
>>>>> I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with.
>>>> How do I check the cluster name of a GFS file system? I have similar configurations running fine in multiple other setups with no such issue.
>>
>>> I don't really recall. Hopefully someone more familiar with GFS2 can chime in.
>> Ok.
>>
>>>>
>>>> There is also one more issue I am seeing in another setup: a repeated flood of 'A processor joined or left the membership and a new membership was formed' messages every 4 secs. I am running with default TOTEM settings and a token timeout of 10 secs. Even after I increase the token and consensus values, it keeps flooding the same message at the newly defined consensus interval (e.g. if I increase it to 10 secs, then I see the 'new membership formed' message every 10 secs).
>>>>
>>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]:   [CPG   ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0)
>>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]:   [MAIN  ] Completed service synchronization, ready to provide service.
>>>>
>>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]:   [CPG   ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0)
>>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]:   [MAIN  ] Completed service synchronization, ready to provide service.
>>
>>> It does not sound like your network is particularly healthy.
>>> Are you using multicast or udpu? If multicast, it might be worth trying udpu.
>>
>> I am using udpu and I also have the firewall opened for ports 5404 & 5405. Tcpdump looks fine too; it does not show any issues. This is a VM environment, and even if I switch to another node within the same VM environment I keep getting the same failure.
> 
> Depending on what the host and VMs are doing, that might be your problem.
> In any case, I will defer to the corosync guys at this point.
> 

Lax,
the usual reasons for this problem are:
1. The MTU is too high and fragmented packets are not getting through
(take a look at the netmtu configuration option; see the sketch after
this list).
2. The config files on the nodes are not in sync, and one node may
contain more node entries than the other nodes (this may also be the
case if you have two clusters and one cluster's config contains an
entry for a node from the other cluster).
3. The firewall is asymmetrically blocked (so a node can send but not
receive). Also keep in mind that opening only ports 5404 & 5405 may not
be enough for udpu, because udpu uses one socket per remote node for
sending.
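For point 1, a minimal sketch of lowering netmtu in corosync.conf (the
value 1400 is only illustrative; pick something below the smallest MTU
on the path between the nodes):

    totem {
        version: 2
        # Keep corosync's packets small enough that the network never
        # has to fragment them; the default netmtu is 1500.
        netmtu: 1400
    }

In a cman-based cluster.conf the equivalent is a netmtu attribute on
the totem element: <totem netmtu="1400"/>.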

I would recommend disabling the firewall completely (for testing); if
everything then works, you just need to adjust the firewall rules (see
the sketch below).
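For example (a sketch, assuming an iptables-based firewall as on
RHEL 6; adjust the commands to your distribution):

    # Temporarily stop the firewall on every node:
    service iptables stop

    # If the membership flood stops, turn the firewall back on with a
    # rule that matches only the destination ports and leaves source
    # ports unrestricted (see point 3 above: udpu's per-node sending
    # sockets are not bound to 5404/5405):
    service iptables start
    iptables -I INPUT -p udp --dport 5404:5405 -j ACCEPT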

Regards,
  Honza



>>
>> Thanks
>> Lax
>>
>>
>>
>> -----Original Message-----
>> From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Andrew Beekhof
>> Sent: Wednesday, October 29, 2014 3:17 PM
>> To: linux clustering
>> Subject: Re:  daemon cpg_join error retrying
>>
>>
>>> On 30 Oct 2014, at 9:06 am, Lax Kota (lkota) <lkota@xxxxxxxxx> wrote:
>>>
>>>> I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with.
>>> How do I check the cluster name of a GFS file system? I have similar configurations running fine in multiple other setups with no such issue.
>>
>> I don't really recall. Hopefully someone more familiar with GFS2 can chime in.
>>
>>>
>>> There is also one more issue I am seeing in another setup: a repeated flood of 'A processor joined or left the membership and a new membership was formed' messages every 4 secs. I am running with default TOTEM settings and a token timeout of 10 secs. Even after I increase the token and consensus values, it keeps flooding the same message at the newly defined consensus interval (e.g. if I increase it to 10 secs, then I see the 'new membership formed' message every 10 secs).
>>>
>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]:   [CPG   ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0)
>>> Oct 29 14:58:10 VSM76-VSOM64 corosync[28388]:   [MAIN  ] Completed service synchronization, ready to provide service.
>>>
>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]:   [CPG   ] chosen downlist: sender r(0) ip(172.28.0.64) ; members(old:2 left:0)
>>> Oct 29 14:58:14 VSM76-VSOM64 corosync[28388]:   [MAIN  ] Completed service synchronization, ready to provide service.
>>
>> It does not sound like your network is particularly healthy.
>> Are you using multicast or udpu? If multicast, it might be worth trying udpu.
>>
>>>
>>> Thanks
>>> Lax
>>>
>>>
>>> -----Original Message-----
>>> From: linux-cluster-bounces@xxxxxxxxxx [mailto:linux-cluster-bounces@xxxxxxxxxx] On Behalf Of Andrew Beekhof
>>> Sent: Wednesday, October 29, 2014 2:42 PM
>>> To: linux clustering
>>> Subject: Re:  daemon cpg_join error retrying
>>>
>>>
>>>> On 30 Oct 2014, at 8:38 am, Lax Kota (lkota) <lkota@xxxxxxxxx> wrote:
>>>>
>>>> Hi All,
>>>>
>>>> In one of my setups, I keep getting 'gfs_controld[10744]: daemon cpg_join error retrying'. I have a 2-node setup with pacemaker and corosync.
>>>
>>> I wonder if there is a mismatch between the cluster name in cluster.conf and the cluster name the GFS filesystem was created with.
>>>
>>>>
>>>> Even after I force-kill the pacemaker processes, reboot the server, and bring pacemaker back up, it keeps giving the cpg_join error. Is there any way to fix this issue?
>>>>
>>>>
>>>> Thanks
>>>> Lax
>>>>

-- 
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster



