Re: [Problem] Corosync cannot reconstitute a cluster.

Hideo,
can you please tell me the exact reproduction steps for physical
hardware? (Because brctl delif is, I believe, not applicable to
physical hardware at all.)
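
(For a physical reproducer I would expect something equivalent done
outside the bridge code -- just a sketch of what I mean; the interface
names and the default mcastport 5405 are assumptions on my side:

    # take the ring interfaces down on one node
    ip link set eth1 down; ip link set eth2 down

    # or drop the totem traffic while leaving the links up
    # (corosync uses mcastport and mcastport - 1)
    iptables -A INPUT  -p udp --dport 5404:5405 -j DROP
    iptables -A OUTPUT -p udp --dport 5404:5405 -j DROP
)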

Thanks,
  Honza

renayama19661014@xxxxxxxxx wrote:
> Hi Fabio,
> 
> Thank you for your comment.
> 
>> I'll let Honza look at it, I don't have enough physical hardware to
>> reproduce.
> 
> All right.
> 
> Many Thanks!
> Hideo Yamauchi.
> 
> 
> --- On Tue, 2013/6/11, Fabio M. Di Nitto <fdinitto@xxxxxxxxxx> wrote:
> 
>> Hi Yamauchi-san,
>>
>> I'll let Honza look at it, I don't have enough physical hardware to
>> reproduce.
>>
>> Fabio
>>
>> On 06/11/2013 01:15 AM, renayama19661014@xxxxxxxxx wrote:
>>> Hi Fabio,
>>>
>>> Thank you for your comments.
>>>
>>> We confirmed this problem in a physical environment as well.
>>> The corosync communication goes through eth1 and eth2.
>>>
>>> -------------------------------------------------------
>>> [root@bl460g6a ~]# ip addr show
>>> (snip)
>>> 3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
>>>      link/ether f4:ce:46:b3:fe:3c brd ff:ff:ff:ff:ff:ff
>>>      inet 192.168.101.9/24 brd 192.168.101.255 scope global eth1
>>>      inet6 fe80::f6ce:46ff:feb3:fe3c/64 scope link 
>>>         valid_lft forever preferred_lft forever
>>> 4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP qlen 1000
>>>      link/ether 18:a9:05:78:6c:f0 brd ff:ff:ff:ff:ff:ff
>>>      inet 192.168.102.9/24 brd 192.168.102.255 scope global eth2
>>>      inet6 fe80::1aa9:5ff:fe78:6cf0/64 scope link 
>>>         valid_lft forever preferred_lft forever
>>> (snip)
>>> 8: virbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN 
>>>      link/ether 52:54:00:7f:f3:0a brd ff:ff:ff:ff:ff:ff
>>>      inet 192.168.122.1/24 brd 192.168.122.255 scope global virbr0
>>> 9: virbr0-nic: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 500
>>>      link/ether 52:54:00:7f:f3:0a brd ff:ff:ff:ff:ff:ff
>>> -------------------------------------------------------
>>>
>>> I think this is not a problem specific to the virtual environment.
>>>
>>> Just to make sure, I attach the logs I confirmed on three blade servers (RHEL 6.4).
>>> * I cut off the communication at the network switch.
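>>>
>>> (To double-check that the totem traffic really stops on the cut
>>> interfaces, something like this can be used -- a sketch, assuming the
>>> default mcastport 5405; the actual port and multicast address come
>>> from corosync.conf:
>>>
>>>      # watch totem multicast on one ring
>>>      [root@bl460g6a ~]# tcpdump -ni eth1 udp port 5405
>>>
>>>      # list the multicast groups joined on the interface
>>>      [root@bl460g6a ~]# ip maddr show dev eth1
>>> )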
>>>
>>> The phenomenon is similar: one node loops, repeatedly re-forming the
>>> OPERATIONAL state, while the other two nodes never transition to the
>>> OPERATIONAL state.
>>>
>>> After all, is this problem the same as the bug that you mentioned?
>>>> Check this thread as reference:
>>>> http://lists.linuxfoundation.org/pipermail/openais/2013-April/016792.html
>>>
>>>
>>> Best Regards,
>>> Hideo Yamauchi.
>>>
>>>
>>>
>>> --- On Fri, 2013/5/31, Fabio M. Di Nitto <fdinitto@xxxxxxxxxx> wrote:
>>>
>>>> On 5/31/2013 7:12 AM, renayama19661014@xxxxxxxxx wrote:
>>>>> Hi All,
>>>>>
>>>>> We discovered a problem in corosync's network communication.
>>>>>
>>>>> We composed a cluster of three nodes with corosync on KVM.
>>>>>
>>>>> Step 1) Start the corosync service on all nodes.
>>>>>
>>>>> Step 2) Confirm that the cluster is formed from all nodes and has reached the OPERATIONAL state.
>>>>>
>>>>> Step 3) Cut off the network of node1 (rh64-coro1) and node2 (rh64-coro2) from the KVM host.
>>>>>
>>>>>          [root@kvm-host ~]# brctl delif virbr3 vnet5;brctl delif virbr2 vnet1
>>>>>
>>>>> Step 4) Because the problem occurred, we stopped all nodes.
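>>>>>
>>>>> (To undo the cut after the test the ports can simply be re-attached;
>>>>> the vnetX names are specific to our KVM host:
>>>>>
>>>>>          [root@kvm-host ~]# brctl addif virbr3 vnet5;brctl addif virbr2 vnet1
>>>>> )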
>>>>>
>>>>>
>>>>> The problem occurs at step 3.
>>>>>
>>>>> One node (rh64-coro1) keeps cycling through states even after it has reached the OPERATIONAL state.
>>>>>
>>>>> The other two nodes (rh64-coro2 and rh64-coro3) keep changing state.
>>>>> They never seem to reach the OPERATIONAL state while the first node is running.
>>>>>
>>>>> This means that the two nodes (rh64-coro2 and rh64-coro3) cannot complete cluster formation.
>>>>> When this network failure happens in a configuration where corosync is combined with Pacemaker, corosync cannot notify Pacemaker of the change in cluster membership.
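>>>>>
>>>>> (For anyone reproducing this, the membership can be watched with the
>>>>> standard tools -- a sketch, assuming the corosync 1.x tools shipped
>>>>> with RHEL 6; the exact output differs by version:
>>>>>
>>>>>          # ring status as corosync sees it
>>>>>          [root@rh64-coro1 ~]# corosync-cfgtool -s
>>>>>
>>>>>          # current membership from the object database
>>>>>          [root@rh64-coro1 ~]# corosync-objctl runtime.totem.pg.mrp.srp.members
>>>>> )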
>>>>>
>>>>>
>>>>> Question 1) Are there any parameters in corosync.conf that solve this problem?
>>>>>    * We think it could be avoided by bonding the interfaces into one and setting "rrp_mode: none", but we do not want to use "rrp_mode: none". (A sketch of the two-ring configuration we mean follows after Question 2.)
>>>>>
>>>>> Question 2) Is this a bug? Or is it the specified behavior of corosync's communication?
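>>>>>
>>>>> For reference on Question 1, a minimal sketch of the two-ring totem
>>>>> configuration we mean -- the bindnetaddr, mcastaddr/mcastport values
>>>>> and the rrp_mode setting here are examples, not our exact settings:
>>>>>
>>>>> -------------------------------------------------------
>>>>> totem {
>>>>>         version: 2
>>>>>         rrp_mode: passive
>>>>>         interface {
>>>>>                 ringnumber: 0
>>>>>                 bindnetaddr: 192.168.101.0
>>>>>                 mcastaddr: 239.255.1.1
>>>>>                 mcastport: 5405
>>>>>         }
>>>>>         interface {
>>>>>                 ringnumber: 1
>>>>>                 bindnetaddr: 192.168.102.0
>>>>>                 mcastaddr: 239.255.2.1
>>>>>                 mcastport: 5405
>>>>>         }
>>>>> }
>>>>> -------------------------------------------------------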
>>>>
>>>> We already checked this specific test, and it appears to be a bug in
>>>> the kernel bridge code when handling multicast traffic (groups are not
>>>> joined correctly and traffic is not forwarded).
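>>>>
>>>> If it is that snooping bug, one thing worth trying (an assumption on
>>>> my side, not verified for this case) is to disable multicast snooping
>>>> on the host bridges:
>>>>
>>>>      # on the KVM host, for each bridge carrying totem traffic
>>>>      echo 0 > /sys/class/net/virbr2/bridge/multicast_snooping
>>>>      echo 0 > /sys/class/net/virbr3/bridge/multicast_snooping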
>>>>
>>>> Check this thread as reference:
>>>> http://lists.linuxfoundation.org/pipermail/openais/2013-April/016792.html
>>>>
>>>> Thanks
>>>> Fabio
>>>>
>>>>
>>
>>
> 

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss



