Re: [Pacemaker] What is the reason which the node in which failure has not occurred carries out "lost"?

yusuke iida <yusk.iida@xxxxxxxxx> · Tue, 18 Feb 2014 19:53:05 +0900

Hi Andrew and Digimer,

Thank you for your comments.

I solved this problem by referring to information from another mailing list:
https://bugzilla.redhat.com/show_bug.cgi?id=880035

In short, it seems that the kernel in my environment was too old.
I have now updated it to the newest kernel:
kernel-2.6.32-431.5.1.el6.x86_64.rpm

The following parameters are now set on the bridge that carries the
corosync traffic.
As a result, "Retransmit List" messages almost never occur any more.
# echo 1 > /sys/class/net/<bridge>/bridge/multicast_querier
# echo 0 > /sys/class/net/<bridge>/bridge/multicast_snooping
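
For anyone who wants to apply the same two settings to a bridge in one step, here is a minimal sketch as a shell function. The function name set_bridge_multicast and the SYSFS_NET variable are my own assumptions for illustration (SYSFS_NET only exists so the function can be tried against a scratch directory); the two sysfs files are the ones shown above.

```shell
#!/bin/sh
# Sketch only: write the two bridge multicast settings from this mail.
# set_bridge_multicast and SYSFS_NET are hypothetical names, not part of
# any tool. SYSFS_NET defaults to the real sysfs location.
set_bridge_multicast() {
    bridge="$1"
    dir="${SYSFS_NET:-/sys/class/net}/$bridge/bridge"

    # A device without a bridge/ directory is not a Linux bridge.
    if [ ! -d "$dir" ]; then
        echo "not a bridge: $bridge" >&2
        return 1
    fi

    # Enable the multicast querier and disable multicast snooping.
    echo 1 > "$dir/multicast_querier" || return 1
    echo 0 > "$dir/multicast_snooping" || return 1

    # Read the values back so the result is visible.
    echo "$bridge: querier=$(cat "$dir/multicast_querier")" \
         "snooping=$(cat "$dir/multicast_snooping")"
}
```

It would be called as root with the bridge name, for example "set_bridge_multicast br0" (br0 is a placeholder). Values written through sysfs do not survive a reboot, so they have to be reapplied at boot time.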

2014-02-18 9:49 GMT+09:00 Andrew Beekhof <andrew@xxxxxxxxxxx>:
>
> On 31 Jan 2014, at 6:20 pm, yusuke iida <yusk.iida@xxxxxxxxx> wrote:
>
>> Hi, all
>>
>> I am measuring the performance of Pacemaker with the following combination:
>> Pacemaker-1.1.11.rc1
>> libqb-0.16.0
>> corosync-2.3.2
>>
>> All nodes are KVM virtual machines.
>>
>> After starting 14 nodes, I forcibly stopped the node vm01 from the inside.
>> "virsh destroy vm01" was used to stop it.
>> Then, in addition to the forcibly stopped node, other nodes were also separated from the cluster.
>>
>> corosync then outputs a large number of "Retransmit List:" log messages.
>
> Probably best to poke the corosync guys about this.
>
> However, <= .11 is known to cause significant CPU usage with that many nodes.
> I can easily imagine this starving corosync of resources and causing breakage.
>
> I would _highly_ recommend retesting with the current git master of pacemaker.
> I merged the new cib code last week which is faster by _two_ orders of magnitude and uses significantly less CPU.
>
> I'd be interested to hear your feedback.
I am very interested in this, so I would like to test it even though the
"Retransmit List" problem has been solved.
Please give me a little time to report the results.

Thanks,
Yusuke

>
>>
>> Why are nodes on which no failure has occurred marked as "lost"?
>>
>> Please advise if there is a problem with my setup.
>>
>> I attached the report from when the problem occurred.
>> https://drive.google.com/file/d/0BwMFJItoO-fVMkFWWWlQQldsSFU/edit?usp=sharing
>>
>> Regards,
>> Yusuke
>> --
>> ----------------------------------------
>> METRO SYSTEMS CO., LTD
>>
>> Yusuke Iida
>> Mail: yusk.iida@xxxxxxxxx
>> ----------------------------------------
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker@xxxxxxxxxxxxxxxxxxx
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>
>

-- 
----------------------------------------
METRO SYSTEMS CO., LTD

Yusuke Iida
Mail: yusk.iida@xxxxxxxxx
----------------------------------------
_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss