Re: Corosync 2.3.3 memory leak

zouyu,

> Hi jfriesse,
> I can reproduce this kind of memory leak. I ran into this problem some
> days ago. Below are the reproduction steps:
> 1. install two nodes, n1 and n2.
> 2. on n1, start corosync and run 'testcpg cpg_bm'.
> 3. on n2, start corosync and run 'cpgbench'.
> 4. wait about 2 minutes, and corosync on n1 will be killed after it has
> consumed more than about 75% of memory.
> 
> Actually, a 1-node cluster can also reproduce this problem: run
> 'testcpg cpg_bm' and 'cpgbench' on the same node, and corosync will be
> killed after about 2 minutes.
> 

yep. But this is a totally different problem. What happens in this
scenario is:
- cpgbench emits messages as fast as it can
- testcpg tries to receive them and write them to the console. Console
writing is slow, so corosync queues the dispatch messages -> problem

Actually, if you run testcpg and redirect its output to /dev/null, the
problem will not happen.
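
Just to illustrate what I mean by a slow consumer, here is a minimal
sketch against the standard libcpg API -- a rough, simplified stand-in
for 'testcpg cpg_bm', not the real testcpg code. It assumes the
'cpg_bm' group name from the reproduce steps above and a build along
the lines of 'gcc slow_consumer.c -o slow_consumer -lcpg':

/* slow_consumer.c: deliberately slow CPG consumer (sketch only).
 * While cpgbench floods the group, this process drains its dispatch
 * queue slower than messages arrive, so corosync keeps buffering the
 * undelivered events for this connection and its memory grows. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <corosync/corotypes.h>
#include <corosync/cpg.h>

static void deliver_cb(cpg_handle_t handle, const struct cpg_name *group,
                       uint32_t nodeid, uint32_t pid,
                       void *msg, size_t msg_len)
{
        /* Simulate slow processing, like writing to a console. */
        printf("msg from node %u pid %u, %zu bytes\n", nodeid, pid, msg_len);
        usleep(10000);
}

static void confchg_cb(cpg_handle_t handle, const struct cpg_name *group,
                       const struct cpg_address *members, size_t n_members,
                       const struct cpg_address *left, size_t n_left,
                       const struct cpg_address *joined, size_t n_joined)
{
        /* Membership changes are not interesting for this sketch. */
}

static cpg_callbacks_t callbacks = {
        .cpg_deliver_fn = deliver_cb,
        .cpg_confchg_fn = confchg_cb,
};

int main(void)
{
        cpg_handle_t handle;
        struct cpg_name group;

        strcpy(group.value, "cpg_bm");  /* group used in the steps above */
        group.length = strlen(group.value);

        if (cpg_initialize(&handle, &callbacks) != CS_OK ||
            cpg_join(handle, &group) != CS_OK) {
                fprintf(stderr, "cannot connect to corosync/cpg\n");
                return 1;
        }

        /* Messages arrive faster than we dispatch them, so the
         * server-side dispatch queue for this client only grows. */
        for (;;)
                cpg_dispatch(handle, CS_DISPATCH_ONE);
}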

Sadly, there is really not much we can do about this concrete problem.
We cannot simply throw messages away. I can imagine having some kind of
global cpg queuing limit, but there could still be a malicious app that
decides not to process received events -> effectively a DoS.

So the only thing we can (in theory) do is to kill (disconnect) an
application whose dispatch buffer grows too big. But then there is the
classic question: what is too big? What if some application needs more?
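
To make the idea concrete, here is a purely hypothetical sketch of such
a per-connection limit. This is NOT existing corosync code, and all of
the names (conn_info, conn_queue_bytes, dispatch_queue_limit,
queue_event) are invented just for illustration:

/* Hypothetical policy sketch: track how much data is queued for one
 * IPC client's dispatch buffer and refuse (i.e. let the caller
 * disconnect the client) once a configurable limit is exceeded. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define DISPATCH_QUEUE_LIMIT_DEFAULT (64ULL * 1024 * 1024) /* "too big"? */

struct conn_info {
        uint64_t conn_queue_bytes;      /* bytes currently queued */
        uint64_t dispatch_queue_limit;  /* could come from corosync.conf */
};

/* Called whenever another event is queued for a slow client. */
static bool queue_event(struct conn_info *conn, size_t event_len)
{
        conn->conn_queue_bytes += event_len;
        /* The open question from above: is one limit right for every app? */
        return conn->conn_queue_bytes <= conn->dispatch_queue_limit;
}

int main(void)
{
        struct conn_info c = { 0, DISPATCH_QUEUE_LIMIT_DEFAULT };
        unsigned long events = 0;

        /* Simulate a flood of 1 KiB events until the limit trips. */
        while (queue_event(&c, 1024))
                events++;
        printf("client would be disconnected after %lu queued events\n",
               events);
        return 0;
}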

Regards,
  Honza

> I have tried to use valgrind to catch the leak in corosync, but it
> didn't work and valgrind didn't output anything valuable.
> 
> On 2014-08-06 18:51, "Tomcsányi, Domonkos" wrote:
>> Hello Everyone,
>>
>> I think I might have isolated the problem!
>>
>> Starting from this thread:
>> http://forum.proxmox.com/threads/14263-Proxmox-3-0-Cluster-corosync-running-system-out-of-memory
>>
>>
>> I became suspicious and started to look at my syslog (IP address
>> intentionally changed):
>>
>> Aug 6 12:46:41 db-01 corosync[22339]: [MAIN ] Completed service
>> synchronization, ready to provide service.
>> Aug 6 12:46:45 db-01 corosync[22339]: [TOTEM ] A new membership
>> (1.2.3.4:591376) was formed. Members
>> Aug 6 12:46:45 db-01 corosync[22339]: [QUORUM] Members[1]: 171707020
>> Aug 6 12:46:45 db-01 corosync[22339]: [MAIN ] Completed service
>> synchronization, ready to provide service.
>> Aug 6 12:46:49 db-01 corosync[22339]: [TOTEM ] A new membership
>> (1.2.3.4:591380) was formed. Members
>> Aug 6 12:46:49 db-01 corosync[22339]: [QUORUM] Members[1]: 171707020
>> Aug 6 12:46:49 db-01 corosync[22339]: [MAIN ] Completed service
>> synchronization, ready to provide service.
>> Aug 6 12:46:53 db-01 corosync[22339]: [TOTEM ] A new membership
>> (1.2.3.4:591384) was formed. Members
>> Aug 6 12:46:53 db-01 corosync[22339]: [QUORUM] Members[1]: 171707020
>> Aug 6 12:46:53 db-01 corosync[22339]: [MAIN ] Completed service
>> synchronization, ready to provide service.
>> Aug 6 12:46:56 db-01 corosync[22339]: [TOTEM ] A new membership
>> (1.2.3.4:591388) was formed. Members
>> Aug 6 12:46:56 db-01 corosync[22339]: [QUORUM] Members[1]: 171707020
>> Aug 6 12:46:56 db-01 corosync[22339]: [MAIN ] Completed service
>> synchronization, ready to provide service.
>>
>> Looking at my other setup I don't see any messages like this. So the
>> constant re-forming of the cluster is causing corosync to eat up all
>> the memory. Now I will start investigating at the network level to
>> see what exactly happens there and why the cluster membership keeps
>> changing, but, as the thread mentioned above says, I still think it
>> shouldn't cause such a memory leak.
>>
>> regards,
>> Domonkos
>>
>>
>> On 2014.07.31. 11:37, Jan Friesse wrote:
>>> Domonkos,
>>>
>>>
>>>> On 2014.07.30. 18:10, "Tomcsányi, Domonkos" wrote:
>>>>> On 2014.07.30. 15:51, Jan Friesse wrote:
>>>>>> ok. I was trying to reproduce your bug; sadly, I was not very
>>>>>> successful.
>>>>>>
>>>>>> Can you please try to reconfigure your postgres nodes to a
>>>>>> configuration similar to your apache nodes? This will help me
>>>>>> identify whether the problem happens with the postgres resource
>>>>>> only, or with all resources and is therefore a problem in
>>>>>> corosync/libqb.
>>>>>>
>>>>>> Thanks,
>>>>>> Honza
>>>>>
>>>>> Well, I did my best: I put the nodes into standby, so no resources
>>>>> run on them - no change at all, corosync still eats memory heavily.
>>>>> I think that leaves not much doubt about what is causing it.
>>>>>
>>>>> So here is a way to reproduce it: install Ubuntu 14.04 LTS and
>>>>> install libqb 0.17, either from a PPA or by compiling it.
>>>>>
>>>>> I will now create a clean virtual machine without any resources and
>>>>> see if the same happens.
>>>>>
>>>>> Domonkos
>>>>>
>>>> I couldn't reproduce the issue yet in my clean virtual machines, so
>>>> I'm going to leave corosync running inside valgrind overnight on the
>>>> machines where I had problems and see what happens.
>>>>
>>>
>>> Perfect. Hopefully you will be able to find a reproducer.
>>>
>>> Regards,
>>> Honza
>>>
>>>> Domonkos
>>>>
>>>> _______________________________________________
>>>> discuss mailing list
>>>> discuss@xxxxxxxxxxxx
>>>> http://lists.corosync.org/mailman/listinfo/discuss
>>>
>>
>>
>> _______________________________________________
>> discuss mailing list
>> discuss@xxxxxxxxxxxx
>> http://lists.corosync.org/mailman/listinfo/discuss
>>
> 

_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss




