Hi jfriesse,

OK, I tested it again with a small round of cpgbench. The memory usage of corosync reached a stable state in the end, and there is no memory leak in this scenario.
Thanks for your help.
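For anyone repeating this test, a small standalone RSS watcher can make the "stable state" easier to see than eyeballing top. The sketch below is not part of the corosync tree and the file name is made up; it simply prints VmRSS from /proc/<pid>/status once per second, which is the same figure top reports as RES.

/* rss_watch.c - illustrative helper, not corosync code: print the
 * resident set size (VmRSS) of a PID once per second so memory growth
 * or stabilization during a cpgbench run is easy to follow.
 * Build: cc -o rss_watch rss_watch.c
 * Use:   ./rss_watch $(pidof corosync)
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	char path[64], line[256];

	if (argc != 2) {
		fprintf(stderr, "usage: %s <pid>\n", argv[0]);
		return 1;
	}
	snprintf(path, sizeof(path), "/proc/%s/status", argv[1]);

	for (;;) {
		FILE *f = fopen(path, "r");

		if (f == NULL) {
			perror(path);
			return 1;
		}
		while (fgets(line, sizeof(line), f) != NULL) {
			if (strncmp(line, "VmRSS:", 6) == 0) {
				fputs(line, stdout);	/* e.g. "VmRSS:  123456 kB" */
				break;
			}
		}
		fclose(f);
		sleep(1);
	}
}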
On 08/07/2014 04:36 PM, Jan Friesse wrote:

zouyu,

zouyu wrote:
> jfriesse, yes, I use top to check the memory usage of corosync. After I
> stop cpgbench, wait for some time and then restart cpgbench, corosync
> takes more memory than before, going from 70% to 75%, and is then
> killed by the system. Corosync does not hold the same portion of
> memory.

OK, I was testing it just now, and after some time corosync holds a steady amount of memory (as expected). I've changed cpgbench:

diff --git a/test/cpgbench.c b/test/cpgbench.c
index 23fc2b0..ec13302 100644
--- a/test/cpgbench.c
+++ b/test/cpgbench.c
@@ -180,7 +180,7 @@ int main (void) {
 		exit (1);
 	}
 
-	for (i = 0; i < 10; i++) { /* number of repetitions - up to 50k */
+	for (i = 0; i < 2; i++) { /* number of repetitions - up to 50k */
 		cpg_benchmark (handle, size);
 		signal (SIGALRM, sigalrm_handler);
 		size *= 5;

So one run of it will not take all the memory. Then exec testcpg and cpgbench (a complete run). After 2-3 cpgbench runs (and of course letting testcpg process ALL messages), memory usage stops increasing and either holds still or decreases.

Honza

> Note, I restarted cpgbench and let it run for a shorter time than
> before, so that we can check whether the memory goes up.

On 08/07/2014 03:01 PM, Jan Friesse wrote:

zouyu,

> Hi jfriesse,
>
> I changed the subject to distinguish this from the original 'memory
> leak' problem. In this kind of case, after I stop the cpgbench program,
> wait for testcpg to deliver and display all the messages to the
> console, and then wait for about 1 hour, corosync still holds the
> memory it occupies. In my test, when corosync held about 70% of the
> memory in the system, I stopped cpgbench and waited for testcpg to
> display all remaining messages to the console, but corosync did not
> shrink the memory it occupies; it still held 70% of the memory.

This is a classic problem. How are you measuring it (I believe with top or something like that, right?)? The problem is that a free() call in glibc doesn't always mean the memory is really freed; glibc (or libc in general) still holds on to some memory. So give the following scenario a try:

- exec cpgbench and testcpg as in your scenario
- stop cpgbench and wait for testcpg to process all messages (don't turn testcpg off)
- exec cpgbench again

What you should see (and if not, there really is probably a bug) is corosync still holding the same portion of memory.

Regards,
  Honza

> Is this kind of behavior acceptable, or should corosync shrink the
> queue it uses and free some memory back to the system?
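To see the libc behaviour Honza describes in isolation, here is a toy program (nothing corosync-specific; the sizes and file name are made up for the example). Each cycle allocates and frees roughly 195 MB; watched from top, the RSS jumps up on the first cycle and then stays roughly flat, because glibc reuses the memory it kept after free() instead of returning it to the kernel and requesting it again - the same "holds the same portion of memory" pattern described above.

/* malloc_reuse.c - toy illustration, not corosync code: repeated
 * allocate/free cycles keep a process's RSS roughly constant, because
 * glibc retains memory released with free() and reuses it for the next
 * round of allocations instead of returning it to the kernel.
 * Watch the process in top while stepping through the cycles.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NBLOCKS 200000
#define BLOCKSZ 1024

int main(void)
{
	static char *blocks[NBLOCKS];
	int cycle, i;

	for (cycle = 1; cycle <= 5; cycle++) {
		for (i = 0; i < NBLOCKS; i++) {
			blocks[i] = malloc(BLOCKSZ);
			memset(blocks[i], 0xff, BLOCKSZ);	/* touch pages so they count in RSS */
		}
		for (i = 0; i < NBLOCKS; i++) {
			free(blocks[i]);
		}
		printf("cycle %d: allocated and freed ~%d MB; "
		       "RSS in top should stay roughly flat - press Enter\n",
		       cycle, (NBLOCKS * BLOCKSZ) >> 20);
		getchar();
	}
	return 0;
}

glibc can be asked to hand free pages back with malloc_trim(3), but it generally keeps recently freed memory around for reuse, so a large but stable RSS after a cpgbench burst is not by itself evidence of a leak.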
On 08/06/2014 09:38 PM, Jan Friesse wrote:

zouyu,

> Hi jfriesse,
>
> I can reproduce this kind of memory leak; I ran into this problem some
> days ago. Here are the steps to reproduce it:
> 1. Install two nodes, n1 and n2.
> 2. On n1, start corosync and run 'testcpg cpg_bm'.
> 3. On n2, start corosync and run 'cpgbench'.
> 4. Wait about 2 minutes, and corosync on n1, by then holding more than
>    75% of the memory, will be killed.
> Actually, a 1-node cluster can also reproduce this problem: run
> 'testcpg cpg_bm' and 'cpgbench' on the same node, and corosync will be
> killed after about 2 minutes.

Yep. But this is a totally different problem. What happens in this scenario is:

- cpgbench emits messages as fast as it can
- testcpg tries to receive them and write them to the console; console writing is slow, so corosync queues the dispatch messages -> problem

Actually, if you run testcpg with its output redirected to /dev/null, it will not happen.

Sadly, there is really not much we can do about this concrete problem. We cannot simply throw messages away. I can imagine having some kind of global cpg queuing limit, but there could still be a malicious app that decides not to process received events -> an effective DoS. So the only thing we could (in theory) do is kill (disconnect) an application whose dispatch buffer gets too big. But then there is the classic question: what is too big? What if some application needs more?

Regards,
  Honza
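To make the fast-producer/slow-consumer point concrete, below is a minimal CPG subscriber sketch written against the public libcpg API. It is not the testcpg from the corosync tree; the group name "cpg_bm" is simply the one used in the reproduction steps above, and the build line is only a guess at typical flags. Everything hinges on how fast the deliver callback returns: print one line per message to a terminal, as testcpg effectively does, and corosync must buffer the backlog for that connection; count messages silently, or redirect output to /dev/null, and it keeps up.

/* slow_consumer.c - minimal CPG subscriber sketch (illustrative, not
 * the testcpg from the corosync tree).  It joins the "cpg_bm" group and
 * counts delivered messages, printing a summary only occasionally so
 * the deliver callback stays fast.  A callback that writes one line per
 * message to a terminal is what makes a consumer fall behind and forces
 * corosync to queue undelivered messages for that connection.
 * Possible build line: cc slow_consumer.c -lcpg -o slow_consumer
 */
#include <inttypes.h>
#include <stdio.h>
#include <string.h>

#include <corosync/corotypes.h>
#include <corosync/cpg.h>

static uint64_t received;

static void deliver_cb(cpg_handle_t handle, const struct cpg_name *group,
		       uint32_t nodeid, uint32_t pid, void *msg, size_t msg_len)
{
	received++;
	if (received % 100000 == 0) {
		printf("received %" PRIu64 " messages (last length %zu)\n",
		       received, msg_len);
	}
}

static void confchg_cb(cpg_handle_t handle, const struct cpg_name *group,
		       const struct cpg_address *members, size_t n_members,
		       const struct cpg_address *left, size_t n_left,
		       const struct cpg_address *joined, size_t n_joined)
{
	printf("membership change: %zu member(s)\n", n_members);
}

int main(void)
{
	cpg_callbacks_t callbacks = {
		.cpg_deliver_fn = deliver_cb,
		.cpg_confchg_fn = confchg_cb,
	};
	struct cpg_name group;
	cpg_handle_t handle;
	cs_error_t err;

	err = cpg_initialize(&handle, &callbacks);
	if (err != CS_OK) {
		fprintf(stderr, "cpg_initialize failed: %d\n", err);
		return 1;
	}

	strcpy(group.value, "cpg_bm");
	group.length = strlen(group.value);

	err = cpg_join(handle, &group);
	if (err != CS_OK) {
		fprintf(stderr, "cpg_join failed: %d\n", err);
		return 1;
	}

	/* Run callbacks until corosync goes away or we are interrupted. */
	cpg_dispatch(handle, CS_DISPATCH_BLOCKING);
	cpg_finalize(handle);
	return 0;
}

Running something like this instead of testcpg while cpgbench floods the group should keep corosync's memory bounded, which matches the /dev/null observation above.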
> I have tried to use valgrind to catch corosync, but it doesn't work and
> valgrind doesn't output anything valuable.

On 08/06/2014 18:51, "Tomcsányi, Domonkos" wrote:

Hello Everyone,

I think I might have isolated the problem! Starting from this thread:

http://forum.proxmox.com/threads/14263-Proxmox-3-0-Cluster-corosync-running-system-out-of-memory

I became suspicious and started to look at my syslog (IP address intentionally changed):

Aug 6 12:46:41 db-01 corosync[22339]: [MAIN ] Completed service synchronization, ready to provide service.
Aug 6 12:46:45 db-01 corosync[22339]: [TOTEM ] A new membership (1.2.3.4:591376) was formed. Members
Aug 6 12:46:45 db-01 corosync[22339]: [QUORUM] Members[1]: 171707020
Aug 6 12:46:45 db-01 corosync[22339]: [MAIN ] Completed service synchronization, ready to provide service.
Aug 6 12:46:49 db-01 corosync[22339]: [TOTEM ] A new membership (1.2.3.4:591380) was formed. Members
Aug 6 12:46:49 db-01 corosync[22339]: [QUORUM] Members[1]: 171707020
Aug 6 12:46:49 db-01 corosync[22339]: [MAIN ] Completed service synchronization, ready to provide service.
Aug 6 12:46:53 db-01 corosync[22339]: [TOTEM ] A new membership (1.2.3.4:591384) was formed. Members
Aug 6 12:46:53 db-01 corosync[22339]: [QUORUM] Members[1]: 171707020
Aug 6 12:46:53 db-01 corosync[22339]: [MAIN ] Completed service synchronization, ready to provide service.
Aug 6 12:46:56 db-01 corosync[22339]: [TOTEM ] A new membership (1.2.3.4:591388) was formed. Members
Aug 6 12:46:56 db-01 corosync[22339]: [QUORUM] Members[1]: 171707020
Aug 6 12:46:56 db-01 corosync[22339]: [MAIN ] Completed service synchronization, ready to provide service.

Looking at my other setup I don't see any messages like this. So the constant re-forming of the cluster is what is causing corosync to eat up all the memory. I will now start investigating at the network level to see what exactly happens there and why the membership keeps changing, but still, as the thread mentioned above says, I don't think it should cause such a memory leak.

Regards,
Domonkos

On 2014.07.31. 11:37, Jan Friesse wrote:

Domonkos,

> On 2014.07.30. 18:10, "Tomcsányi, Domonkos" wrote:
>
>> On 2014.07.30. 15:51, Jan Friesse wrote:
>>
>>> OK. I was trying to reproduce your bug; sadly I was not very
>>> successful. Can you please try to reconfigure your postgres nodes to a
>>> configuration similar to the one on your apache nodes? This will help
>>> me identify whether the problem happens with the postgres resource
>>> only or with all resources, and whether it is a problem in
>>> corosync/libqb.
>>>
>>> Thanks,
>>>   Honza
>>
>> Well, I did my best: I put the nodes into standby, so no resources run
>> on them - no change at all, corosync still eats memory heavily. I think
>> that leaves little doubt about what is causing it. So here is a way to
>> reproduce it: install Ubuntu 14.04 LTS and install libqb 0.17, either
>> from a PPA or by compiling it. I will now create a clean virtual
>> machine without any resources and see if the same happens.
>>
>> Domonkos
>
> Couldn't reproduce the issue in my clean virtual machines yet, so I'm
> going to leave corosync running inside valgrind overnight on the
> machines where I had problems and see what happens.
>
> Domonkos

Perfect. Hopefully you will be able to find a reproducer.

Regards,
  Honza
_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss