Re: memory leak -- a bad one

Moving back to the list.

otheus uibk wrote:
Logs from 4 days ago indicate the memory usage back then was stable: 329 MB,
and < 17 MB after a restart. One difference may be that on the 4th of May,
no application was using Corosync. So today, after reverting to the
previous configuration, on the same host, without any other configuration
changes, the RES crept up to 8.7 GB over the span of less than 6 hours.
Strange.

What application are you using? What is "process"?


Sample logs. Again, the columns are time / RSS / SZ (rss in KiB, sz in pages):

23:32:01 8292 21363   # after process started
23:36:01 121584 50821
23:47:01 410976 127128
23:57:01 599244 196763
00:07:01 855768 266144
00:17:02 1203376 335780
00:27:01 1221156 405186
  ...
01:27:01 2963996 821724
02:27:01 4660864 1238277
03:27:01 6495140 1653085
04:27:01 6548324 2068631
05:27:01 9068520 2477199
06:07:01 9104724 2754738


On Thu, May 7, 2015 at 11:25 PM, otheus uibk <otheus.uibk@xxxxxxxxx> wrote:

Reproducible.

libqb is 0.16.0, release 2.el6, on both the systems with and the systems
without the memory-leak problem.

So you are using RHEL 6 with your own corosync 2.3.4 package?

On the memory-leaking systems, nspr is 4.10.2; on the non-leaking systems,
it is 4.10.8.
However, 3 days earlier, before the configuration change, there was no
memory leak.
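
In case it's useful, the versions above come from a standard rpm query
(package names here are the stock RHEL ones):

# Show the installed versions of the relevant packages
rpm -q corosync libqb nspr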

I have a cronjob which captures the output of ps every minute. Here are
the "rss" and "sz" columns for corosync (rss in KiB, sz in pages), with
timestamps:

17:53:01 3952 18271
17:54:02 18496 21889
17:58:01 70588 37823
18:00:01 113904 45790
18:06:01 207264 69177
18:10:01 245152 83857
18:15:01 338212 103154
18:20:01 402860 123200
...
19:00:01 985844 282619
19:10:01 1058220 322711
...
20:00:01 1960572 522400
20:30:01 2289444 642419
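
For reference, the cron entry is roughly the following one-liner; this is
a sketch rather than the exact script, and the log path is illustrative:

# Append "HH:MM:SS RSS SZ" for corosync once a minute (rss in KiB, sz in pages).
# Note: % must be escaped as \% inside a crontab.
* * * * *  echo "$(date +\%H:\%M:\%S) $(ps -o rss=,sz= -C corosync)" >> /var/log/corosync-mem.log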

Current configuration (with IPs smudged):

quorum {
         provider: corosync_votequorum
         expected_votes: 2
}
aisexec {
         user: root
         group: root
}
#service {
#       name: pacemaker
#       ver: 1
#}
totem {
         version: 2

         # if "on" must use shared "corosync-keygen".
         secauth: off
         threads: 2
         rrp_mode: none
         transport: udpu
         interface {
                 bindnetaddr: 172.24.0.0
                 # Rings must be consecutively numbered, starting at 0.
                 ringnumber: 0
                 mcastport: 5561
         }
}

logging {
         fileline: off
         to_stderr: no
         to_logfile: yes
         logfile: /var/log/corosync.log
         to_syslog: no
         debug: off
         timestamp: on
         logger_subsys {
                 subsys: AMF
                 debug: on
         }
}

nodelist {
         node {
                 ring0_addr: 138.x.x.x
         }
         node {
                 ring0_addr: 172.24.2.61
         }
         node {
                 ring0_addr: 172.24.1.61
         }
         node {
                 ring0_addr: 172.24.1.37
         }
         node {
                 ring0_addr: 172.24.2.37
         }
}

Here is a diff of the configuration before/after (again, IPs smudged):

diff --git a/corosync/corosync.conf b/corosync/corosync.conf
index cc9c151..5e3e695 100644
--- a/corosync/corosync.conf
+++ b/corosync/corosync.conf
@@ -23,15 +23,6 @@ totem {
                  # Rings must be consecutively numbered, starting at 0.
                  ringnumber: 0
                 mcastport: 5561
-               member {
-                       memberaddr: 138.x.x.x
-               }
-               member {
-                       memberaddr: 172.24.1.61
-               }
-               member {
-                       memberaddr: 172.24.2.61
-               }
          }
  }

@@ -48,3 +39,21 @@ logging {
                  debug: on
          }
  }
+
+nodelist {
+       node {
+               ring0_addr: 138.x.x.x
+       }
+       node {
+               ring0_addr: 172.24.2.61
+       }
+       node {
+               ring0_addr: 172.24.1.61
+       }
+       node {
+               ring0_addr: 172.24.1.37
+       }
+       node {
+               ring0_addr: 172.24.2.37
+       }
+}


On Thu, May 7, 2015 at 5:10 PM, Jan Friesse <jfriesse@xxxxxxxxxx> wrote:

Otheus,

otheus uibk wrote:

Here is a top excerpt from corosync 2.3.4 after running for under 15 hours:

    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
15406 root      20   0 10.5g 9.2g 9168 S  2.0 58.8  11:50.07 corosync

(I'm using a fixed-width font in Gmail; I have no idea what happens to this
text when it goes through pipermail.)
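
If anyone wants to reproduce that capture non-interactively, a single
batch-mode iteration of top should do it (this assumes corosync runs as a
single process):

# One batch-mode top snapshot limited to the corosync process
top -b -n 1 -p "$(pidof corosync)"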

It's showing a RES usage of 9.2 GB. This high memory usage appeared after a
relatively minor configuration change -- moving the listed nodes from
totem.interface.member { } to nodelist.node { }. Two nodes were also added.
AFAIK these were the only changes.


This looks like a serious issue. Are you able to reproduce it? What
version of libqb are you using?

Regards,
   Honza


A review of the changes since 2.3.4 indicates this issue has not been fixed
since that release.









--
Otheus
otheus.uibk@xxxxxxxxx
otheus.shelling@xxxxxxxxxx





_______________________________________________
discuss mailing list
discuss@xxxxxxxxxxxx
http://lists.corosync.org/mailman/listinfo/discuss



