On 07/24/2015 02:31 PM, Luis Periquito wrote:
Now it's official, I have a weird one!
Restarted one of the ceph-mons with jemalloc and it didn't make any
difference. It's still using a lot of cpu and still not freeing up memory...
The issue is that the cluster almost stops responding to requests, and
if I restart the primary mon (that had almost no memory usage nor cpu)
the cluster goes back to its merry way responding to requests.
Does anyone have any idea what may be going on? The worst bit is that I
have several clusters just like this (well they are smaller), and as we
do everything with puppet, they should be very similar... and all the
other clusters are just working fine, without any issues whatsoever...
We've seen cases where leveldb can't compact fast enough and memory
balloons, but it's usually associated with extreme CPU usage as well.
It would be showing up in perf though if that were the case...
On 24 Jul 2015 10:11, "Jan Schermer" <jan@xxxxxxxxxxx> wrote:
You don’t (shouldn’t) need to rebuild the binary to use jemalloc. It
should be possible to do something like
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1 ceph-osd …
The last time we tried this, it segfaulted after a few minutes, so YMMV
and be careful.
Jan
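A minimal sketch of the LD_PRELOAD approach described above, for anyone who launches daemons from a script rather than a shell. The helper name preload_env is hypothetical, the library path is the one quoted in the message and will differ between distributions, and nothing is actually started here.

```python
import os

def preload_env(lib_path, base_env=None):
    """Return a copy of the environment with lib_path prepended to LD_PRELOAD."""
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    # The dynamic linker accepts a colon-separated list in LD_PRELOAD.
    env["LD_PRELOAD"] = f"{lib_path}:{existing}" if existing else lib_path
    return env

# Path taken from the message above; adjust for your distribution.
JEMALLOC = "/usr/lib/x86_64-linux-gnu/libjemalloc.so.1"

env = preload_env(JEMALLOC, base_env={"PATH": "/usr/bin"})
print(env["LD_PRELOAD"])
# To launch a process with this environment you would pass env=env to
# subprocess.Popen; that step is left out given the segfault warning.
```

Building the environment separately keeps the preload scoped to one child process instead of the whole session.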
On 23 Jul 2015, at 18:18, Luis Periquito <periquito@xxxxxxxxx> wrote:
Hi Greg,
I've been looking at the tcmalloc issues, but those seemed to affect
OSDs, and I do notice them in heavy read workloads (even after the
patch and increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728).
This is affecting the mon process though.
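For scale, the thread cache value quoted above can be converted to MiB, the unit the heap stats use. This is only a quick arithmetic check; the helper name to_mib is made up for illustration.

```python
def to_mib(n_bytes):
    """Convert a byte count to mebibytes (1 MiB = 2**20 bytes)."""
    return n_bytes / 2**20

# Value of TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES from the message above.
print(to_mib(134217728))  # 128.0
```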
Looking at perf top I'm getting most of the CPU usage in mutex
lock/unlock
  5.02%  libpthread-2.19.so  [.] pthread_mutex_unlock
  3.82%  libsoftokn3.so      [.] 0x000000000001e7cb
  3.46%  libpthread-2.19.so  [.] pthread_mutex_lock
I could try to use jemalloc, are you aware of any built binaries?
Can I mix a cluster with different malloc binaries?
On Thu, Jul 23, 2015 at 10:50 AM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
On Thu, Jul 23, 2015 at 8:39 AM, Luis Periquito <periquito@xxxxxxxxx> wrote:
> The ceph-mon is already taking a lot of memory, and I ran a heap stats
> ------------------------------------------------
> MALLOC:       32391696 (   30.9 MiB) Bytes in use by application
> MALLOC: +  27597135872 (26318.7 MiB) Bytes in page heap freelist
> MALLOC: +     16598552 (   15.8 MiB) Bytes in central cache freelist
> MALLOC: +     14693536 (   14.0 MiB) Bytes in transfer cache freelist
> MALLOC: +     17441592 (   16.6 MiB) Bytes in thread cache freelists
> MALLOC: +    116387992 (  111.0 MiB) Bytes in malloc metadata
> MALLOC:   ------------
> MALLOC: =  27794649240 (26507.0 MiB) Actual memory used (physical + swap)
> MALLOC: +     26116096 (   24.9 MiB) Bytes released to OS (aka unmapped)
> MALLOC:   ------------
> MALLOC: =  27820765336 (26531.9 MiB) Virtual address space used
> MALLOC:
> MALLOC:           5683              Spans in use
> MALLOC:             21              Thread heaps in use
> MALLOC:           8192              Tcmalloc page size
> ------------------------------------------------
>
> after that I ran the heap release and it went back to normal.
> ------------------------------------------------
> MALLOC:       22919616 (   21.9 MiB) Bytes in use by application
> MALLOC: +      4792320 (    4.6 MiB) Bytes in page heap freelist
> MALLOC: +     18743448 (   17.9 MiB) Bytes in central cache freelist
> MALLOC: +     20645776 (   19.7 MiB) Bytes in transfer cache freelist
> MALLOC: +     18456088 (   17.6 MiB) Bytes in thread cache freelists
> MALLOC: +    116387992 (  111.0 MiB) Bytes in malloc metadata
> MALLOC:   ------------
> MALLOC: =    201945240 (  192.6 MiB) Actual memory used (physical + swap)
> MALLOC: +  27618820096 (26339.4 MiB) Bytes released to OS (aka unmapped)
> MALLOC:   ------------
> MALLOC: =  27820765336 (26531.9 MiB) Virtual address space used
> MALLOC:
> MALLOC:           5639              Spans in use
> MALLOC:             29              Thread heaps in use
> MALLOC:           8192              Tcmalloc page size
> ------------------------------------------------
>
> So it just seems the monitor is not returning unused memory into the OS or
> reusing already allocated memory it deems as free...
Yep. This is a bug (best we can tell) in some versions of tcmalloc
combined with certain distribution stacks, although I don't think
we've seen it reported on Trusty (nor on a tcmalloc distribution that
new) before. Alternatively some folks are seeing tcmalloc use up lots
of CPU in other scenarios involving memory return and it may manifest
like this, but I'm not sure. You could look through the mailing list
for information on it.
-Greg
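
The two heap stats dumps quoted above can be compared mechanically. The following is a rough sketch, not a Ceph or tcmalloc tool: the function names and the regular expression are assumptions based only on the output format shown in this thread, and just the relevant lines of each dump are reproduced.

```python
import re

# Matches lines such as:
#   MALLOC: +  27597135872 (26318.7 MiB) Bytes in page heap freelist
LINE_RE = re.compile(r"MALLOC:\s*[+=]?\s*(\d+)\s*\(\s*[\d.]+\s*MiB\)\s*(.+)$")

def parse_heap_stats(text):
    """Map each labelled byte counter in a heap stats dump to an int."""
    stats = {}
    for line in text.splitlines():
        match = LINE_RE.search(line)
        if match:
            stats[match.group(2).strip()] = int(match.group(1))
    return stats

def freelist_fraction(stats):
    """Share of 'actual memory used' that sits on the page heap freelist."""
    return (stats["Bytes in page heap freelist"]
            / stats["Actual memory used (physical + swap)"])

BEFORE = """\
MALLOC:       32391696 (   30.9 MiB) Bytes in use by application
MALLOC: +  27597135872 (26318.7 MiB) Bytes in page heap freelist
MALLOC: =  27794649240 (26507.0 MiB) Actual memory used (physical + swap)
"""

AFTER = """\
MALLOC:       22919616 (   21.9 MiB) Bytes in use by application
MALLOC: +      4792320 (    4.6 MiB) Bytes in page heap freelist
MALLOC: =    201945240 (  192.6 MiB) Actual memory used (physical + swap)
"""

before = parse_heap_stats(BEFORE)
after = parse_heap_stats(AFTER)
for name, stats in (("before release", before), ("after release", after)):
    print(f"{name}: {freelist_fraction(stats):.1%} of used memory is freelist")
```

Before the heap release nearly all of the memory attributed to the monitor is free pages that tcmalloc is holding, while the application itself uses only about 30 MiB, which matches the observation that memory is not being returned to the OS.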
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com