Re: ceph-mon cpu usage

The ceph-mon process is already using a lot of memory, so I ran heap stats:
------------------------------------------------
MALLOC:       32391696 (   30.9 MiB) Bytes in use by application
MALLOC: +  27597135872 (26318.7 MiB) Bytes in page heap freelist
MALLOC: +     16598552 (   15.8 MiB) Bytes in central cache freelist
MALLOC: +     14693536 (   14.0 MiB) Bytes in transfer cache freelist
MALLOC: +     17441592 (   16.6 MiB) Bytes in thread cache freelists
MALLOC: +    116387992 (  111.0 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =  27794649240 (26507.0 MiB) Actual memory used (physical + swap)
MALLOC: +     26116096 (   24.9 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =  27820765336 (26531.9 MiB) Virtual address space used
MALLOC:
MALLOC:           5683              Spans in use
MALLOC:             21              Thread heaps in use
MALLOC:           8192              Tcmalloc page size
------------------------------------------------

After that I ran a heap release and memory usage went back to normal:
------------------------------------------------
MALLOC:       22919616 (   21.9 MiB) Bytes in use by application
MALLOC: +      4792320 (    4.6 MiB) Bytes in page heap freelist
MALLOC: +     18743448 (   17.9 MiB) Bytes in central cache freelist
MALLOC: +     20645776 (   19.7 MiB) Bytes in transfer cache freelist
MALLOC: +     18456088 (   17.6 MiB) Bytes in thread cache freelists
MALLOC: +    116387992 (  111.0 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =    201945240 (  192.6 MiB) Actual memory used (physical + swap)
MALLOC: +  27618820096 (26339.4 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =  27820765336 (26531.9 MiB) Virtual address space used
MALLOC:
MALLOC:           5639              Spans in use
MALLOC:             29              Thread heaps in use
MALLOC:           8192              Tcmalloc page size
------------------------------------------------

So it seems the monitor is neither returning unused memory to the OS nor reusing the already-allocated memory it considers free...
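
For reference, this is roughly how the numbers above were gathered and the freelist handed back (a sketch; "mon.a" is a placeholder for the actual monitor id):

# print tcmalloc heap statistics for the given monitor
ceph tell mon.a heap stats

# ask tcmalloc to return the page heap freelist to the OS
ceph tell mon.a heap release

Running the release periodically (e.g. from cron) would only paper over the growth, so I'd rather understand why the freelist balloons in the first place.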


On Wed, Jul 22, 2015 at 4:29 PM, Luis Periquito <periquito@xxxxxxxxx> wrote:
This cluster is serving RBD storage for OpenStack, and today all the I/O just stopped.
After looking at the boxes, ceph-mon was using 17G of RAM - and this was on *all* the mons. Restarting the main one made it work again (I restarted the other ones as well because they were using a lot of RAM).
This has happened twice now (first was last Monday).

As this is considered a prod cluster there is no logging enabled, and I can't reproduce the issue - our test/dev clusters have been working fine and show neither symptom, but they were upgraded from firefly.
What can we do to help debug this? Any ideas on how to identify the underlying issue?
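
One thing we could try next time, without restarting anything, is to raise the monitor debug levels at runtime - a sketch, assuming injectargs is acceptable on this cluster (the mon id and levels are just examples):

# temporarily raise mon and messenger debug levels on a running monitor
ceph tell mon.a injectargs '--debug-mon 10 --debug-ms 1'

# drop them back down once the problem has been captured
ceph tell mon.a injectargs '--debug-mon 1 --debug-ms 0'

The extra logging is verbose, so I'd only leave it on long enough to catch one occurrence.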

thanks,

On Mon, Jul 20, 2015 at 1:59 PM, Luis Periquito <periquito@xxxxxxxxx> wrote:
Hi all,

I have a cluster with 28 nodes (all physical, 4 cores, 32GB RAM); each node has 4 OSDs, for a total of 112 OSDs. Each OSD has 106 PGs (counted including replication). There are 3 MONs in this cluster.
I'm running on Ubuntu trusty with kernel 3.13.0-52-generic, with Hammer (0.94.2).

This cluster was installed with Hammer (0.94.1) and has only been upgraded to the latest available version.

Of the three mons, one is mostly idle, one is using ~170% CPU, and one is using ~270% CPU. Which is which changes as I restart the processes (usually the idle one is the one with the lowest uptime).
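
I haven't yet checked whether the busy ones track the leader/peon roles; a quick way to see the current leader (a sketch, using the standard admin command) would be:

# show the current quorum; the leader is listed in the output
ceph quorum_status -f json-pretty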

Running perf top against the ceph-mon PID on the non-idle boxes yields something like this:

  4.62%  libpthread-2.19.so    [.] pthread_mutex_unlock
  3.95%  libpthread-2.19.so    [.] pthread_mutex_lock
  3.91%  libsoftokn3.so        [.] 0x000000000001db26
  2.38%  [kernel]              [k] _raw_spin_lock
  2.09%  libtcmalloc.so.4.1.2  [.] operator new(unsigned long)
  1.79%  ceph-mon              [.] DispatchQueue::enqueue(Message*, int, unsigned long)
  1.62%  ceph-mon              [.] RefCountedObject::get()
  1.58%  libpthread-2.19.so    [.] pthread_mutex_trylock
  1.32%  libtcmalloc.so.4.1.2  [.] operator delete(void*)
  1.24%  libc-2.19.so          [.] 0x0000000000097fd0
  1.20%  ceph-mon              [.] ceph::buffer::ptr::release()
  1.18%  ceph-mon              [.] RefCountedObject::put()
  1.15%  libfreebl3.so         [.] 0x00000000000542a8
  1.05%  [kernel]              [k] update_cfs_shares
  1.00%  [kernel]              [k] tcp_sendmsg
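
For completeness, the profile above was captured roughly like this (a sketch; the exact sampling options don't matter much):

# sample the running monitor and show the hottest symbols
perf top -p $(pidof ceph-mon)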

The cluster is mostly idle and healthy. The store is 69MB, and the MONs are consuming around 700MB of RAM.

Any ideas on this situation? Is it safe to ignore?


_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
