Re: ceph-mon cpu usage


 



This cluster is serving RBD storage for OpenStack, and today all I/O just stopped.
After looking at the boxes, ceph-mon was using 17 GB of RAM, and this was on *all* the mons. Restarting the main one made things work again (I restarted the other ones as well, because they were also using a lot of RAM).
This has happened twice now (the first time was last Monday).

As this is considered a production cluster, no logging is enabled, and I can't reproduce the problem: our test/dev clusters have been working fine and show neither symptom, though they were upgraded from Firefly.
What can we do to help debug this? Any ideas on how to identify the underlying issue?
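One low-impact approach (a sketch; the mon ID "a" and the default admin-socket layout are assumptions, adjust to your deployment) would be to raise mon debug levels at runtime and grab tcmalloc heap statistics the next time memory starts climbing, so no permanent logging has to be enabled:

```shell
# Raise monitor verbosity on the fly (revert later with lower values; no restart needed)
ceph tell mon.a injectargs '--debug-mon 10 --debug-ms 1'

# Ask tcmalloc for heap statistics from the running mon
ceph tell mon.a heap stats

# Dump the mon's internal perf counters via its admin socket (run on the mon host)
ceph daemon mon.a perf dump
```

Comparing `heap stats` output between a freshly restarted mon and one that has grown to 17 GB should show whether the memory is live allocations or heap the allocator hasn't returned to the OS.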

thanks,

On Mon, Jul 20, 2015 at 1:59 PM, Luis Periquito <periquito@xxxxxxxxx> wrote:
Hi all,

I have a cluster with 28 nodes (all physical, 4Cores, 32GB Ram), each node has 4 OSDs for a total of 112 OSDs. Each OSD has 106 PGs (counted including replication). There are 3 MONs on this cluster.
I'm running on Ubuntu trusty with kernel 3.13.0-52-generic, with Hammer (0.94.2).

This cluster was installed with Hammer (0.94.1) and has only been upgraded to the latest available version.

Of the three mons, one is mostly idle, one is using ~170% CPU, and one is using ~270% CPU. Which is which changes as I restart the processes (usually the idle one is the one with the lowest uptime).

Running perf top against the ceph-mon PID on the non-idle boxes yields something like this:

  4.62%  libpthread-2.19.so    [.] pthread_mutex_unlock
  3.95%  libpthread-2.19.so    [.] pthread_mutex_lock
  3.91%  libsoftokn3.so        [.] 0x000000000001db26
  2.38%  [kernel]              [k] _raw_spin_lock
  2.09%  libtcmalloc.so.4.1.2  [.] operator new(unsigned long)
  1.79%  ceph-mon              [.] DispatchQueue::enqueue(Message*, int, unsigned long)
  1.62%  ceph-mon              [.] RefCountedObject::get()
  1.58%  libpthread-2.19.so    [.] pthread_mutex_trylock
  1.32%  libtcmalloc.so.4.1.2  [.] operator delete(void*)
  1.24%  libc-2.19.so          [.] 0x0000000000097fd0
  1.20%  ceph-mon              [.] ceph::buffer::ptr::release()
  1.18%  ceph-mon              [.] RefCountedObject::put()
  1.15%  libfreebl3.so         [.] 0x00000000000542a8
  1.05%  [kernel]              [k] update_cfs_shares
  1.00%  [kernel]              [k] tcp_sendmsg

The cluster is mostly idle and healthy. The store is 69 MB, and the MONs are consuming around 700 MB of RAM.
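For anyone wanting to compare on their own cluster, the figures above can be pulled with something like the following (the mon ID "a" and the default data path are assumptions for a standard-layout install):

```shell
# On-disk size of the monitor store (leveldb, default data path assumed)
du -sh /var/lib/ceph/mon/ceph-a/store.db

# Quorum state, rank, and which mon is the leader (run on the mon host)
ceph daemon mon.a mon_status

# Resident memory (in KB) of the running ceph-mon process
ps -o rss= -p "$(pidof ceph-mon)"
```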

Any ideas on this situation? Is it safe to ignore?

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
