Re: question on mon memory usage

Joao Eduardo Luis <joao.luis@xxxxxxxxxxx> · Mon, 04 Mar 2013 19:23:32 +0000

On 03/04/2013 07:12 PM, Travis Rhoden wrote:
Joao,

Were you able to glean anything useful from the memory dump I provided?

Hey Travis,

I haven't had a chance to look into the dump yet, but it's still on my
stack and I'll go over it as soon as I can.

The mon did eventually crash, presumably from running out of memory
(although I'm failing to find oom_killer log entries; still searching).
Here's what it looked like:

This looks like the result of an ENOSPC.  You should check whether there
is any space left on the disk where the monitor's store resides.  It
might very well be a side effect of whatever made your memory
consumption go through the roof.  If you could look into the store and
check what is consuming the space, it would be much appreciated -- my
money is on the versions under 'pgmap/' or 'osdmap/'.
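
If it helps, the check described above could be scripted roughly as
follows.  This is only a minimal sketch: it reports the free space on
the filesystem holding the store and sums file sizes per top-level
directory, so that 'pgmap/' and 'osdmap/' can be compared.  The
function names are made up for illustration, and the store path is
whatever your monitor's data directory is; nothing here uses Ceph
itself.

```python
import os
import shutil
from collections import Counter


def free_bytes(path):
    """Free bytes on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free


def usage_by_subdir(store_path):
    """Sum file sizes under each top-level directory of the store.

    Files sitting directly in `store_path` are counted under
    "(top level)".
    """
    totals = Counter()
    for root, _dirs, files in os.walk(store_path):
        rel = os.path.relpath(root, store_path)
        top = "(top level)" if rel == "." else rel.split(os.sep)[0]
        for name in files:
            try:
                totals[top] += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                # The file vanished or is unreadable; skip it.
                pass
    return totals


def report(store_path):
    """Free space first, then the directories, largest first."""
    lines = ["free space: %d bytes" % free_bytes(store_path)]
    for name, size in usage_by_subdir(store_path).most_common():
        lines.append("%14d  %s" % (size, name))
    return "\n".join(lines)
```

Calling print(report(path)) with the monitor's data directory as path
(for instance something like /var/lib/ceph/mon/ceph-a, which is an
assumption about your layout) should show at a glance whether 'pgmap'
or 'osdmap' dominates.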

Thanks!
  -Joao

2013-03-02 03:42:54.993851 7ff9e9c97700 -1 mon/MonitorStore.cc: In
function 'void MonitorStore::write_bl_ss(ceph::bufferlist&, const
char*, const char*, bool)' thread 7ff9e9c97700 time 2013-03-02
03:42:54.984980
mon/MonitorStore.cc: 382: FAILED assert(!err)

  ceph version 0.56.3 (6eb7e15a4783b122e9b0c85ea9ba064145958aa5)
  1: (MonitorStore::write_bl_ss(ceph::buffer::list&, char const*, char
const*, bool)+0xe18) [0x528398]
  2: (OSDMonitor::update_from_paxos()+0xabb) [0x4b733b]
  3: (PaxosService::_active()+0x202) [0x4a2492]
  4: (Context::complete(int)+0xa) [0x485a8a]
  5: (finish_contexts(CephContext*, std::list<Context*,
std::allocator<Context*> >&, int)+0x11d) [0x4878cd]
  6: (Paxos::handle_lease(MMonPaxos*)+0x54f) [0x496a5f]
  7: (Paxos::dispatch(PaxosServiceMessage*)+0x28b) [0x49f3ab]
  8: (Monitor::_ms_dispatch(Message*)+0xfdf) [0x484b6f]
  9: (Monitor::ms_dispatch(Message*)+0x32) [0x4945c2]
  10: (DispatchQueue::entry()+0x349) [0x63d009]
  11: (DispatchQueue::DispatchThread::entry()+0xd) [0x5d67bd]
  12: (()+0x7e9a) [0x7ff9ef082e9a]
  13: (clone()+0x6d) [0x7ff9edf23cbd]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.

Interestingly, an OSD went down at about the same time (again, I suspect
OOM, but can't find evidence of it).  Here's the log entry:

2013-03-02 03:42:14.025938 7f32cf88c700 -1 *** Caught signal (Aborted) **
  in thread 7f32cf88c700

  ceph version 0.56.3 (6eb7e15a4783b122e9b0c85ea9ba064145958aa5)
  1: /usr/bin/ceph-osd() [0x78430a]
  2: (()+0xfcb0) [0x7f32e1eb2cb0]
  3: (gsignal()+0x35) [0x7f32e0871425]
  4: (abort()+0x17b) [0x7f32e0874b8b]
  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f32e11c369d]
  6: (()+0xb5846) [0x7f32e11c1846]
  7: (()+0xb5873) [0x7f32e11c1873]
  8: (()+0xb596e) [0x7f32e11c196e]
  9: (operator new[](unsigned long)+0x47e) [0x7f32e1656b1e]
  10: (ceph::buffer::create(unsigned int)+0x67) [0x82ec47]
  11: (ceph::buffer::ptr::ptr(unsigned int)+0x15) [0x82efb5]
  12: (FileStore::read(coll_t, hobject_t const&, unsigned long,
unsigned long, ceph::buffer::list&)+0x1ae) [0x6f814e]
  13: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t,
bool)+0x347) [0x6970e7]
  14: (PG::chunky_scrub()+0x375) [0x69bee5]
  15: (PG::scrub()+0x145) [0x69d265]
  16: (OSD::ScrubWQ::_process(PG*)+0xc) [0x63437c]
  17: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0x823cf6]
  18: (ThreadPool::WorkThread::entry()+0x10) [0x825b20]
  19: (()+0x7e9a) [0x7f32e1eaae9a]
  20: (clone()+0x6d) [0x7f32e092ecbd]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.

  - Travis
