On 03/04/2013 07:12 PM, Travis Rhoden wrote:
Joao,
Were you able to glean anything useful from the memory dump I provided?
Hey Travis,
Haven't had the chance to look into the dump, but it's still on my stack
to go over as soon as I'm able to get into it.
The mon did eventually crash, presumably from running out of memory
(although I'm failing to find oom_killer log entries; still searching).
Here's what it looked like:
This looks like the result of an ENOSPC error. You should check whether you have
any available space on the disk where the monitor resides. It might
very well be a side-effect of whatever has made your memory consumption
go through the roof. If you look into the store and check what is
consuming the space it would be much appreciated -- my money is on the
versions under 'pgmap/' or 'osdmap/'.
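
For anyone wanting to run that check, here is a minimal sketch (not from the original exchange). It prints the free space on the filesystem holding the store, then the size of each top-level directory in it, largest first. The function names are made up for illustration, and the store path must be supplied by the caller; /var/lib/ceph/mon/ceph-a in the usage note is only an assumed example, so substitute whatever 'mon data' points to on your node.

```python
import os
import shutil


def dir_size(path):
    """Return the total size in bytes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.islink(full):
                continue  # do not count symlink targets
            try:
                total += os.path.getsize(full)
            except OSError:
                # A running monitor may trim files while we walk; skip them.
                pass
    return total


def usage_by_subdir(store_path):
    """Map each top-level subdirectory of store_path to its size in bytes."""
    sizes = {}
    for entry in os.scandir(store_path):
        if entry.is_dir(follow_symlinks=False):
            sizes[entry.name] = dir_size(entry.path)
    return sizes


def report(store_path):
    """Print free space on the store's filesystem and per-directory usage."""
    usage = shutil.disk_usage(store_path)
    print(f"free: {usage.free} of {usage.total} bytes")
    sizes = usage_by_subdir(store_path)
    # Largest directories first, so a runaway 'pgmap' or 'osdmap' stands out.
    for name, size in sorted(sizes.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{size:>15}  {name}/")
```

Calling report("/var/lib/ceph/mon/ceph-a") (assumed path) as a user that can read the store should be enough to tell whether the disk is full and which directory is responsible. Sizes are summed from file lengths, so they can differ slightly from what du reports for on-disk blocks.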
Thanks!
-Joao
2013-03-02 03:42:54.993851 7ff9e9c97700 -1 mon/MonitorStore.cc: In
function 'void MonitorStore::write_bl_ss(ceph::bufferlist&, const
char*, const char*, bool)' thread 7ff9e9c97700 time 2013-03-02
03:42:54.984980
mon/MonitorStore.cc: 382: FAILED assert(!err)
ceph version 0.56.3 (6eb7e15a4783b122e9b0c85ea9ba064145958aa5)
1: (MonitorStore::write_bl_ss(ceph::buffer::list&, char const*, char
const*, bool)+0xe18) [0x528398]
2: (OSDMonitor::update_from_paxos()+0xabb) [0x4b733b]
3: (PaxosService::_active()+0x202) [0x4a2492]
4: (Context::complete(int)+0xa) [0x485a8a]
5: (finish_contexts(CephContext*, std::list<Context*,
std::allocator<Context*> >&, int)+0x11d) [0x4878cd]
6: (Paxos::handle_lease(MMonPaxos*)+0x54f) [0x496a5f]
7: (Paxos::dispatch(PaxosServiceMessage*)+0x28b) [0x49f3ab]
8: (Monitor::_ms_dispatch(Message*)+0xfdf) [0x484b6f]
9: (Monitor::ms_dispatch(Message*)+0x32) [0x4945c2]
10: (DispatchQueue::entry()+0x349) [0x63d009]
11: (DispatchQueue::DispatchThread::entry()+0xd) [0x5d67bd]
12: (()+0x7e9a) [0x7ff9ef082e9a]
13: (clone()+0x6d) [0x7ff9edf23cbd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
Interestingly, an OSD went down at about the same time (again, I suspect
OOM, but can't find evidence of it). Here's the log entry:
2013-03-02 03:42:14.025938 7f32cf88c700 -1 *** Caught signal (Aborted) **
in thread 7f32cf88c700
ceph version 0.56.3 (6eb7e15a4783b122e9b0c85ea9ba064145958aa5)
1: /usr/bin/ceph-osd() [0x78430a]
2: (()+0xfcb0) [0x7f32e1eb2cb0]
3: (gsignal()+0x35) [0x7f32e0871425]
4: (abort()+0x17b) [0x7f32e0874b8b]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f32e11c369d]
6: (()+0xb5846) [0x7f32e11c1846]
7: (()+0xb5873) [0x7f32e11c1873]
8: (()+0xb596e) [0x7f32e11c196e]
9: (operator new[](unsigned long)+0x47e) [0x7f32e1656b1e]
10: (ceph::buffer::create(unsigned int)+0x67) [0x82ec47]
11: (ceph::buffer::ptr::ptr(unsigned int)+0x15) [0x82efb5]
12: (FileStore::read(coll_t, hobject_t const&, unsigned long,
unsigned long, ceph::buffer::list&)+0x1ae) [0x6f814e]
13: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t,
bool)+0x347) [0x6970e7]
14: (PG::chunky_scrub()+0x375) [0x69bee5]
15: (PG::scrub()+0x145) [0x69d265]
16: (OSD::ScrubWQ::_process(PG*)+0xc) [0x63437c]
17: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0x823cf6]
18: (ThreadPool::WorkThread::entry()+0x10) [0x825b20]
19: (()+0x7e9a) [0x7f32e1eaae9a]
20: (clone()+0x6d) [0x7f32e092ecbd]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
- Travis