G'day,

I'm getting an OSD crash on 0.56.7-1~bpo70+1 whilst trying to repair
an inconsistent PG: http://tracker.ceph.com/issues/6233

----
ceph version 0.56.7 (14f23ab86b0058a8651895b3dc972a29459f3a33)
 1: /usr/bin/ceph-osd() [0x8530a2]
 2: (()+0xf030) [0x7f541ca39030]
 3: (gsignal()+0x35) [0x7f541b132475]
 4: (abort()+0x180) [0x7f541b1356f0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f541b98789d]
 6: (()+0x63996) [0x7f541b985996]
 7: (()+0x639c3) [0x7f541b9859c3]
 8: (()+0x63bee) [0x7f541b985bee]
 9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x127) [0x8fa9a7]
 10: (object_info_t::decode(ceph::buffer::list::iterator&)+0x29) [0x95b579]
 11: (object_info_t::object_info_t(ceph::buffer::list&)+0x180) [0x695ec0]
 12: (PG::repair_object(hobject_t const&, ScrubMap::object*, int, int)+0xc7) [0x7646b7]
 13: (PG::scrub_process_inconsistent()+0x9bd) [0x76534d]
 14: (PG::scrub_finish()+0x4f) [0x76587f]
 15: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x10d6) [0x76cb96]
 16: (PG::scrub(ThreadPool::TPHandle&)+0x138) [0x76d7e8]
 17: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0xf) [0x70515f]
 18: (ThreadPool::worker(ThreadPool::WorkThread*)+0x992) [0x8f0542]
 19: (ThreadPool::WorkThread::entry()+0x10) [0x8f14d0]
 20: (()+0x6b50) [0x7f541ca30b50]
 21: (clone()+0x6d) [0x7f541b1daa7d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is
 needed to interpret this.
----

This occurs as a result of:

# ceph pg dump | grep inconsistent
2.12  2723  0  0  0  11311299072  159189  159189  active+clean+inconsistent  2013-09-06 09:35:47.512119  20117'690441  20120'7914185  [6,7]  [6,7]  20021'675967  2013-09-03 15:58:12.459188  19384'665404  2013-08-28 12:42:07.490877
# ceph pg repair 2.12

Looking at PG::repair_object (frame 12 of the backtrace), I can see a
dout(10) which should tell me the problem object:

----
src/osd/PG.cc:

void PG::repair_object(const hobject_t& soid, ScrubMap::object *po,
                       int bad_peer, int ok_peer)
{
  dout(10) << "repair_object " << soid
           << " bad_peer osd." << bad_peer
           << " ok_peer osd." << ok_peer << dendl;
  ...
}
----

The 'ceph pg dump' output above tells me the primary OSD is 6, so I
can raise its in-memory debug level to 10 (so the dout is captured in
the recent-events dump when it crashes) and run the repair again:

# ceph osd tell 6 injectargs '--debug_osd 0/10'
# ceph pg repair 2.12

I get the same OSD crash, but this time the crash dump includes the
dout() from above, which identifies the problem object:

    -1> 2013-09-06 09:34:45.142224 7f0ae94bd700 10 osd.6 pg_epoch: 20117 pg[2.12( v 20117'690441 (20117'689440,20117'690441] local-les=20115 n=2722 ec=1 les/c 20115/20115 20108/20112/20112) [6,7] r=0 lpr=20112 mlcod 20117'690440 active+scrubbing+deep+repair] repair_object 56987a12/rb.0.17d9b.2ae8944a.000000001e11/head//2 bad_peer osd.7 ok_peer osd.6
     0> 2013-09-06 09:34:45.206621 7f0ae94bd700 -1 *** Caught signal (Aborted) **

Frames 9-11 of the backtrace suggest object_info_t::decode is running
off the end of a buffer -- presumably the object_info attribute on the
bad copy (osd.7) is truncated or corrupt, and repair_object aborts
rather than handling the failed decode.

So, firstly: is anyone interested in investigating this further so the
crash behaviour can be fixed? And secondly: what's the best way to fix
the pool? (My best guess at a manual fix is in the P.S. below --
corrections very welcome.)

Cheers,
Chris
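
P.S. For anyone else who hits this: given frames 9-11 above, the first
thing I intend to do is inspect the object_info attribute of the bad
copy directly on osd.7's filestore. The commands below are only a
sketch, assuming I've understood the on-disk layout correctly -- the
path assumes the default filestore location under /var/lib/ceph, and
'2.12_head' may contain hashed DIR_* subdirectories, hence the find:

----
# stop the OSD holding the suspect copy before touching its store
service ceph stop osd.7

# locate the on-disk file for the object reported by repair_object
find /var/lib/ceph/osd/ceph-7/current/2.12_head/ \
    -name 'rb.0.17d9b.2ae8944a.000000001e11*'

# dump its extended attributes; as I understand it the object_info is
# stored in the 'user.ceph._' xattr, so a short or garbled value there
# would line up with the decode running off the end of the buffer
getfattr -d -e hex <path-to-object-file>
----

Running the same getfattr against the good copy on osd.6 should make
any difference in the 'user.ceph._' value obvious.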
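
If that attribute does turn out to be truncated, my best guess at
fixing the pool -- and this is an assumption on my part, so corrections
are very welcome -- is to move the bad copy out of the way so a
subsequent repair can restore it from the good copy on osd.6, keeping
the file for post-mortem:

----
# move the bad replica aside rather than deleting it outright
mkdir -p /root/bad-objects
mv <path-to-object-file> /root/bad-objects/

# bring the OSD back up and re-run the repair
service ceph start osd.7
ceph pg repair 2.12
----

I'd appreciate a sanity check on that before I run it in anger.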