Hi Sage,

Does this answer your question?

2013-09-06 09:30:19.813811 7f0ae8cbc700 0 log [INF] : applying configuration change: internal_safe_to_start_threads = 'true'
2013-09-06 09:33:28.303658 7f0ae94bd700 0 log [ERR] : 2.12 osd.7: soid 56987a12/rb.0.17d9b.2ae8944a.000000001e11/head//2 extra attr _, extra attr snapset
2013-09-06 09:33:28.303685 7f0ae94bd700 0 log [ERR] : repair 2.12 56987a12/rb.0.17d9b.2ae8944a.000000001e11/head//2 no 'snapset' attr
2013-09-06 09:34:45.138468 7f0ae94bd700 0 log [ERR] : 2.12 repair stat mismatch, got 2722/2723 objects, 339/339 clones, 11307104768/11311299072 bytes.
2013-09-06 09:34:45.142215 7f0ae94bd700 0 log [ERR] : 2.12 repair 0 missing, 1 inconsistent objects
2013-09-06 09:34:45.206621 7f0ae94bd700 -1 *** Caught signal (Aborted) **

I've just attached the full 'debug_osd 0/10' log to the bug report.

Thanks,

Chris

On Thu, Sep 05, 2013 at 07:38:47PM -0700, Sage Weil wrote:
> Hi Chris,
>
> What is the inconsistency that scrub reports in the log? My guess is that
> the simplest way to resolve this is to remove whichever copy you decide is
> invalid, but it depends on what the inconsistency it is trying/failing to
> repair is.
>
> Thanks!
> sage
>
>
> On Fri, 6 Sep 2013, Chris Dunlop wrote:
>
> > G'day,
> >
> > I'm getting an OSD crash on 0.56.7-1~bpo70+1 whilst trying to repair an OSD:
> >
> > http://tracker.ceph.com/issues/6233
> >
> > ----
> > ceph version 0.56.7 (14f23ab86b0058a8651895b3dc972a29459f3a33)
> > 1: /usr/bin/ceph-osd() [0x8530a2]
> > 2: (()+0xf030) [0x7f541ca39030]
> > 3: (gsignal()+0x35) [0x7f541b132475]
> > 4: (abort()+0x180) [0x7f541b1356f0]
> > 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f541b98789d]
> > 6: (()+0x63996) [0x7f541b985996]
> > 7: (()+0x639c3) [0x7f541b9859c3]
> > 8: (()+0x63bee) [0x7f541b985bee]
> > 9: (ceph::buffer::list::iterator::copy(unsigned int, char*)+0x127) [0x8fa9a7]
> > 10: (object_info_t::decode(ceph::buffer::list::iterator&)+0x29) [0x95b579]
> > 11: (object_info_t::object_info_t(ceph::buffer::list&)+0x180) [0x695ec0]
> > 12: (PG::repair_object(hobject_t const&, ScrubMap::object*, int, int)+0xc7) [0x7646b7]
> > 13: (PG::scrub_process_inconsistent()+0x9bd) [0x76534d]
> > 14: (PG::scrub_finish()+0x4f) [0x76587f]
> > 15: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x10d6) [0x76cb96]
> > 16: (PG::scrub(ThreadPool::TPHandle&)+0x138) [0x76d7e8]
> > 17: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0xf) [0x70515f]
> > 18: (ThreadPool::worker(ThreadPool::WorkThread*)+0x992) [0x8f0542]
> > 19: (ThreadPool::WorkThread::entry()+0x10) [0x8f14d0]
> > 20: (()+0x6b50) [0x7f541ca30b50]
> > 21: (clone()+0x6d) [0x7f541b1daa7d]
> > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > ----
> >
> > This occurs as a result of:
> >
> > # ceph pg dump | grep inconsistent
> > 2.12  2723  0  0  0  11311299072  159189  159189  active+clean+inconsistent  2013-09-06 09:35:47.512119  20117'690441  20120'7914185  [6,7]  [6,7]  20021'675967  2013-09-03 15:58:12.459188  19384'665404  2013-08-28 12:42:07.490877
> > # ceph pg repair 2.12
> >
> > Looking at PG::repair_object per line 12 of the backtrace, I can see a
> > dout(10) which should tell me the problem object:
> >
> > ----
> > src/osd/PG.cc:
> > void PG::repair_object(const hobject_t& soid, ScrubMap::object *po, int bad_peer, int ok_peer)
> > {
> >   dout(10) << "repair_object " << soid << " bad_peer osd." << bad_peer << " ok_peer osd." << ok_peer << dendl;
> >   ...
> > }
> > ----
> >
> > The 'ceph pg dump' output above tells me the primary osd is '6', so I
> > can increase the logging level to 10 on osd.6 to get the debug output,
> > and repair again:
> >
> > # ceph osd tell 6 injectargs '--debug_osd 0/10'
> > # ceph pg repair 2.12
> >
> > I get the same OSD crash, but this time it logs the dout from above,
> > which shows the problem object:
> >
> >     -1> 2013-09-06 09:34:45.142224 7f0ae94bd700 10 osd.6 pg_epoch: 20117 pg[2.12( v 20117'690441 (20117'689440,20117'690441] local-les=20115 n=2722 ec=1 les/c 20115/20115 20108/20112/20112) [6,7] r=0 lpr=20112 mlcod 20117'690440 active+scrubbing+deep+repair] repair_object 56987a12/rb.0.17d9b.2ae8944a.000000001e11/head//2 bad_peer osd.7 ok_peer osd.6
> >      0> 2013-09-06 09:34:45.206621 7f0ae94bd700 -1 *** Caught signal (Aborted) **
> >
> > So...
> >
> > Firstly, is anyone interested in further investigating the problem to
> > fix the crash behaviour?
> >
> > And, what's the best way to fix the pool?
> >
> > Cheers,
> >
> > Chris
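
Reading the backtrace: frames 9-12 show the abort happening while PG::repair_object() rebuilds an object_info_t from the object's "_" attribute, and frames 5-8 are the C++ terminate path, so this looks like an uncaught ceph::buffer::end_of_buffer escaping the scrub worker thread, presumably because the "_" attribute being decoded comes back empty or truncated for this object. Below is a minimal, self-contained sketch of that failure mode under that assumption; the names (end_of_buffer, fake_object_info, repair_object) are stand-ins for illustration, not Ceph's real types:

----
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <thread>
#include <vector>

// Stand-in for ceph::buffer::end_of_buffer.
struct end_of_buffer : std::runtime_error {
  end_of_buffer() : std::runtime_error("buffer::end_of_buffer") {}
};

// Stand-in for object_info_t::decode(): expects a fixed-size payload.
struct fake_object_info {
  std::uint64_t version = 0;
  void decode(const std::vector<char>& attr) {
    if (attr.size() < sizeof(version))  // the "_" attr is empty or short...
      throw end_of_buffer();            // ...so decode throws
    std::memcpy(&version, attr.data(), sizeof(version));
  }
};

// Stand-in for the repair path: nothing catches the decode exception.
void repair_object(const std::vector<char>& underscore_attr) {
  fake_object_info oi;
  oi.decode(underscore_attr);
}

int main() {
  // An empty "_" attribute, like the inconsistency scrub is complaining about.
  std::vector<char> missing_attr;

  // The exception escapes the worker thread, so the runtime calls
  // std::terminate() and the process receives SIGABRT, matching frames 5-8
  // and the "Caught signal (Aborted)" line in the log.
  std::thread scrub_worker(repair_object, missing_attr);
  scrub_worker.join();
  return 0;
}
----

If that reading is right, checking that the attribute is actually present (or catching the decode exception) in the repair path would turn this into an ordinary logged scrub error instead of an OSD abort, which seems like the answer to the "fix the crash behaviour" half of the question.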
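
As for the best way to fix the pool: following Sage's suggestion to remove whichever copy is invalid, it may help to compare the on-disk attributes of the object on osd.6 and osd.7 first. Assuming the 0.56 FileStore keeps the object-info ("_") and "snapset" attributes as user xattrs prefixed with "user.ceph." on the object file (an assumption about the on-disk format; large attributes can also be chained or spilled elsewhere, so this only gives a hint), a small standalone helper to list those xattrs could look like the sketch below. The helper is hypothetical, not a Ceph tool:

----
#include <sys/xattr.h>

#include <cstdio>
#include <string>
#include <vector>

int main(int argc, char** argv) {
  if (argc != 2) {
    std::fprintf(stderr, "usage: %s <object-file>\n", argv[0]);
    return 1;
  }
  const char* path = argv[1];

  // First call with a null buffer to learn how much space the name list needs.
  ssize_t len = listxattr(path, nullptr, 0);
  if (len < 0) { std::perror("listxattr"); return 1; }

  std::vector<char> names(len);
  len = listxattr(path, names.data(), names.size());
  if (len < 0) { std::perror("listxattr"); return 1; }

  // The buffer holds a sequence of NUL-terminated attribute names.
  for (const char* p = names.data(); p < names.data() + len; p += std::string(p).size() + 1) {
    std::string name(p);
    if (name.compare(0, 10, "user.ceph.") != 0)
      continue;  // only the Ceph-owned attributes are of interest here

    // Report the attribute's size; a missing or zero-length "_"/"snapset"
    // on one replica is the kind of difference worth looking for.
    ssize_t vlen = getxattr(path, name.c_str(), nullptr, 0);
    if (vlen < 0) { std::perror(name.c_str()); continue; }
    std::printf("%s: %zd bytes\n", name.c_str(), vlen);
  }
  return 0;
}
----

Running that against the rb.0.17d9b.2ae8944a.000000001e11 object file in each OSD's current/2.12_head/ directory (typically under /var/lib/ceph/osd/ceph-<id>/) and comparing the two replicas should make it clearer which copy carries the bogus or missing attributes before either copy is removed.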