The fix for this should be in 0.93, so this must be something different. Can you
reproduce with "debug osd = 20", "debug ms = 1", "debug filestore = 20" and post
the log to http://tracker.ceph.com/issues/11027? (See the command sketch appended
after this message.)

On Wed, 2015-03-04 at 00:04 +0100, Yann Dupont wrote:
> On 03/03/2015 22:03, Italo Santos wrote:
> >
> > I realised that when the first OSD went down, the cluster was
> > performing a deep-scrub, and I found the trace below in the logs of
> > osd.8. Can anyone help me understand why osd.8, and other OSDs,
> > unexpectedly go down?
> >
>
> I'm afraid I've seen this too this afternoon on my test cluster, just
> after upgrading from 0.87 to 0.93. After an initially successful migration,
> some OSDs started to go down. All presented similar stack traces, with
> the magic word "scrub" in them:
>
> ceph version 0.93 (bebf8e9a830d998eeaab55f86bb256d4360dd3c4)
> 1: /usr/bin/ceph-osd() [0xbeb3dc]
> 2: (()+0xf0a0) [0x7f8f3ca130a0]
> 3: (gsignal()+0x35) [0x7f8f3b37d165]
> 4: (abort()+0x180) [0x7f8f3b3803e0]
> 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f8f3bbd389d]
> 6: (()+0x63996) [0x7f8f3bbd1996]
> 7: (()+0x639c3) [0x7f8f3bbd19c3]
> 8: (()+0x63bee) [0x7f8f3bbd1bee]
> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x220) [0xcd74f0]
> 10: (ReplicatedPG::issue_repop(ReplicatedPG::RepGather*, utime_t)+0x1fc) [0x97259c]
> 11: (ReplicatedPG::simple_repop_submit(ReplicatedPG::RepGather*)+0x7a) [0x97344a]
> 12: (ReplicatedPG::_scrub(ScrubMap&, std::map<hobject_t, std::pair<unsigned int, unsigned int>, std::less<hobject_t>, std::allocator<std::pair<hobject_t const, std::pair<unsigned int, unsigned int> > > > const&)+0x2e4d) [0x9a5ded]
> 13: (PG::scrub_compare_maps()+0x658) [0x916378]
> 14: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x202) [0x917ee2]
> 15: (PG::scrub(ThreadPool::TPHandle&)+0x3a3) [0x919f83]
> 16: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x13) [0x7eff93]
> 17: (ThreadPool::worker(ThreadPool::WorkThread*)+0x629) [0xcc8c49]
> 18: (ThreadPool::WorkThread::entry()+0x10) [0xccac40]
> 19: (()+0x6b50) [0x7f8f3ca0ab50]
> 20: (clone()+0x6d) [0x7f8f3b42695d]
>
> As a temporary measure, noscrub and nodeep-scrub are now set for this
> cluster, and everything is working fine right now.
>
> So there is probably something wrong here. Need to investigate further.
>
> Cheers,

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
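
For reference, a minimal sketch of one way to apply the debug levels requested
above, assuming a host with the cluster admin keyring and using osd.8 from the
thread as an example id; exact paths and ids depend on the local setup:

    # Raise debug levels at runtime on one OSD (sketch; osd.8 is an example id).
    ceph tell osd.8 injectargs '--debug-osd 20 --debug-ms 1 --debug-filestore 20'

    # Or persistently, by adding the settings to the [osd] section of ceph.conf
    # and restarting the OSD:
    #   [osd]
    #       debug osd = 20
    #       debug ms = 1
    #       debug filestore = 20

    # With default settings, the resulting log would typically be found at
    # /var/log/ceph/ceph-osd.8.log, which is what would be attached to
    # http://tracker.ceph.com/issues/11027.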
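
Likewise, a sketch of the temporary workaround mentioned in the quoted message,
i.e. the cluster-wide flags that stop new scrubs and deep-scrubs from being
scheduled, assuming the standard ceph CLI:

    # Disable scheduling of new scrubs / deep-scrubs cluster-wide.
    ceph osd set noscrub
    ceph osd set nodeep-scrub

    # Re-enable them once the underlying bug is fixed.
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub

Note that these flags only prevent new scrubs from being scheduled; a scrub that
is already in progress will generally run to completion.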