On 03/03/2015 22:03, Italo Santos wrote:
I realised that when the first OSD went down, the cluster was
performing a deep-scrub, and I found the below trace in the logs of
osd.8. Can anyone help me understand why osd.8, and other OSDs,
unexpectedly go down?
I'm afraid I saw this too this afternoon on my test cluster, just
after upgrading from 0.87 to 0.93. After an initially successful
migration, some OSDs started to go down. All presented similar stack
traces, with the magic word "scrub" in them:
ceph version 0.93 (bebf8e9a830d998eeaab55f86bb256d4360dd3c4)
1: /usr/bin/ceph-osd() [0xbeb3dc]
2: (()+0xf0a0) [0x7f8f3ca130a0]
3: (gsignal()+0x35) [0x7f8f3b37d165]
4: (abort()+0x180) [0x7f8f3b3803e0]
5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f8f3bbd389d]
6: (()+0x63996) [0x7f8f3bbd1996]
7: (()+0x639c3) [0x7f8f3bbd19c3]
8: (()+0x63bee) [0x7f8f3bbd1bee]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x220) [0xcd74f0]
10: (ReplicatedPG::issue_repop(ReplicatedPG::RepGather*,
utime_t)+0x1fc) [0x97259c]
11: (ReplicatedPG::simple_repop_submit(ReplicatedPG::RepGather*)+0x7a)
[0x97344a]
12: (ReplicatedPG::_scrub(ScrubMap&, std::map<hobject_t,
std::pair<unsigned int, unsigned int>, std::less<hobject_t>,
std::allocator<std::pair<hobject_t const, std::pair<unsigned int, unsigned int> > > > const&)+0x2e4d) [0x9a5ded]
13: (PG::scrub_compare_maps()+0x658) [0x916378]
14: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x202) [0x917ee2]
15: (PG::scrub(ThreadPool::TPHandle&)+0x3a3) [0x919f83]
16: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x13) [0x7eff93]
17: (ThreadPool::worker(ThreadPool::WorkThread*)+0x629) [0xcc8c49]
18: (ThreadPool::WorkThread::entry()+0x10) [0xccac40]
19: (()+0x6b50) [0x7f8f3ca0ab50]
20: (clone()+0x6d) [0x7f8f3b42695d]
As a temporary measure, the noscrub and nodeep-scrub flags are now set
on this cluster (commands below), and all is working fine right now.
So there is probably something wrong here. Need to investigate further.
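In case it helps anyone hitting the same assert, these are just the usual
cluster-wide OSD flag commands (run from any node with admin access; adjust
for your own keyring/mon setup):

    # stop all scrubbing until the crash is understood
    ceph osd set noscrub
    ceph osd set nodeep-scrub

    # later, once a fix is in place, re-enable scrubbing
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub

Keep in mind this only defers scrubbing; the PGs will still want to scrub
once the flags are cleared.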
Cheers,