The fix for this should be in 0.93, so this must be something different. Can you
reproduce with "debug osd = 20", "debug ms = 1", "debug filestore = 20" and post
the log to http://tracker.ceph.com/issues/11027? (See the command sketch appended
after this message.)

On Wed, 2015-03-04 at 00:04 +0100, Yann Dupont wrote:
> On 03/03/2015 22:03, Italo Santos wrote:
> >
> > I realised that when the first OSD went down, the cluster was
> > performing a deep-scrub, and I found the trace below in the logs of
> > osd.8. Can anyone help me understand why osd.8, and other OSDs,
> > unexpectedly go down?
> >
>
> I'm afraid I've seen this too this afternoon on my test cluster, just
> after upgrading from 0.87 to 0.93. After an initially successful migration,
> some OSDs started to go down. All presented similar stack traces, with
> the magic word "scrub" in them:
>
> ceph version 0.93 (bebf8e9a830d998eeaab55f86bb256d4360dd3c4)
> 1: /usr/bin/ceph-osd() [0xbeb3dc]
> 2: (()+0xf0a0) [0x7f8f3ca130a0]
> 3: (gsignal()+0x35) [0x7f8f3b37d165]
> 4: (abort()+0x180) [0x7f8f3b3803e0]
> 5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f8f3bbd389d]
> 6: (()+0x63996) [0x7f8f3bbd1996]
> 7: (()+0x639c3) [0x7f8f3bbd19c3]
> 8: (()+0x63bee) [0x7f8f3bbd1bee]
> 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x220) [0xcd74f0]
> 10: (ReplicatedPG::issue_repop(ReplicatedPG::RepGather*, utime_t)+0x1fc) [0x97259c]
> 11: (ReplicatedPG::simple_repop_submit(ReplicatedPG::RepGather*)+0x7a) [0x97344a]
> 12: (ReplicatedPG::_scrub(ScrubMap&, std::map<hobject_t, std::pair<unsigned int, unsigned int>, std::less<hobject_t>, std::allocator<std::pair<hobject_t const, std::pair<unsigned int, unsigned int> > > > const&)+0x2e4d) [0x9a5ded]
> 13: (PG::scrub_compare_maps()+0x658) [0x916378]
> 14: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x202) [0x917ee2]
> 15: (PG::scrub(ThreadPool::TPHandle&)+0x3a3) [0x919f83]
> 16: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x13) [0x7eff93]
> 17: (ThreadPool::worker(ThreadPool::WorkThread*)+0x629) [0xcc8c49]
> 18: (ThreadPool::WorkThread::entry()+0x10) [0xccac40]
> 19: (()+0x6b50) [0x7f8f3ca0ab50]
> 20: (clone()+0x6d) [0x7f8f3b42695d]
>
> As a temporary measure, noscrub and nodeep-scrub are now set for this
> cluster, and everything is working fine right now.
>
> So there is probably something wrong here. Need to investigate further.
>
> Cheers,

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
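
For reference, a minimal sketch of one way to apply the debug levels requested
above, assuming a host with the cluster admin keyring and using osd.8 from the
thread as an example id; exact paths and ids depend on the local setup:

    # Raise debug levels at runtime on one OSD (sketch; osd.8 is an example id).
    ceph tell osd.8 injectargs '--debug-osd 20 --debug-ms 1 --debug-filestore 20'

    # Or persistently, by adding the settings to the [osd] section of ceph.conf
    # and restarting the OSD:
    #   [osd]
    #       debug osd = 20
    #       debug ms = 1
    #       debug filestore = 20

    # With default settings, the resulting log would typically be found at
    # /var/log/ceph/ceph-osd.8.log, which is what would be attached to
    # http://tracker.ceph.com/issues/11027.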
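
Likewise, a sketch of the temporary workaround mentioned in the quoted message,
i.e. the cluster-wide flags that stop new scrubs and deep-scrubs from being
scheduled, assuming the standard ceph CLI:

    # Disable scheduling of new scrubs / deep-scrubs cluster-wide.
    ceph osd set noscrub
    ceph osd set nodeep-scrub

    # Re-enable them once the underlying bug is fixed.
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub

Note that these flags only prevent new scrubs from being scheduled; a scrub that
is already in progress will generally run to completion.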