New issue created - http://tracker.ceph.com/issues/11027
On Tuesday, March 3, 2015 at 9:23 PM, Loic Dachary wrote:
Hi Yann,

That seems related to http://tracker.ceph.com/issues/10536 which seems to be resolved. Could you create a new issue with a link to 10536? More logs and a ceph report would also be useful to figure out why it resurfaced.

Thanks!

On 04/03/2015 00:04, Yann Dupont wrote:
> On 03/03/2015 22:03, Italo Santos wrote:
>> I realised that when the first OSD went down, the cluster was performing a deep-scrub, and I found the trace below in the logs of osd.8. Can anyone help me understand why osd.8, and other OSDs, unexpectedly go down?
>
> I'm afraid I've seen this this afternoon too on my test cluster, just after upgrading from 0.87 to 0.93. After an initially successful migration, some OSDs started to go down. All presented similar stack traces, with the magic word "scrub" in them:
>
> ceph version 0.93 (bebf8e9a830d998eeaab55f86bb256d4360dd3c4)
>  1: /usr/bin/ceph-osd() [0xbeb3dc]
>  2: (()+0xf0a0) [0x7f8f3ca130a0]
>  3: (gsignal()+0x35) [0x7f8f3b37d165]
>  4: (abort()+0x180) [0x7f8f3b3803e0]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f8f3bbd389d]
>  6: (()+0x63996) [0x7f8f3bbd1996]
>  7: (()+0x639c3) [0x7f8f3bbd19c3]
>  8: (()+0x63bee) [0x7f8f3bbd1bee]
>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x220) [0xcd74f0]
>  10: (ReplicatedPG::issue_repop(ReplicatedPG::RepGather*, utime_t)+0x1fc) [0x97259c]
>  11: (ReplicatedPG::simple_repop_submit(ReplicatedPG::RepGather*)+0x7a) [0x97344a]
>  12: (ReplicatedPG::_scrub(ScrubMap&, std::map<hobject_t, std::pair<unsigned int, unsigned int>, std::less<hobject_t>, std::allocator<std::pair<hobject_t const, std::pair<unsigned int, unsigned int> > > > const&)+0x2e4d) [0x9a5ded]
>  13: (PG::scrub_compare_maps()+0x658) [0x916378]
>  14: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x202) [0x917ee2]
>  15: (PG::scrub(ThreadPool::TPHandle&)+0x3a3) [0x919f83]
>  16: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x13) [0x7eff93]
>  17: (ThreadPool::worker(ThreadPool::WorkThread*)+0x629) [0xcc8c49]
>  18: (ThreadPool::WorkThread::entry()+0x10) [0xccac40]
>  19: (()+0x6b50) [0x7f8f3ca0ab50]
>  20: (clone()+0x6d) [0x7f8f3b42695d]
>
> As a temporary measure, noscrub and nodeep-scrub are now set for this cluster, and all is working fine right now. So there is probably something wrong here. Need to investigate further.
>
> Cheers,

--
Loïc Dachary, Artisan Logiciel Libre
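For anyone hitting the same crash before a fix lands: the noscrub/nodeep-scrub workaround Yann describes is set with the standard cluster-wide OSD flag commands, and the cluster report Loic asks for comes from the ceph CLI as well. A minimal sketch, assuming a reasonably recent ceph CLI and admin credentials on the node; the output file name is only an example:

    # temporarily disable scrubbing while the bug is investigated
    ceph osd set noscrub
    ceph osd set nodeep-scrub

    # dump a cluster report to attach to the tracker issue
    ceph report > ceph-report.json

    # re-enable scrubbing once a fixed build is running
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub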