Hi Sam,

Updated with some more info.

> -----Original Message-----
> From: ceph-users [mailto:ceph-users-bounces@xxxxxxxxxxxxxx] On Behalf Of Samuel Just
> Sent: 17 November 2016 19:02
> To: Nick Fisk <nick@xxxxxxxxxx>
> Cc: Ceph Users <ceph-users@xxxxxxxxxxxxxx>
> Subject: Re: After OSD Flap - FAILED assert(oi.version == i->first)
>
> Puzzling, added a question to the ticket.
> -Sam
>
> On Thu, Nov 17, 2016 at 4:32 AM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> > Hi Sam,
> >
> > I've updated the ticket with logs from the wip run.
> >
> > Nick
> >
> >> -----Original Message-----
> >> From: Samuel Just [mailto:sjust@xxxxxxxxxx]
> >> Sent: 15 November 2016 18:30
> >> To: Nick Fisk <nick@xxxxxxxxxx>
> >> Cc: Ceph Users <ceph-users@xxxxxxxxxxxxxx>
> >> Subject: Re: After OSD Flap - FAILED assert(oi.version == i->first)
> >>
> >> http://tracker.ceph.com/issues/17916
> >>
> >> I just pushed a branch wip-17916-jewel based on v10.2.3 with some
> >> additional debugging. Once it builds, would you be able to start the
> >> afflicted osds with that version of ceph-osd and
> >>
> >> debug osd = 20
> >> debug ms = 1
> >> debug filestore = 20
> >>
> >> and get me the log?
> >> -Sam
> >>
> >> On Tue, Nov 15, 2016 at 2:06 AM, Nick Fisk <nick@xxxxxxxxxx> wrote:
> >> > Hi,
> >> >
> >> > I have two OSDs which are failing with an assert which looks
> >> > related to missing objects. This happened after a large RBD
> >> > snapshot was deleted, causing several OSDs to start flapping as
> >> > they experienced high load. The cluster is fully recovered and I
> >> > don't need any help from a recovery perspective. I'm happy to zap
> >> > and recreate the OSDs, which I will probably do in a couple of
> >> > days' time. Or if anybody looks at the error and sees an easy way
> >> > to get the OSD to start up, then bonus!!!
> >> >
> >> > However, I thought I would post in case there is any interest in
> >> > trying to diagnose why this happened.
> >> > There were no power or networking issues and no hard reboots,
> >> > so this is purely contained within the Ceph OSD process.
> >> >
> >> > The objects that it claims are missing are from the RBD that had
> >> > the snapshot deleted. I'm guessing that the last command before the
> >> > OSD died at some point was to delete those two objects, which did
> >> > actually happen, but for some reason the OSD died before it got
> >> > confirmation??? And now it's trying to delete them, but they don't
> >> > exist.
> >> >
> >> > I have the full debug 20 log, but pretty much all the lines above
> >> > the snippet below just have it deleting thousands of objects
> >> > without any problems.
> >> >
> >> > Nick
> >> >
> >> > -4> 2016-11-15 09:46:52.061643 7f728f9368c0 20 read_log 6 divergent_priors
> >> > -3> 2016-11-15 09:46:52.061779 7f728f9368c0 10 read_log checking for missing items over interval (0'0,1607344'260104]
> >> > -2> 2016-11-15 09:46:52.069987 7f728f9368c0 15 read_log missing 1553246'255377,1:96e51ad6:::rbd_data.6fd18238e1f29.00000000002555c5:head
> >> > -1> 2016-11-15 09:46:52.070007 7f728f9368c0 15 read_log missing 1553190'255366,1:96e51ad6:::rbd_data.6fd18238e1f29.00000000002555c5:6c
> >> > 0> 2016-11-15 09:46:52.071471 7f728f9368c0 -1 osd/PGLog.cc: In function 'static void PGLog::read_log(ObjectStore*, coll_t, coll_t, ghobject_t, const pg_info_t&, std::map<eversion_t, hobject_t>&, PGLog::IndexedLog&, pg_missing_t&, std::ostringstream&, const DoutPrefixProvider*, std::set<std::__cxx11::basic_string<char> >*)' thread 7f728f9368c0 time 2016-11-15 09:46:52.070023
> >> > osd/PGLog.cc: 1047: FAILED assert(oi.version == i->first)
> >> >
> >> > ceph version 10.2.3 (ecc23778eb545d8dd55e2e4735b53cc93f92e65b)
> >> > 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x5642d2734ea0]
> >> > 2: (PGLog::read_log(ObjectStore*, coll_t, coll_t, ghobject_t, pg_info_t const&, std::map<eversion_t,
> >> > hobject_t, std::less<eversion_t>, std::allocator<std::pair<eversion_t const, hobject_t> > >&, PGLog::IndexedLog&, pg_missing_t&, std::__cxx11::basic_ostringstream<char, std::char_traits<char>, std::allocator<char> >&, DoutPrefixProvider const*, std::set<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > >*)+0x719) [0x5642d22e2fd9]
> >> > 3: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x2f6) [0x5642d21172d6]
> >> > 4: (OSD::load_pgs()+0x87d) [0x5642d205345d]
> >> > 5: (OSD::init()+0x2026) [0x5642d205e7a6]
> >> > 6: (main()+0x2ea5) [0x5642d1fd08f5]
> >> > 7: (__libc_start_main()+0xf0) [0x7f728c77c830]
> >> > 8: (_start()+0x29) [0x5642d2011f89]
> >> > NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> >> >
> >> > _______________________________________________
> >> > ceph-users mailing list
> >> > ceph-users@xxxxxxxxxxxxxx
> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com