On Fri, Mar 3, 2017 at 11:40 PM, Sage Weil <sage at newdream.net> wrote:
> On Fri, 3 Mar 2017, Mike Lovell wrote:
>> i started an upgrade process to go from 0.94.7 to 10.2.5 on a production
>> cluster that is using cache tiering. this cluster has 3 monitors, 28
>> storage nodes, and around 370 osds. the upgrade of the monitors completed
>> without issue. i then upgraded 2 of the storage nodes, and after the
>> restarts the osds started crashing during hit_set_trim. here is some of
>> the output from the log.
>>
>> 2017-03-02 22:41:32.338290 7f8bfd6d7700 -1 osd/ReplicatedPG.cc: In function
>> 'void ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned int)'
>> thread 7f8bfd6d7700 time 2017-03-02 22:41:32.335020
>> osd/ReplicatedPG.cc: 10514: FAILED assert(obc)
>>
>> ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xbddac5]
>> 2: (ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned int)+0x75f) [0x87e48f]
>> 3: (ReplicatedPG::hit_set_persist()+0xedb) [0x87f4ab]
>> 4: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0xe3a) [0x8a0d1a]
>> 5: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x68a) [0x83be4a]
>> 6: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x405) [0x69a5c5]
>> 7: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x333) [0x69ab33]
>> 8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xbcd1cf]
>> 9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xbcf300]
>> 10: (()+0x7dc5) [0x7f8c1c209dc5]
>> 11: (clone()+0x6d) [0x7f8c1aceaced]
>>
>> it started on just one osd and then spread to others until most of the
>> osds that are part of the cache tier were crashing. that was happening
>> both on the osds running jewel and on the ones still running hammer. in
>> the process of trying to sort this out, the use_gmt_hitset option was set
>> to true and all of the osds were upgraded to jewel. we still have not
>> been able to determine a cause or a fix.
>>
>> it looks like when hit_set_trim and hit_set_remove_all are called, they
>> call hit_set_archive_object() to generate an object name based on a
>> timestamp and then call get_object_context(), which returns nothing and
>> triggers the assert.
>>
>> i raised debug_osd to 10/10 and analyzed the logs after the crash. i found
>> the following in the ceph osd log afterwards.
>>
>> 2017-03-03 03:10:31.918470 7f218c842700 10 osd.146 pg_epoch: 266043
>> pg[19.5d4( v 264786'61233923 (262173'61230715,264786'61233923]
>> local-les=266043 n=393 ec=83762 les/c/f 266043/264767/0
>> 266042/266042/266042) [146,116,179] r=0 lpr=266042
>> pi=264766-266041/431 crt=262323'61233250 lcod 0'0 mlcod 0'0 active+degraded
>> NIBBLEWISE] get_object_context: no obc for soid
>> 19:2ba00000:.ceph-internal::hit_set_19.5d4_archive_2017-03-03
>> 05%3a55%3a58.459084Z_2017-03-03 05%3a56%3a58.981016Z:head and !can_create
>> 2017-03-03 03:10:31.921064 7f2194051700 10 osd.146 266043 do_waiters -- start
>> 2017-03-03 03:10:31.921072 7f2194051700 10 osd.146 266043 do_waiters -- finish
>> 2017-03-03 03:10:31.921076 7f2194051700  7 osd.146 266043 handle_pg_notify from osd.255
>> 2017-03-03 03:10:31.921096 7f2194051700 10 osd.146 266043 do_waiters -- start
>> 2017-03-03 03:10:31.921099 7f2194051700 10 osd.146 266043 do_waiters -- finish
>> 2017-03-03 03:10:31.925858 7f218c041700 -1 osd/ReplicatedPG.cc: In function
>> 'void ReplicatedPG::hit_set_remove_all()' thread 7f218c041700 time
>> 2017-03-03 03:10:31.918201
>> osd/ReplicatedPG.cc: 11494: FAILED assert(obc)
>>
>> ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f21acee9425]
>> 2: (ReplicatedPG::hit_set_remove_all()+0x412) [0x7f21ac9cba92]
>> 3: (ReplicatedPG::on_activate()+0x6dd) [0x7f21ac9f73fd]
>> 4: (PG::RecoveryState::Active::react(PG::AllReplicasActivated const&)+0xac) [0x7f21ac916adc]
>> 5: (boost::statechart::simple_state<PG::RecoveryState::Active,
>> PG::RecoveryState::Primary, PG::RecoveryState::Activating,
>> (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base
>> const&, void const*)+0x179) [0x7f21ac974909]
>> 6: (boost::statechart::simple_state<PG::RecoveryState::Activating,
>> PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na,
>> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
>> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
>> mpl_::na, mpl_::na, mpl_::na>,
>> (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base
>> const&, void const*)+0xcd) [0x7f21ac977ccd]
>> 7: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine,
>> PG::RecoveryState::Initial, std::allocator<void>,
>> boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base
>> const&)+0x6b) [0x7f21ac95d9cb]
>> 8: (PG::handle_peering_event(std::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x1f4) [0x7f21ac924d24]
>> 9: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x259) [0x7f21ac87de99]
>> 10: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x12) [0x7f21ac8c8552]
>> 11: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa7e) [0x7f21aceda46e]
>> 12: (ThreadPool::WorkThread::entry()+0x10) [0x7f21acedb350]
>> 13: (()+0x7dc5) [0x7f21aac04dc5]
>> 14: (clone()+0x6d) [0x7f21a928f73d]
>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
>> interpret this.
>>
>> it looks like it was looking for a hit_set with a timestamp of
>> 05:55:58.459084Z but the timestamp on the log lines is for 3:10:31.921096.
>> we have tried various options of the hit_set_count and hit_set_period but
>> the osds are still trying to find objects for timestamps that don't exist.
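
the archive objects are named from each hit set interval's begin/end
timestamps (the "no obc for soid ... hit_set_19.5d4_archive_..." line above),
so hit_set_trim() and hit_set_remove_all() only find an object if the name
they compute now matches the name used when the object was written. one way a
lookup like that can come up empty is if the same interval is named with
local time on write but GMT on lookup, which is what the use_gmt_hitset
switch controls. the snippet below is only an illustration of that failure
mode, not the ReplicatedPG code; archive_name() and the exact name format are
simplified stand-ins.

    // illustration only: archive_name() is a made-up helper that mimics the
    // general shape of the hit set archive object names in the log above.
    #include <cassert>
    #include <ctime>
    #include <set>
    #include <string>

    static std::string archive_name(time_t begin, time_t end, bool use_gmt) {
      char b[32], e[32];
      struct tm tb, te;
      if (use_gmt) { gmtime_r(&begin, &tb); gmtime_r(&end, &te); }
      else         { localtime_r(&begin, &tb); localtime_r(&end, &te); }
      strftime(b, sizeof(b), "%Y-%m-%d %H:%M:%S", &tb);
      strftime(e, sizeof(e), "%Y-%m-%d %H:%M:%S", &te);
      return std::string("hit_set_19.5d4_archive_") + b + "_" + e;
    }

    int main() {
      time_t begin = 1488520558, end = 1488520618;

      // archive object written under one naming convention...
      std::set<std::string> on_disk;
      on_disk.insert(archive_name(begin, end, /*use_gmt=*/false));

      // ...and looked up later under the other. the computed name no longer
      // matches what is on disk, the lookup returns nothing, and that is
      // the point where the OSD would hit "FAILED assert(obc)".
      assert(on_disk.count(archive_name(begin, end, /*use_gmt=*/true)));
      return 0;
    }

note that with the timezone set to UTC the two names coincide and nothing
trips, which is one reason a mismatch like this can be hard to reproduce in a
test environment.
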
>> i've looked through the osd pg directories (i.e.
>> /var/lib/ceph/osd/ceph-##/current/<pg.id>_head) and found hit_set objects,
>> but with wildly different timestamps.
>>
>> there are a few bugs on the tracker that look like they're related, but we
>> haven't been able to find a fix from them yet.
>> http://tracker.ceph.com/issues/13185 is the closest one, but it hasn't
>> given us a hint on a fix yet. http://tracker.ceph.com/issues/14399 and
>> http://tracker.ceph.com/issues/9732 also seem related but also haven't
>> helped. we have verified that all nodes are using the same timezone and
>> have consistent timestamps. we're currently looking at options to edit the
>> hit_set_setup() code to skip the hit_set_remove() call and the failing
>> code. this would give us empty hit sets, and the thought is that we would
>> then flush the cache tier, remove it, and add a new one back. we're still
>> not sure if it's going to work.
>>
>> does anyone else have any thoughts about how to handle this, or why it is
>> happening? or what else we could do to get the osds back online? this has
>> crashed almost all of the cache tier osds, and the cluster is unusable at
>> the moment.
>
> This looks like it's a variation on the earlier bug, but I'm not sure why
> you're hitting it and others haven't. In the meantime, to get your cluster
> up, you just need to skip the hitsets that won't load obcs. This will
> leave the objects behind, but that's not a big deal and can be cleaned up
> later.
>
> See github.com/liewegas/ceph/commit/wip-jewel-nocrash-gmt. You can build
> that, or you should see some (completely untested!) packages appear at
> shaman.ceph.com in an hour or so.
>
> Maybe Kefu knows why this might have triggered?

i pulled together a test at https://github.com/ceph/ceph/pull/13805 to
exercise the scenario where we upgrade from hammer to jewel with a cache
tier, but it failed to reproduce this issue. maybe we should just skip
removing the archive hitset object if it is missing locally. see
https://github.com/ceph/ceph/pull/13821.

>
> sage

--
Regards
Kefu Chai
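
both fixes mentioned above (wip-jewel-nocrash-gmt and
https://github.com/ceph/ceph/pull/13821) describe the same approach: tolerate
a missing obc instead of asserting on it, so trim/activation can proceed and
the leftover archive object can be removed later (as sage notes, the stray
objects are not a big deal). the sketch below only shows that shape with
made-up stand-in types; it is not the code from either branch.

    // sketch only: ObjectContext/ObjectContextRef here are simplified
    // stand-ins, not the OSD's real types.
    #include <iostream>
    #include <memory>
    #include <set>
    #include <string>

    struct ObjectContext { std::string oid; };
    using ObjectContextRef = std::shared_ptr<ObjectContext>;

    // stand-in lookup: returns nullptr when the named archive object is not
    // present locally -- the case that currently ends in FAILED assert(obc).
    ObjectContextRef lookup(const std::set<std::string>& store,
                            const std::string& oid) {
      if (!store.count(oid))
        return nullptr;
      return std::make_shared<ObjectContext>(ObjectContext{oid});
    }

    // before: assert(obc) aborts the OSD when the lookup misses.
    // after (the idea discussed above): warn and skip, leaving the stray
    // object behind to be cleaned up later.
    void trim_archive_object(const std::set<std::string>& store,
                             const std::string& oid) {
      ObjectContextRef obc = lookup(store, oid);
      if (!obc) {
        std::cerr << "hit_set_trim: " << oid << " not found, skipping\n";
        return;  // instead of: assert(obc);
      }
      std::cout << "trimming " << obc->oid << "\n";
    }

    int main() {
      std::set<std::string> store = {"hit_set_19.5d4_archive_A_B"};
      trim_archive_object(store, "hit_set_19.5d4_archive_A_B");  // trimmed
      trim_archive_object(store, "hit_set_19.5d4_archive_C_D");  // skipped
      return 0;
    }
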