osds crashing during hit_set_trim and hit_set_remove_all

Mike Lovell <mike.lovell@xxxxxxxxxxxxx> · Fri, 3 Mar 2017 03:35:25 -0700

i started an upgrade process to go from 0.94.7 to 10.2.5 on a production cluster that is using cache tiering. this cluster has 3 monitors, 28 storage nodes, around 370 osds. the upgrade of the monitors completed without issue. i then upgraded 2 of the storage nodes, and after the restarts, the osds started crashing during hit_set_trim. here is some of the output from the log.
2017-03-02 22:41:32.338290 7f8bfd6d7700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned int)' thread 7f8bfd6d7700 time 2017-03-02 22:41:32.335020
osd/ReplicatedPG.cc: 10514: FAILED assert(obc)

 ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xbddac5]
 2: (ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned int)+0x75f) [0x87e48f]
 3: (ReplicatedPG::hit_set_persist()+0xedb) [0x87f4ab]
 4: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0xe3a) [0x8a0d1a]
 5: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x68a) [0x83be4a]
 6: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x405) [0x69a5c5]
 7: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x333) [0x69ab33]
 8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xbcd1cf]
 9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xbcf300]
 10: (()+0x7dc5) [0x7f8c1c209dc5]
 11: (clone()+0x6d) [0x7f8c1aceaced]

it started on just one osd and then spread to others until most of the osds that are part of the cache tier were crashing. that was happening on both the osds that were running jewel and on the ones running hammer. in the process of trying to sort this out, the use_gmt_hitset option was set to true and all of the osds were upgraded to hammer. we still have not been able to determine a cause or a fix.

it looks like when hit_set_trim and hit_set_remove_all are being called, they are calling hit_set_archive_object() to generate a name based on a timestamp and then calling get_object_context() which then returns nothing and triggers an assert.

i raised the debug_osd to 10/10 and then analyzed the logs after the crash. i found the following in the ceph osd log afterwards.

2017-03-03 03:10:31.918470 7f218c842700 10 osd.146 pg_epoch: 266043 pg[19.5d4( v 264786'61233923 (262173'61230715,264786'61233923] local-les=266043 n=393 ec=83762 les/c/f 266043/264767/0 266042/266042/266042) [146,116,179] r=0 lpr=266042
 pi=264766-266041/431 crt=262323'61233250 lcod 0'0 mlcod 0'0 active+degraded NIBBLEWISE] get_object_context: no obc for soid 19:2ba00000:.ceph-internal::hit_set_19.5d4_archive_2017-03-03 05%3a55%3a58.459084Z_2017-03-03 05%3a56%3a58.98101
6Z:head and !can_create
2017-03-03 03:10:31.921064 7f2194051700 10 osd.146 266043 do_waiters -- start
2017-03-03 03:10:31.921072 7f2194051700 10 osd.146 266043 do_waiters -- finish
2017-03-03 03:10:31.921076 7f2194051700  7 osd.146 266043 handle_pg_notify from osd.255
2017-03-03 03:10:31.921096 7f2194051700 10 osd.146 266043 do_waiters -- start
2017-03-03 03:10:31.921099 7f2194051700 10 osd.146 266043 do_waiters -- finish
2017-03-03 03:10:31.925858 7f218c041700 -1 osd/ReplicatedPG.cc: In function 'void ReplicatedPG::hit_set_remove_all()' thread 7f218c041700 time 2017-03-03 03:10:31.918201
osd/ReplicatedPG.cc: 11494: FAILED assert(obc)

 ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f21acee9425]
 2: (ReplicatedPG::hit_set_remove_all()+0x412) [0x7f21ac9cba92]
 3: (ReplicatedPG::on_activate()+0x6dd) [0x7f21ac9f73fd]
 4: (PG::RecoveryState::Active::react(PG::AllReplicasActivated const&)+0xac) [0x7f21ac916adc]
 5: (boost::statechart::simple_state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x179) [0x7f21a
c974909]
 6: (boost::statechart::simple_state<PG::RecoveryState::Activating, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_:
:na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xcd) [0x7f21ac977ccd]
 7: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x6b) [0x7f21ac95
d9cb]
 8: (PG::handle_peering_event(std::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x1f4) [0x7f21ac924d24]
 9: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x259) [0x7f21ac87de99]
 10: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x12) [0x7f21ac8c8552]
 11: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa7e) [0x7f21aceda46e]
 12: (ThreadPool::WorkThread::entry()+0x10) [0x7f21acedb350]
 13: (()+0x7dc5) [0x7f21aac04dc5]
 14: (clone()+0x6d) [0x7f21a928f73d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

it looks like it was looking for a hit_set with a timestamp of 05:55:58.459084Z but the timestamp on the log lines is for 3:10:31.921096. we have tried various options of the hit_set_count and hit_set_period but the osds are still trying to find objects for timestamps that don't exist. i've looked through the osd pg directories (i.e. /var/lib/ceph/osd/ceph-##/current/<pg.id>_head) and found hit_set objects but with wildly different timestamps.

there are a few bugs on tracker that look like their related but we haven't been able to find a fix from them yet. http://tracker.ceph.com/issues/13185 is the closest one but it hasn't given us a hint on a fix yet. http://tracker.ceph.com/issues/14399 and http://tracker.ceph.com/issues/9732 also seem related but also haven't helped. we have verified that all nodes are using the same timezone and have consistent timestamps. we're currently looking at options to edit the hit_set_setup() code to skip the hit_set_remove() call and the failing code. this will give us empty hit sets and the thought is that we will then flush the cache tier and remove it then add one back. still not sure if its going to work.

does anyone else have any other thoughts about how to handle this or why this is happening? or what else we could do to get the osds back online? this has crashed almost all of the cache tier osds and the cluster is unusable at the moment.

thanks in advance.
mike
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com