On Tue, Mar 7, 2017 at 3:30 PM, kefu chai <tchaikov at gmail.com> wrote:
> On Fri, Mar 3, 2017 at 11:40 PM, Sage Weil <sage at newdream.net> wrote:
>> On Fri, 3 Mar 2017, Mike Lovell wrote:
>>> i started an upgrade process to go from 0.94.7 to 10.2.5 on a production
>>> cluster that is using cache tiering. this cluster has 3 monitors, 28
>>> storage nodes, and around 370 osds. the upgrade of the monitors completed
>>> without issue. i then upgraded 2 of the storage nodes, and after the
>>> restarts, the osds started crashing during hit_set_trim. here is some of
>>> the output from the log.
>>>
>>> 2017-03-02 22:41:32.338290 7f8bfd6d7700 -1 osd/ReplicatedPG.cc: In function
>>> 'void ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned int)'
>>> thread 7f8bfd6d7700 time 2017-03-02 22:41:32.335020
>>> osd/ReplicatedPG.cc: 10514: FAILED assert(obc)
>>>
>>> ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
>>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xbddac5]
>>> 2: (ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned int)+0x75f) [0x87e48f]
>>> 3: (ReplicatedPG::hit_set_persist()+0xedb) [0x87f4ab]
>>> 4: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0xe3a) [0x8a0d1a]
>>> 5: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x68a) [0x83be4a]
>>> 6: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x405) [0x69a5c5]
>>> 7: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x333) [0x69ab33]
>>> 8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xbcd1cf]
>>> 9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xbcf300]
>>> 10: (()+0x7dc5) [0x7f8c1c209dc5]
>>> 11: (clone()+0x6d) [0x7f8c1aceaced]
>>>
>>> it started on just one osd and then spread to others until most of the
>>> osds that are part of the cache tier were crashing. that was happening on
>>> both the osds that were running jewel and on the ones running hammer. in
>>> the process of trying to sort this out, the use_gmt_hitset option was set
>>> to true and all of the osds were upgraded to jewel. we still have not been
>>> able to determine a cause or a fix.
>>>
>>> it looks like when hit_set_trim and hit_set_remove_all are being called,
>>> they are calling hit_set_archive_object() to generate a name based on a
>>> timestamp and then calling get_object_context(), which returns nothing
>>> and triggers an assert.
>>>
>>> i raised the debug_osd to 10/10 and then analyzed the logs after the
>>> crash. i found the following in the ceph osd log afterwards.
>>>
>>> 2017-03-03 03:10:31.918470 7f218c842700 10 osd.146 pg_epoch: 266043
>>> pg[19.5d4( v 264786'61233923 (262173'61230715,264786'61233923]
>>> local-les=266043 n=393 ec=83762 les/c/f 266043/264767/0
>>> 266042/266042/266042) [146,116,179] r=0 lpr=266042
>>> pi=264766-266041/431 crt=262323'61233250 lcod 0'0 mlcod 0'0 active+degraded
>>> NIBBLEWISE] get_object_context: no obc for soid
>>> 19:2ba00000:.ceph-internal::hit_set_19.5d4_archive_2017-03-03
>>> 05%3a55%3a58.459084Z_2017-03-03 05%3a56%3a58.981016Z:head and !can_create
>>> 2017-03-03 03:10:31.921064 7f2194051700 10 osd.146 266043 do_waiters -- start
>>> 2017-03-03 03:10:31.921072 7f2194051700 10 osd.146 266043 do_waiters -- finish
>>> 2017-03-03 03:10:31.921076 7f2194051700 7 osd.146 266043 handle_pg_notify from osd.255
>>> 2017-03-03 03:10:31.921096 7f2194051700 10 osd.146 266043 do_waiters -- start
>>> 2017-03-03 03:10:31.921099 7f2194051700 10 osd.146 266043 do_waiters -- finish
>>> 2017-03-03 03:10:31.925858 7f218c041700 -1 osd/ReplicatedPG.cc: In function
>>> 'void ReplicatedPG::hit_set_remove_all()' thread 7f218c041700 time
>>> 2017-03-03 03:10:31.918201
>>> osd/ReplicatedPG.cc: 11494: FAILED assert(obc)
>>>
>>> ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
>>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f21acee9425]
>>> 2: (ReplicatedPG::hit_set_remove_all()+0x412) [0x7f21ac9cba92]
>>> 3: (ReplicatedPG::on_activate()+0x6dd) [0x7f21ac9f73fd]
>>> 4: (PG::RecoveryState::Active::react(PG::AllReplicasActivated const&)+0xac) [0x7f21ac916adc]
>>> 5: (boost::statechart::simple_state<PG::RecoveryState::Active, PG::RecoveryState::Primary, PG::RecoveryState::Activating, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0x179) [0x7f21ac974909]
>>> 6: (boost::statechart::simple_state<PG::RecoveryState::Activating, PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xcd) [0x7f21ac977ccd]
>>> 7: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine, PG::RecoveryState::Initial, std::allocator<void>, boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base const&)+0x6b) [0x7f21ac95d9cb]
>>> 8: (PG::handle_peering_event(std::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x1f4) [0x7f21ac924d24]
>>> 9: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x259) [0x7f21ac87de99]
>>> 10: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x12) [0x7f21ac8c8552]
>>> 11: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa7e) [0x7f21aceda46e]
>>> 12: (ThreadPool::WorkThread::entry()+0x10) [0x7f21acedb350]
>>> 13: (()+0x7dc5) [0x7f21aac04dc5]
>>> 14: (clone()+0x6d) [0x7f21a928f73d]
>>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>>
>>> it looks like it was looking for a hit_set with a timestamp
>>> of 05:55:58.459084Z but the timestamp on the log lines is
>>> 03:10:31.921096.
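
to make the failing path above easier to follow, here is a stand-alone,
much-simplified model of it. this is my own sketch, not the actual
osd/ReplicatedPG.cc code; the object name and timestamps are copied from the
log above, and everything else (the ObjectContext struct, the string-keyed
"store", trim_one, main) is a stand-in. the point it illustrates: the archive
object's name is derived purely from the hit set's begin/end timestamps, so if
the trim/remove path asks for a pair of timestamps that was never persisted
under exactly that name, get_object_context() returns nothing and the
FAILED assert(obc) above fires.

// simplified stand-alone model of the failing path -- not the actual ceph code
#include <cassert>
#include <map>
#include <memory>
#include <string>

struct ObjectContext {};                        // stand-in for the real obc

// stand-in object store: archive object name -> context
std::map<std::string, std::shared_ptr<ObjectContext>> store;

// stand-in for hit_set_archive_object(start, end): the name is derived
// only from the two timestamps, so writer and trimmer must agree exactly
std::string hit_set_archive_object(const std::string& start,
                                   const std::string& end) {
  return "hit_set_19.5d4_archive_" + start + "_" + end;
}

// stand-in for the hit_set_trim()/hit_set_remove_all() lookup
void trim_one(const std::string& start, const std::string& end) {
  std::string soid = hit_set_archive_object(start, end);
  auto it = store.find(soid);                   // get_object_context(soid, !can_create)
  std::shared_ptr<ObjectContext> obc =
      (it == store.end()) ? nullptr : it->second;
  assert(obc && "FAILED assert(obc): archive object missing locally");
  if (obc)
    store.erase(it);                            // normal path: drop the old hit set object
}

int main() {
  // the archive object was persisted under one pair of timestamps...
  store[hit_set_archive_object("2017-03-03 05:55:58Z",
                               "2017-03-03 05:56:58Z")] =
      std::make_shared<ObjectContext>();
  // ...but trim asks for a different pair (e.g. recorded under a different
  // naming scheme, such as local time vs gmt, or left over from before the
  // upgrade), so the lookup misses and the assert aborts the osd
  trim_one("2017-03-03 00:55:58Z", "2017-03-03 00:56:58Z");
  return 0;
}
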
>>> we have tried various options for hit_set_count and hit_set_period but
>>> the osds are still trying to find objects for timestamps that don't
>>> exist. i've looked through the osd pg directories (i.e.
>>> /var/lib/ceph/osd/ceph-##/current/<pg.id>_head) and found hit_set
>>> objects, but with wildly different timestamps.
>>>
>>> there are a few bugs on the tracker that look like they're related but we
>>> haven't been able to find a fix from them yet.
>>> http://tracker.ceph.com/issues/13185 is the closest one but it hasn't
>>> given us a hint on a fix yet. http://tracker.ceph.com/issues/14399
>>> and http://tracker.ceph.com/issues/9732 also seem related but also
>>> haven't helped. we have verified that all nodes are using the same
>>> timezone and have consistent timestamps. we're currently looking at
>>> options to edit the hit_set_setup() code to skip the hit_set_remove()
>>> call and the failing code. this will give us empty hit sets, and the
>>> thought is that we will then flush the cache tier, remove it, and then
>>> add one back. still not sure if it's going to work.
>>>
>>> does anyone else have any other thoughts about how to handle this or why
>>> this is happening? or what else we could do to get the osds back online?
>>> this has crashed almost all of the cache tier osds and the cluster is
>>> unusable at the moment.
>>
>> This looks like it's a variation on the earlier bug, but I'm not sure why
>> you're hitting it and others haven't. In the meantime, to get your cluster
>> up, you just need to skip the hitsets that won't load obc's. This will
>> leave the objects behind, but it's not a big deal and they can be cleaned
>> up later.
>>
>> See github.com/liewegas/ceph/commit/wip-jewel-nocrash-gmt. You can build
>> that, or you should see some (completely untested!) packages appear at
>> shaman.ceph.com in an hour or so.
>>
>> Maybe Kefu knows why this might have triggered?
>
> i pulled together a test at https://github.com/ceph/ceph/pull/13805
> to exercise the scenario where we upgrade from hammer to jewel with a
> cache tier, but it failed to reproduce this issue.
>
> maybe we should just skip removing the archive hitset object if it is
> missing locally. see https://github.com/ceph/ceph/pull/13821.

turns out it's a bug in hammer! fixes pending on review.

BTW, it's tracked by http://tracker.ceph.com/issues/19185.

--
Regards
Kefu Chai
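
the shape of the workaround being discussed (skip hit set archive objects
whose obc cannot be loaded, instead of asserting) looks roughly like this in
the same much-simplified model used earlier in this message. this is an
illustration only, not the actual change in the wip-jewel-nocrash-gmt branch
or in https://github.com/ceph/ceph/pull/13821; all names here are stand-ins.

// shape of the discussed workaround in the simplified model: treat a missing
// archive object as "nothing to trim/remove" instead of asserting
#include <iostream>
#include <set>
#include <string>

// stand-in: names of hit set archive objects that exist locally
std::set<std::string> local_objects = {
    "hit_set_19.5d4_archive_2017-03-03 05:55:58Z_2017-03-03 05:56:58Z"};

// tolerant version of the trim/remove lookup: log and skip when the obc
// cannot be found, leaving any stray archive objects for later cleanup
void trim_one_tolerant(const std::string& soid) {
  auto it = local_objects.find(soid);
  if (it == local_objects.end()) {
    std::cerr << "no obc for " << soid << ", skipping" << std::endl;
    return;                                     // previously: assert(obc)
  }
  local_objects.erase(it);                      // normal path: remove the object
}

int main() {
  // a name the pg never wrote: the osd now logs and carries on instead of
  // hitting FAILED assert(obc) and crashing
  trim_one_tolerant(
      "hit_set_19.5d4_archive_2017-03-03 00:55:58Z_2017-03-03 00:56:58Z");
  return 0;
}

as sage notes, the objects that are skipped this way are simply left behind
and can be cleaned up once the cluster is healthy again.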