On Fri, Mar 3, 2017 at 11:40 PM, Sage Weil <sage at newdream.net> wrote:
> On Fri, 3 Mar 2017, Mike Lovell wrote:
>> i started an upgrade process to go from 0.94.7 to 10.2.5 on a production
>> cluster that is using cache tiering. this cluster has 3 monitors, 28
>> storage nodes, and around 370 osds. the upgrade of the monitors completed
>> without issue. i then upgraded 2 of the storage nodes, and after the
>> restarts the osds started crashing during hit_set_trim. here is some of
>> the output from the log.
>>
>> 2017-03-02 22:41:32.338290 7f8bfd6d7700 -1 osd/ReplicatedPG.cc: In function
>> 'void ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned int)'
>> thread 7f8bfd6d7700 time 2017-03-02 22:41:32.335020
>> osd/ReplicatedPG.cc: 10514: FAILED assert(obc)
>>
>> ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xbddac5]
>> 2: (ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned int)+0x75f) [0x87e48f]
>> 3: (ReplicatedPG::hit_set_persist()+0xedb) [0x87f4ab]
>> 4: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0xe3a) [0x8a0d1a]
>> 5: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x68a) [0x83be4a]
>> 6: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x405) [0x69a5c5]
>> 7: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x333) [0x69ab33]
>> 8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f) [0xbcd1cf]
>> 9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xbcf300]
>> 10: (()+0x7dc5) [0x7f8c1c209dc5]
>> 11: (clone()+0x6d) [0x7f8c1aceaced]
>>
>> it started on just one osd and then spread to others until most of the
>> osds that are part of the cache tier were crashing. that was happening
>> both on the osds running jewel and on the ones still running hammer. in
>> the process of trying to sort this out, the use_gmt_hitset option was set
>> to true and all of the osds were upgraded to jewel. we still have not
>> been able to determine a cause or a fix.
>>
>> it looks like when hit_set_trim and hit_set_remove_all are called, they
>> call hit_set_archive_object() to generate an object name based on a
>> timestamp and then call get_object_context(), which returns nothing and
>> triggers the assert.
>>
>> i raised debug_osd to 10/10 and analyzed the logs after the crash. i found
>> the following in the ceph osd log afterwards.
>>
>> 2017-03-03 03:10:31.918470 7f218c842700 10 osd.146 pg_epoch: 266043
>> pg[19.5d4( v 264786'61233923 (262173'61230715,264786'61233923]
>> local-les=266043 n=393 ec=83762 les/c/f 266043/264767/0
>> 266042/266042/266042) [146,116,179] r=0 lpr=266042
>> pi=264766-266041/431 crt=262323'61233250 lcod 0'0 mlcod 0'0 active+degraded
>> NIBBLEWISE] get_object_context: no obc for soid
>> 19:2ba00000:.ceph-internal::hit_set_19.5d4_archive_2017-03-03
>> 05%3a55%3a58.459084Z_2017-03-03 05%3a56%3a58.981016Z:head and !can_create
>> 2017-03-03 03:10:31.921064 7f2194051700 10 osd.146 266043 do_waiters -- start
>> 2017-03-03 03:10:31.921072 7f2194051700 10 osd.146 266043 do_waiters -- finish
>> 2017-03-03 03:10:31.921076 7f2194051700  7 osd.146 266043 handle_pg_notify from osd.255
>> 2017-03-03 03:10:31.921096 7f2194051700 10 osd.146 266043 do_waiters -- start
>> 2017-03-03 03:10:31.921099 7f2194051700 10 osd.146 266043 do_waiters -- finish
>> 2017-03-03 03:10:31.925858 7f218c041700 -1 osd/ReplicatedPG.cc: In function
>> 'void ReplicatedPG::hit_set_remove_all()' thread 7f218c041700 time
>> 2017-03-03 03:10:31.918201
>> osd/ReplicatedPG.cc: 11494: FAILED assert(obc)
>>
>> ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7f21acee9425]
>> 2: (ReplicatedPG::hit_set_remove_all()+0x412) [0x7f21ac9cba92]
>> 3: (ReplicatedPG::on_activate()+0x6dd) [0x7f21ac9f73fd]
>> 4: (PG::RecoveryState::Active::react(PG::AllReplicasActivated const&)+0xac) [0x7f21ac916adc]
>> 5: (boost::statechart::simple_state<PG::RecoveryState::Active,
>> PG::RecoveryState::Primary, PG::RecoveryState::Activating,
>> (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base
>> const&, void const*)+0x179) [0x7f21ac974909]
>> 6: (boost::statechart::simple_state<PG::RecoveryState::Activating,
>> PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na,
>> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
>> mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
>> mpl_::na, mpl_::na, mpl_::na>,
>> (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base
>> const&, void const*)+0xcd) [0x7f21ac977ccd]
>> 7: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine,
>> PG::RecoveryState::Initial, std::allocator<void>,
>> boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base
>> const&)+0x6b) [0x7f21ac95d9cb]
>> 8: (PG::handle_peering_event(std::shared_ptr<PG::CephPeeringEvt>, PG::RecoveryCtx*)+0x1f4) [0x7f21ac924d24]
>> 9: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x259) [0x7f21ac87de99]
>> 10: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&, ThreadPool::TPHandle&)+0x12) [0x7f21ac8c8552]
>> 11: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa7e) [0x7f21aceda46e]
>> 12: (ThreadPool::WorkThread::entry()+0x10) [0x7f21acedb350]
>> 13: (()+0x7dc5) [0x7f21aac04dc5]
>> 14: (clone()+0x6d) [0x7f21a928f73d]
>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
>> interpret this.
>>
>> it looks like it was looking for a hit_set with a timestamp of
>> 05:55:58.459084Z but the timestamp on the log lines is for 3:10:31.921096.
>> we have tried various options of the hit_set_count and hit_set_period but
>> the osds are still trying to find objects for timestamps that don't exist.
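
the archive objects are named from each hit set interval's begin/end
timestamps (the "no obc for soid ... hit_set_19.5d4_archive_..." line above),
so hit_set_trim() and hit_set_remove_all() only find an object if the name
they compute now matches the name used when the object was written. one way a
lookup like that can come up empty is if the same interval is named with
local time on write but GMT on lookup, which is what the use_gmt_hitset
switch controls. the snippet below is only an illustration of that failure
mode, not the ReplicatedPG code; archive_name() and the exact name format are
simplified stand-ins.

    // illustration only: archive_name() is a made-up helper that mimics the
    // general shape of the hit set archive object names in the log above.
    #include <cassert>
    #include <ctime>
    #include <set>
    #include <string>

    static std::string archive_name(time_t begin, time_t end, bool use_gmt) {
      char b[32], e[32];
      struct tm tb, te;
      if (use_gmt) { gmtime_r(&begin, &tb); gmtime_r(&end, &te); }
      else         { localtime_r(&begin, &tb); localtime_r(&end, &te); }
      strftime(b, sizeof(b), "%Y-%m-%d %H:%M:%S", &tb);
      strftime(e, sizeof(e), "%Y-%m-%d %H:%M:%S", &te);
      return std::string("hit_set_19.5d4_archive_") + b + "_" + e;
    }

    int main() {
      time_t begin = 1488520558, end = 1488520618;

      // archive object written under one naming convention...
      std::set<std::string> on_disk;
      on_disk.insert(archive_name(begin, end, /*use_gmt=*/false));

      // ...and looked up later under the other. the computed name no longer
      // matches what is on disk, the lookup returns nothing, and that is
      // the point where the OSD would hit "FAILED assert(obc)".
      assert(on_disk.count(archive_name(begin, end, /*use_gmt=*/true)));
      return 0;
    }

note that with the timezone set to UTC the two names coincide and nothing
trips, which is one reason a mismatch like this can be hard to reproduce in a
test environment.
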
>> i've looked through the osd pg directories (i.e.
>> /var/lib/ceph/osd/ceph-##/current/<pg.id>_head) and found hit_set objects,
>> but with wildly different timestamps.
>>
>> there are a few bugs on the tracker that look like they're related, but we
>> haven't been able to find a fix from them yet.
>> http://tracker.ceph.com/issues/13185 is the closest one, but it hasn't
>> given us a hint on a fix yet. http://tracker.ceph.com/issues/14399 and
>> http://tracker.ceph.com/issues/9732 also seem related but also haven't
>> helped. we have verified that all nodes are using the same timezone and
>> have consistent timestamps. we're currently looking at options to edit the
>> hit_set_setup() code to skip the hit_set_remove() call and the failing
>> code. this would give us empty hit sets, and the thought is that we would
>> then flush the cache tier, remove it, and add a new one back. we're still
>> not sure if it's going to work.
>>
>> does anyone else have any thoughts about how to handle this, or why it is
>> happening? or what else we could do to get the osds back online? this has
>> crashed almost all of the cache tier osds, and the cluster is unusable at
>> the moment.
>
> This looks like it's a variation on the earlier bug, but I'm not sure why
> you're hitting it and others haven't. In the meantime, to get your cluster
> up, you just need to skip the hitsets that won't load obcs. This will
> leave the objects behind, but that's not a big deal and can be cleaned up
> later.
>
> See github.com/liewegas/ceph/commit/wip-jewel-nocrash-gmt. You can build
> that, or you should see some (completely untested!) packages appear at
> shaman.ceph.com in an hour or so.
>
> Maybe Kefu knows why this might have triggered?

i pulled together a test at https://github.com/ceph/ceph/pull/13805 to
exercise the scenario where we upgrade from hammer to jewel with a cache
tier, but it failed to reproduce this issue. maybe we should just skip
removing the archive hitset object if it is missing locally. see
https://github.com/ceph/ceph/pull/13821.

>
> sage

--
Regards
Kefu Chai
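
both fixes mentioned above (wip-jewel-nocrash-gmt and
https://github.com/ceph/ceph/pull/13821) describe the same approach: tolerate
a missing obc instead of asserting on it, so trim/activation can proceed and
the leftover archive object can be removed later (as sage notes, the stray
objects are not a big deal). the sketch below only shows that shape with
made-up stand-in types; it is not the code from either branch.

    // sketch only: ObjectContext/ObjectContextRef here are simplified
    // stand-ins, not the OSD's real types.
    #include <iostream>
    #include <memory>
    #include <set>
    #include <string>

    struct ObjectContext { std::string oid; };
    using ObjectContextRef = std::shared_ptr<ObjectContext>;

    // stand-in lookup: returns nullptr when the named archive object is not
    // present locally -- the case that currently ends in FAILED assert(obc).
    ObjectContextRef lookup(const std::set<std::string>& store,
                            const std::string& oid) {
      if (!store.count(oid))
        return nullptr;
      return std::make_shared<ObjectContext>(ObjectContext{oid});
    }

    // before: assert(obc) aborts the OSD when the lookup misses.
    // after (the idea discussed above): warn and skip, leaving the stray
    // object behind to be cleaned up later.
    void trim_archive_object(const std::set<std::string>& store,
                             const std::string& oid) {
      ObjectContextRef obc = lookup(store, oid);
      if (!obc) {
        std::cerr << "hit_set_trim: " << oid << " not found, skipping\n";
        return;  // instead of: assert(obc);
      }
      std::cout << "trimming " << obc->oid << "\n";
    }

    int main() {
      std::set<std::string> store = {"hit_set_19.5d4_archive_A_B"};
      trim_archive_object(store, "hit_set_19.5d4_archive_A_B");  // trimmed
      trim_archive_object(store, "hit_set_19.5d4_archive_C_D");  // skipped
      return 0;
    }
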