Re: Lots of OSDs crashlooping

Looks like this tracker [1], created by Telemetry.
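
To check whether the crashes on your cluster match that signature
(assuming the mgr crash module is enabled):

    ceph crash ls
    ceph crash info <crash-id>   # look for the ECUtil::HashInfo::append assert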


[1] https://tracker.ceph.com/issues/52220

k

> On 25 Jan 2022, at 02:44, Benjamin Staffin <bstaffin@xxxxxxxxxxxxxxx> wrote:
> 
> I have a cluster where 46 out of 120 OSDs have begun crash looping with the
> same stack trace (see pasted output below).  The cluster is in a very bad
> state with this many OSDs down, unsurprisingly.
> 
> The day before this problem showed up, the k8s cluster was under extreme
> memory pressure and a lot of pods were OOM killed, including some of the
> Ceph OSDs, but after the memory pressure abated everything seemed to
> stabilize for about a day.
> 
> Then we attempted to set a 4gb memory limit on the OSD pods, because they
> had been using upwards of 100gb of ram(!) per OSD after about a month of
> uptime, and this was a contributing factor in the cluster-wide OOM
> situation.  Everything seemed fine for a few minutes after Rook rolled out
> the memory limit, but then OSDs gradually started to crash, a few at a
> time, up to about 30 of them.  At this point I reverted the memory limit,
> but I don't think the OSDs were hitting their memory limits at all.  In an
> attempt to stabilize the cluster, we eventually stopped the Rook operator and
> set the osd norebalance, nobackfill, noout, and norecover flags, but at this
> point there were 46 OSDs down and pools were hitting BackFillFull.
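> 
> For reference, this is roughly what that looked like (assuming the
> default rook-ceph namespace and operator deployment name):
> 
>     # stop the Rook operator so it doesn't undo manual changes
>     kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=0
>     # pause data movement and recovery cluster-wide
>     ceph osd set noout
>     ceph osd set norebalance
>     ceph osd set nobackfill
>     ceph osd set norecover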
> 
> This is a Rook-Ceph deployment on a bare-metal Kubernetes cluster of 12
> nodes.  Each node has two 7TiB nvme disks dedicated to Ceph, and we have 5
> BlueStore OSDs per nvme disk (so around 1.4TiB per OSD, which ought to be
> fine with a 4gb memory target, right?).  The crash we're seeing looks very
> much like the one in this bug report: https://tracker.ceph.com/issues/52220
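> 
> In case it's relevant: the limit was applied through the CephCluster
> resources spec, and my understanding is that Rook derives
> osd_memory_target from it.  Pinning the target explicitly would look
> something like this (value in bytes; 4 GiB shown):
> 
>     ceph config get osd osd_memory_target   # check the current target
>     ceph config set osd osd_memory_target 4294967296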
> 
> I don't know how to proceed from here, so any advice would be very much
> appreciated.
> 
> Ceph version: 16.2.6
> Rook version: 1.7.6
> Kubernetes version: 1.21.5
> Kernel version: 5.4.156-1.el7.elrepo.x86_64
> Distro: CentOS 7.9
> 
> I've also attached the full log output from one of the crashing OSDs, in
> case that is of any use.
> 
> ----begin stack trace paste----
> debug     -1> 2022-01-24T22:09:09.405+0000 7ff8b4315700 -1
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/osd/ECUtil.cc:
> In function 'void ECUtil::HashInfo::append(uint64_t, std::map<int,
> ceph::buffer::v15_2_0::list>&)' thread 7ff8b4315700 time
> 2022-01-24T22:09:09.398961+0000
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/osd/ECUtil.cc:
> 169: FAILED ceph_assert(to_append.size() == cumulative_shard_hashes.size())
> 
> ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific
> (stable)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x158) [0x564f88db554c]
> 2: ceph-osd(+0x56a766) [0x564f88db5766]
> 3: (ECUtil::HashInfo::append(unsigned long, std::map<int,
> ceph::buffer::v15_2_0::list, std::less<int>, std::allocator<std::pair<int
> const, ceph::buffer::v15_2_0::list> > >&)+0x14b) [0x564f8910ca0b]
> 4: (encode_and_write(pg_t, hobject_t const&, ECUtil::stripe_info_t const&,
> std::shared_ptr<ceph::ErasureCodeInterface>&, std::set<int, std::less<int>,
> std::allocator<int> > const&, unsigned long, ceph::buffer::v15_2_0::list,
> unsigned int, std::shared_ptr<ECUtil::HashInfo>, interval_map<unsigned
> long, ceph::buffer::v15_2_0::list, bl_split_merge>&, std::map<shard_id_t,
> ceph::os::Transaction, std::less<shard_id_t>,
> std::allocator<std::pair<shard_id_t const, ceph::os::Transaction> > >*,
> DoutPrefixProvider*)+0x6ec) [0x564f8929fa7c]
> 5: ceph-osd(+0xa5a611) [0x564f892a5611]
> 6: (ECTransaction::generate_transactions(ECTransaction::WritePlan&,
> std::shared_ptr<ceph::ErasureCodeInterface>&, pg_t, ECUtil::stripe_info_t
> const&, std::map<hobject_t, interval_map<unsigned long,
> ceph::buffer::v15_2_0::list, bl_split_merge>, std::less<hobject_t>,
> std::allocator<std::pair<hobject_t const, interval_map<unsigned long,
> ceph::buffer::v15_2_0::list, bl_split_merge> > > > const&,
> std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&,
> std::map<hobject_t, interval_map<unsigned long,
> ceph::buffer::v15_2_0::list, bl_split_merge>, std::less<hobject_t>,
> std::allocator<std::pair<hobject_t const, interval_map<unsigned long,
> ceph::buffer::v15_2_0::list, bl_split_merge> > > >*, std::map<shard_id_t,
> ceph::os::Transaction, std::less<shard_id_t>,
> std::allocator<std::pair<shard_id_t const, ceph::os::Transaction> > >*,
> std::set<hobject_t, std::less<hobject_t>, std::allocator<hobject_t> >*,
> std::set<hobject_t, std::less<hobject_t>, std::allocator<hobject_t> >*,
> DoutPrefixProvider*, ceph_release_t)+0x7db) [0x564f892a6dcb]
> 7: (ECBackend::try_reads_to_commit()+0x468) [0x564f8927ec28]
> 8: (ECBackend::check_ops()+0x24) [0x564f89281cd4]
> 9: (CallClientContexts::finish(std::pair<RecoveryMessages*,
> ECBackend::read_result_t&>&)+0x1278) [0x564f8929d338]
> 10: (ECBackend::complete_read_op(ECBackend::ReadOp&,
> RecoveryMessages*)+0x8f) [0x564f8926dfaf]
> 11: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&,
> RecoveryMessages*, ZTracer::Trace const&)+0x1196) [0x564f89287106]
> 12: (ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x18f)
> [0x564f89287bdf]
> 13: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x52)
> [0x564f8908dd12]
> 14: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&,
> ThreadPool::TPHandle&)+0x5de) [0x564f89030d6e]
> 15: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
> boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x309)
> [0x564f88eba1b9]
> 16: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*,
> boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x68) [0x564f89117868]
> 17: (OSD::ShardedOpWQ::_process(unsigned int,
> ceph::heartbeat_handle_d*)+0xa58) [0x564f88eda1e8]
> 18: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4)
> [0x564f895456c4]
> 19: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x564f89548364]
> 20: /lib64/libpthread.so.0(+0x814a) [0x7ff8db40e14a]
> 21: clone()
> 
> debug      0> 2022-01-24T22:09:09.411+0000 7ff8b4315700 -1 *** Caught
> signal (Aborted) **
> in thread 7ff8b4315700 thread_name:tp_osd_tp
> ----end paste----
> 
> # ceph status
>  cluster:
>    id:     a262fadd-b995-4861-9cb0-06c1f1eddaf7
>    health: HEALTH_ERR
>            1 MDSs report slow metadata IOs
>            1277/19235302 objects unfound (0.007%)
>            noout,nobackfill,norebalance,norecover flag(s) set
>            1 backfillfull osd(s)
>            46 osds down
>            15 nearfull osd(s)
>            Reduced data availability: 1470 pgs inactive, 577 pgs down,
> 1615 pgs stale
>            Possible data damage: 22 pgs recovery_unfound
>            Degraded data redundancy: 18956079/115409079 objects degraded
> (16.425%), 1942 pgs degraded, 1941 pgs undersized
>            13 pool(s) backfillfull
>            1309 daemons have recently crashed
> 
>  services:
>    mon: 3 daemons, quorum b,c,d (age 2d)
>    mgr: a(active, since 2d)
>    mds: 1/1 daemons up, 1 hot standby
>    osd: 120 osds: 74 up (since 27s), 120 in (since 38m); 1817 remapped pgs
>         flags noout,nobackfill,norebalance,norecover
> 
>  data:
>    volumes: 1/1 healthy
>    pools:   13 pools, 3234 pgs
>    objects: 19.24M objects, 72 TiB
>    usage:   80 TiB used, 41 TiB / 122 TiB avail
>    pgs:     45.455% pgs not active
>             18956079/115409079 objects degraded (16.425%)
>             2329606/115409079 objects misplaced (2.019%)
>             1277/19235302 objects unfound (0.007%)
>             326 undersized+degraded+remapped+backfill_wait+peered
>             325 active+clean
>             325 stale+active+undersized+degraded+remapped+backfill_wait
>             319 active+undersized+degraded+remapped+backfill_wait
>             311 stale+undersized+degraded+remapped+backfill_wait+peered
>             302 stale+active+clean
>             278 stale+down+remapped
>             217 down+remapped
>             149 active+recovery_wait+undersized+degraded+remapped
>             127 stale+active+recovery_wait+undersized+degraded+remapped
>             119 stale+recovery_wait+undersized+degraded+remapped+peered
>             107 recovery_wait+undersized+degraded+remapped+peered
>             57  active+undersized+degraded
>             50  stale+active+undersized+degraded
>             46  down
>             36  stale+down
>             31  active+remapped+backfill_wait
>             30  stale+active+remapped+backfill_wait
>             10  active+recovery_unfound+degraded
>             9   stale+active+undersized+remapped+backfill_wait
>             7   stale+undersized+degraded+remapped+backfilling+peered
>             6   active+undersized
>             6   undersized+degraded+remapped+backfilling+peered
>             5   active+undersized+remapped+backfill_wait
>             5   stale+active+recovery_unfound+degraded
>             4   stale+active+recovery_wait+degraded
>             4   active+recovery_wait+degraded+remapped
>             3   undersized+remapped+backfill_wait+peered
>             3   recovery_unfound+undersized+degraded+remapped+peered
>             3   stale+recovery_unfound+undersized+degraded+remapped+peered
>             3   stale+undersized+degraded+peered
>             3   stale+undersized+remapped+backfill_wait+peered
>             2   active+recovery_wait+degraded
>             2   stale+active+recovery_wait+degraded+remapped
>             1   recovery_wait+undersized+degraded+peered
>             1   active+recovery_unfound+undersized+degraded
>             1   active+clean+remapped
>             1   stale+recovery_wait+undersized+degraded+peered

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


