Re: Fwd: Lots of OSDs crashlooping (DRAFT - feedback?)

Dan van der Ster <dvanders@xxxxxxxxx> · Tue, 25 Jan 2022 11:56:38 +0100

Hi Benjamin,

Apologies that I can't help with the BlueStore issue.

But the huge 100GB per-OSD memory consumption could be related to the
similar reports linked here: https://tracker.ceph.com/issues/53729

Does your cluster have the pglog_hardlimit flag set?

# ceph osd dump | grep pglog
flags sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit
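
If you'd rather check this from a script, here is a minimal sketch. It assumes the plain-text `ceph osd dump` output contains a line that begins with `flags` followed by a comma-separated list, as in the output above. The function names `osd_flags` and `has_pglog_hardlimit` are purely illustrative and not part of any Ceph tooling.

```python
def osd_flags(osd_dump_text):
    """Return the set of cluster flags found in plain-text `ceph osd dump` output.

    Assumption: the flags appear on a line of the form
    "flags flag1,flag2,...". An empty set is returned if no such line exists.
    """
    for line in osd_dump_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "flags":
            # A bare "flags" line with nothing after it means no flags are set.
            return set(fields[1].split(",")) if len(fields) > 1 else set()
    return set()


def has_pglog_hardlimit(osd_dump_text):
    """True if the pglog_hardlimit flag is present in the dump text."""
    return "pglog_hardlimit" in osd_flags(osd_dump_text)
```

To use it, capture the output of `ceph osd dump` (for example with `subprocess.run(..., capture_output=True, text=True)`) and pass the resulting string to `has_pglog_hardlimit`.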

Do you have PGs with really long pglogs?

# ceph pg dump | grep + | awk '{print $10, $11, $12}' | sort -n | tail
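
For reference, the awk columns above should correspond to LOG, DISK_LOG and STATE in the plain `ceph pg dump` output on this release, but column positions can shift between versions. A sketch that reads the JSON form instead avoids counting columns. It assumes `ceph pg dump --format json` returns per-PG entries with `pgid`, `log_size`, `ondisk_log_size` and `state` fields, under either `pg_map.pg_stats` or a top-level `pg_stats` key; please verify those names against your own output. `longest_pglogs` is a hypothetical helper name.

```python
import json


def longest_pglogs(pg_dump_json, top=10):
    """Return up to `top` PGs with the longest pg logs, longest first.

    Each row is (log_size, ondisk_log_size, pgid, state).
    Assumption: per-PG stats live under "pg_map" -> "pg_stats", or under a
    top-level "pg_stats" key; missing fields fall back to 0 or "?".
    """
    data = json.loads(pg_dump_json)
    pg_map = data.get("pg_map", data)
    stats = pg_map.get("pg_stats", [])
    rows = [
        (
            pg.get("log_size", 0),
            pg.get("ondisk_log_size", 0),
            pg.get("pgid", "?"),
            pg.get("state", "?"),
        )
        for pg in stats
    ]
    # Sort by log length only, largest first.
    rows.sort(key=lambda row: row[0], reverse=True)
    return rows[:top]
```

Unlike the `grep +` filter, this also includes PGs whose state has no `+` in it (plain `down`, for instance), which may be worth looking at here.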

-- Dan

On Tue, Jan 25, 2022 at 12:44 AM Benjamin Staffin
<bstaffin@xxxxxxxxxxxxxxx> wrote:
>
> I have a cluster where 46 out of 120 OSDs have begun crash looping with the
> same stack trace (see pasted output below).  The cluster is in a very bad
> state with this many OSDs down, unsurprisingly.
>
> The day before this problem showed up, the k8s cluster was under extreme
> memory pressure and a lot of pods were OOM killed, including some of the
> Ceph OSDs, but after the memory pressure abated everything seemed to
> stabilize for about a day.
>
> Then we attempted to set a 4GB memory limit on the OSD pods, because they
> had been using upwards of 100GB of RAM(!) per OSD after about a month of
> uptime, and this was a contributing factor in the cluster-wide OOM
> situation.  Everything seemed fine for a few minutes after Rook rolled out
> the memory limit, but then OSDs gradually started to crash, a few at a
> time, up to about 30 of them.  At this point I reverted the memory limit,
> but I don't think the OSDs were hitting their memory limits at all.  In an
> attempt to stabilize the cluster, we eventually stopped the Rook operator and
> set the osd norebalance, nobackfill, noout, and norecover flags, but at this
> point there were 46 OSDs down and pools were hitting backfillfull.
>
> This is a Rook-Ceph deployment on a bare-metal Kubernetes cluster of 12
> nodes.  Each node has two 7TiB NVMe disks dedicated to Ceph, and we have 5
> BlueStore OSDs per NVMe disk (so around 1.4TiB per OSD, which ought to be
> fine with a 4GB memory target, right?).  The crash we're seeing looks very
> much like the one in this bug report: https://tracker.ceph.com/issues/52220
>
> I don't know how to proceed from here, so any advice would be very much
> appreciated.
>
> Ceph version: 16.2.6
> Rook version: 1.7.6
> Kubernetes version: 1.21.5
> Kernel version: 5.4.156-1.el7.elrepo.x86_64
> Distro: CentOS 7.9
>
> I've also attached the full log output from one of the crashing OSDs, in
> case that is of any use.
>
> ----begin stack trace paste----
> debug     -1> 2022-01-24T22:09:09.405+0000 7ff8b4315700 -1
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/osd/ECUtil.cc:
> In function 'void ECUtil::HashInfo::append(uint64_t, std::map<int,
> ceph::buffer::v15_2_0::list>&)' thread 7ff8b4315700 time
> 2022-01-24T22:09:09.398961+0000
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/osd/ECUtil.cc:
> 169: FAILED ceph_assert(to_append.size() == cumulative_shard_hashes.size())
>
>  ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific
> (stable)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x158) [0x564f88db554c]
>  2: ceph-osd(+0x56a766) [0x564f88db5766]
>  3: (ECUtil::HashInfo::append(unsigned long, std::map<int,
> ceph::buffer::v15_2_0::list, std::less<int>, std::allocator<std::pair<int
> const, ceph::buffer::v15_2_0::list> > >&)+0x14b) [0x564f8910ca0b]
>  4: (encode_and_write(pg_t, hobject_t const&, ECUtil::stripe_info_t const&,
> std::shared_ptr<ceph::ErasureCodeInterface>&, std::set<int, std::less<int>,
> std::allocator<int> > const&, unsigned long, ceph::buffer::v15_2_0::list,
> unsigned int, std::shared_ptr<ECUtil::HashInfo>, interval_map<unsigned
> long, ceph::buffer::v15_2_0::list, bl_split_merge>&, std::map<shard_id_t,
> ceph::os::Transaction, std::less<shard_id_t>,
> std::allocator<std::pair<shard_id_t const, ceph::os::Transaction> > >*,
> DoutPrefixProvider*)+0x6ec) [0x564f8929fa7c]
>  5: ceph-osd(+0xa5a611) [0x564f892a5611]
>  6: (ECTransaction::generate_transactions(ECTransaction::WritePlan&,
> std::shared_ptr<ceph::ErasureCodeInterface>&, pg_t, ECUtil::stripe_info_t
> const&, std::map<hobject_t, interval_map<unsigned long,
> ceph::buffer::v15_2_0::list, bl_split_merge>, std::less<hobject_t>,
> std::allocator<std::pair<hobject_t const, interval_map<unsigned long,
> ceph::buffer::v15_2_0::list, bl_split_merge> > > > const&,
> std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&,
> std::map<hobject_t, interval_map<unsigned long,
> ceph::buffer::v15_2_0::list, bl_split_merge>, std::less<hobject_t>,
> std::allocator<std::pair<hobject_t const, interval_map<unsigned long,
> ceph::buffer::v15_2_0::list, bl_split_merge> > > >*, std::map<shard_id_t,
> ceph::os::Transaction, std::less<shard_id_t>,
> std::allocator<std::pair<shard_id_t const, ceph::os::Transaction> > >*,
> std::set<hobject_t, std::less<hobject_t>, std::allocator<hobject_t> >*,
> std::set<hobject_t, std::less<hobject_t>, std::allocator<hobject_t> >*,
> DoutPrefixProvider*, ceph_release_t)+0x7db) [0x564f892a6dcb]
>  7: (ECBackend::try_reads_to_commit()+0x468) [0x564f8927ec28]
>  8: (ECBackend::check_ops()+0x24) [0x564f89281cd4]
>  9: (CallClientContexts::finish(std::pair<RecoveryMessages*,
> ECBackend::read_result_t&>&)+0x1278) [0x564f8929d338]
>  10: (ECBackend::complete_read_op(ECBackend::ReadOp&,
> RecoveryMessages*)+0x8f) [0x564f8926dfaf]
>  11: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&,
> RecoveryMessages*, ZTracer::Trace const&)+0x1196) [0x564f89287106]
>  12: (ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x18f)
> [0x564f89287bdf]
>  13: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x52)
> [0x564f8908dd12]
>  14: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&,
> ThreadPool::TPHandle&)+0x5de) [0x564f89030d6e]
>  15: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
> boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x309)
> [0x564f88eba1b9]
>  16: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*,
> boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x68) [0x564f89117868]
>  17: (OSD::ShardedOpWQ::_process(unsigned int,
> ceph::heartbeat_handle_d*)+0xa58) [0x564f88eda1e8]
>  18: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4)
> [0x564f895456c4]
>  19: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x564f89548364]
>  20: /lib64/libpthread.so.0(+0x814a) [0x7ff8db40e14a]
>  21: clone()
>
> debug      0> 2022-01-24T22:09:09.411+0000 7ff8b4315700 -1 *** Caught
> signal (Aborted) **
>  in thread 7ff8b4315700 thread_name:tp_osd_tp
> ----end paste----
>
> # ceph status
>   cluster:
>     id:     a262fadd-b995-4861-9cb0-06c1f1eddaf7
>     health: HEALTH_ERR
>             1 MDSs report slow metadata IOs
>             1277/19235302 objects unfound (0.007%)
>             noout,nobackfill,norebalance,norecover flag(s) set
>             1 backfillfull osd(s)
>             46 osds down
>             15 nearfull osd(s)
>             Reduced data availability: 1470 pgs inactive, 577 pgs down, 1615 pgs stale
>             Possible data damage: 22 pgs recovery_unfound
>             Degraded data redundancy: 18956079/115409079 objects degraded (16.425%), 1942 pgs degraded, 1941 pgs undersized
>             13 pool(s) backfillfull
>             1309 daemons have recently crashed
>
>   services:
>     mon: 3 daemons, quorum b,c,d (age 2d)
>     mgr: a(active, since 2d)
>     mds: 1/1 daemons up, 1 hot standby
>     osd: 120 osds: 74 up (since 27s), 120 in (since 38m); 1817 remapped pgs
>          flags noout,nobackfill,norebalance,norecover
>
>   data:
>     volumes: 1/1 healthy
>     pools:   13 pools, 3234 pgs
>     objects: 19.24M objects, 72 TiB
>     usage:   80 TiB used, 41 TiB / 122 TiB avail
>     pgs:     45.455% pgs not active
>              18956079/115409079 objects degraded (16.425%)
>              2329606/115409079 objects misplaced (2.019%)
>              1277/19235302 objects unfound (0.007%)
>              326 undersized+degraded+remapped+backfill_wait+peered
>              325 active+clean
>              325 stale+active+undersized+degraded+remapped+backfill_wait
>              319 active+undersized+degraded+remapped+backfill_wait
>              311 stale+undersized+degraded+remapped+backfill_wait+peered
>              302 stale+active+clean
>              278 stale+down+remapped
>              217 down+remapped
>              149 active+recovery_wait+undersized+degraded+remapped
>              127 stale+active+recovery_wait+undersized+degraded+remapped
>              119 stale+recovery_wait+undersized+degraded+remapped+peered
>              107 recovery_wait+undersized+degraded+remapped+peered
>              57  active+undersized+degraded
>              50  stale+active+undersized+degraded
>              46  down
>              36  stale+down
>              31  active+remapped+backfill_wait
>              30  stale+active+remapped+backfill_wait
>              10  active+recovery_unfound+degraded
>              9   stale+active+undersized+remapped+backfill_wait
>              7   stale+undersized+degraded+remapped+backfilling+peered
>              6   active+undersized
>              6   undersized+degraded+remapped+backfilling+peered
>              5   active+undersized+remapped+backfill_wait
>              5   stale+active+recovery_unfound+degraded
>              4   stale+active+recovery_wait+degraded
>              4   active+recovery_wait+degraded+remapped
>              3   undersized+remapped+backfill_wait+peered
>              3   recovery_unfound+undersized+degraded+remapped+peered
>              3   stale+recovery_unfound+undersized+degraded+remapped+peered
>              3   stale+undersized+degraded+peered
>              3   stale+undersized+remapped+backfill_wait+peered
>              2   active+recovery_wait+degraded
>              2   stale+active+recovery_wait+degraded+remapped
>              1   recovery_wait+undersized+degraded+peered
>              1   active+recovery_unfound+undersized+degraded
>              1   active+clean+remapped
>              1   stale+recovery_wait+undersized+degraded+peered
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx