I have a cluster where 46 out of 120 OSDs have begun crash looping with the same stack trace (see the pasted output below). Unsurprisingly, the cluster is in a very bad state with this many OSDs down.

The day before this problem showed up, the k8s cluster was under extreme memory pressure and a lot of pods were OOM-killed, including some of the Ceph OSDs, but after the memory pressure abated everything seemed to stabilize for about a day. We then attempted to set a 4 GiB memory limit on the OSD pods, because they had been using upwards of 100 GB of RAM(!) per OSD after about a month of uptime, and that was a contributing factor in the cluster-wide OOM situation. Everything seemed fine for a few minutes after Rook rolled out the memory limit, but then OSDs gradually started to crash, a few at a time, up to about 30 of them. At that point I reverted the memory limit, although I don't think the OSDs were actually hitting it. In an attempt to stabilize the cluster, we eventually stopped the Rook operator and set the norebalance, nobackfill, noout, and norecover OSD flags, but by then 46 OSDs were down and pools were hitting backfillfull.

This is a Rook-Ceph deployment on a bare-metal Kubernetes cluster of 12 nodes. Each node has two 7 TiB NVMe disks dedicated to Ceph, with 5 BlueStore OSDs per NVMe disk (so around 1.4 TiB per OSD, which ought to be fine with a 4 GiB memory target, right?).

The crash we're seeing looks very much like the one in this bug report: https://tracker.ceph.com/issues/52220

I don't know how to proceed from here, so any advice would be very much appreciated.

Ceph version: 16.2.6
Rook version: 1.7.6
Kubernetes version: 1.21.5
Kernel version: 5.4.156-1.el7.elrepo.x86_64
Distro: CentOS 7.9

I've also attached the full log output from one of the crashing OSDs, in case that is of any use.
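For reference, the memory limit was applied through the resource settings in the CephCluster CR. This is only an approximation reconstructed from memory, not a copy of our actual manifest, but it was essentially:

    # CephCluster spec excerpt (approximate, reconstructed from memory)
    spec:
      resources:
        osd:
          requests:
            memory: "4Gi"
          limits:
            memory: "4Gi"

My understanding is that Rook/Ceph also derive the OSD's osd_memory_target from the pod memory request, which is where the 4 GiB memory target I mentioned above comes from; please correct me if I've got that wrong.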
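For completeness, the flags were set with the standard commands:

    ceph osd set noout
    ceph osd set nobackfill
    ceph osd set norebalance
    ceph osd set norecover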
----begin stack trace paste----
debug    -1> 2022-01-24T22:09:09.405+0000 7ff8b4315700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/osd/ECUtil.cc: In function 'void ECUtil::HashInfo::append(uint64_t, std::map<int, ceph::buffer::v15_2_0::list>&)' thread 7ff8b4315700 time 2022-01-24T22:09:09.398961+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/osd/ECUtil.cc: 169: FAILED ceph_assert(to_append.size() == cumulative_shard_hashes.size())

 ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x564f88db554c]
 2: ceph-osd(+0x56a766) [0x564f88db5766]
 3: (ECUtil::HashInfo::append(unsigned long, std::map<int, ceph::buffer::v15_2_0::list, std::less<int>, std::allocator<std::pair<int const, ceph::buffer::v15_2_0::list> > >&)+0x14b) [0x564f8910ca0b]
 4: (encode_and_write(pg_t, hobject_t const&, ECUtil::stripe_info_t const&, std::shared_ptr<ceph::ErasureCodeInterface>&, std::set<int, std::less<int>, std::allocator<int> > const&, unsigned long, ceph::buffer::v15_2_0::list, unsigned int, std::shared_ptr<ECUtil::HashInfo>, interval_map<unsigned long, ceph::buffer::v15_2_0::list, bl_split_merge>&, std::map<shard_id_t, ceph::os::Transaction, std::less<shard_id_t>, std::allocator<std::pair<shard_id_t const, ceph::os::Transaction> > >*, DoutPrefixProvider*)+0x6ec) [0x564f8929fa7c]
 5: ceph-osd(+0xa5a611) [0x564f892a5611]
 6: (ECTransaction::generate_transactions(ECTransaction::WritePlan&, std::shared_ptr<ceph::ErasureCodeInterface>&, pg_t, ECUtil::stripe_info_t const&, std::map<hobject_t, interval_map<unsigned long, ceph::buffer::v15_2_0::list, bl_split_merge>, std::less<hobject_t>, std::allocator<std::pair<hobject_t const, interval_map<unsigned long, ceph::buffer::v15_2_0::list, bl_split_merge> > > > const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&, std::map<hobject_t, interval_map<unsigned long, ceph::buffer::v15_2_0::list, bl_split_merge>, std::less<hobject_t>, std::allocator<std::pair<hobject_t const, interval_map<unsigned long, ceph::buffer::v15_2_0::list, bl_split_merge> > > >*, std::map<shard_id_t, ceph::os::Transaction, std::less<shard_id_t>, std::allocator<std::pair<shard_id_t const, ceph::os::Transaction> > >*, std::set<hobject_t, std::less<hobject_t>, std::allocator<hobject_t> >*, std::set<hobject_t, std::less<hobject_t>, std::allocator<hobject_t> >*, DoutPrefixProvider*, ceph_release_t)+0x7db) [0x564f892a6dcb]
 7: (ECBackend::try_reads_to_commit()+0x468) [0x564f8927ec28]
 8: (ECBackend::check_ops()+0x24) [0x564f89281cd4]
 9: (CallClientContexts::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x1278) [0x564f8929d338]
 10: (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x8f) [0x564f8926dfaf]
 11: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*, ZTracer::Trace const&)+0x1196) [0x564f89287106]
 12: (ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x18f) [0x564f89287bdf]
 13: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x52) [0x564f8908dd12]
 14: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5de) [0x564f89030d6e]
 15: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x309) [0x564f88eba1b9]
 16: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x68) [0x564f89117868]
 17: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xa58) [0x564f88eda1e8]
 18: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x564f895456c4]
 19: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x564f89548364]
 20: /lib64/libpthread.so.0(+0x814a) [0x7ff8db40e14a]
 21: clone()

debug     0> 2022-01-24T22:09:09.411+0000 7ff8b4315700 -1 *** Caught signal (Aborted) **
 in thread 7ff8b4315700 thread_name:tp_osd_tp
----end paste----

# ceph status
  cluster:
    id:     a262fadd-b995-4861-9cb0-06c1f1eddaf7
    health: HEALTH_ERR
            1 MDSs report slow metadata IOs
            1277/19235302 objects unfound (0.007%)
            noout,nobackfill,norebalance,norecover flag(s) set
            1 backfillfull osd(s)
            46 osds down
            15 nearfull osd(s)
            Reduced data availability: 1470 pgs inactive, 577 pgs down, 1615 pgs stale
            Possible data damage: 22 pgs recovery_unfound
            Degraded data redundancy: 18956079/115409079 objects degraded (16.425%), 1942 pgs degraded, 1941 pgs undersized
            13 pool(s) backfillfull
            1309 daemons have recently crashed

  services:
    mon: 3 daemons, quorum b,c,d (age 2d)
    mgr: a(active, since 2d)
    mds: 1/1 daemons up, 1 hot standby
    osd: 120 osds: 74 up (since 27s), 120 in (since 38m); 1817 remapped pgs
         flags noout,nobackfill,norebalance,norecover

  data:
    volumes: 1/1 healthy
    pools:   13 pools, 3234 pgs
    objects: 19.24M objects, 72 TiB
    usage:   80 TiB used, 41 TiB / 122 TiB avail
    pgs:     45.455% pgs not active
             18956079/115409079 objects degraded (16.425%)
             2329606/115409079 objects misplaced (2.019%)
             1277/19235302 objects unfound (0.007%)
             326  undersized+degraded+remapped+backfill_wait+peered
             325  active+clean
             325  stale+active+undersized+degraded+remapped+backfill_wait
             319  active+undersized+degraded+remapped+backfill_wait
             311  stale+undersized+degraded+remapped+backfill_wait+peered
             302  stale+active+clean
             278  stale+down+remapped
             217  down+remapped
             149  active+recovery_wait+undersized+degraded+remapped
             127  stale+active+recovery_wait+undersized+degraded+remapped
             119  stale+recovery_wait+undersized+degraded+remapped+peered
             107  recovery_wait+undersized+degraded+remapped+peered
             57   active+undersized+degraded
             50   stale+active+undersized+degraded
             46   down
             36   stale+down
             31   active+remapped+backfill_wait
             30   stale+active+remapped+backfill_wait
             10   active+recovery_unfound+degraded
             9    stale+active+undersized+remapped+backfill_wait
             7    stale+undersized+degraded+remapped+backfilling+peered
             6    active+undersized
             6    undersized+degraded+remapped+backfilling+peered
             5    active+undersized+remapped+backfill_wait
             5    stale+active+recovery_unfound+degraded
             4    stale+active+recovery_wait+degraded
             4    active+recovery_wait+degraded+remapped
             3    undersized+remapped+backfill_wait+peered
             3    recovery_unfound+undersized+degraded+remapped+peered
             3    stale+recovery_unfound+undersized+degraded+remapped+peered
             3    stale+undersized+degraded+peered
             3    stale+undersized+remapped+backfill_wait+peered
             2    active+recovery_wait+degraded
             2    stale+active+recovery_wait+degraded+remapped
             1    recovery_wait+undersized+degraded+peered
             1    active+recovery_unfound+undersized+degraded
             1    active+clean+remapped
             1    stale+recovery_wait+undersized+degraded+peered

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx