Hi Benjamin,

Apologies that I can't help with the bluestore issue.

But that huge 100GB OSD consumption could be related to similar
reports linked here: https://tracker.ceph.com/issues/53729

Does your cluster have the pglog_hardlimit flag set?

# ceph osd dump | grep pglog
flags sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit

Do you have PGs with really long pglogs?

# ceph pg dump | grep + | awk '{print $10, $11, $12}' | sort -n | tail

-- Dan

On Tue, Jan 25, 2022 at 12:44 AM Benjamin Staffin
<bstaffin@xxxxxxxxxxxxxxx> wrote:
>
> I have a cluster where 46 out of 120 OSDs have begun crash looping with the
> same stack trace (see pasted output below). The cluster is in a very bad
> state with this many OSDs down, unsurprisingly.
>
> The day before this problem showed up, the k8s cluster was under extreme
> memory pressure and a lot of pods were OOM killed, including some of the
> Ceph OSDs, but after the memory pressure abated everything seemed to
> stabilize for about a day.
>
> Then we attempted to set a 4gb memory limit on the OSD pods, because they
> had been using upwards of 100gb of ram(!) per OSD after about a month of
> uptime, and this was a contributing factor in the cluster-wide OOM
> situation. Everything seemed fine for a few minutes after Rook rolled out
> the memory limit, but then OSDs gradually started to crash, a few at a
> time, up to about 30 of them. At this point I reverted the memory limit,
> but I don't think the OSDs were hitting their memory limits at all. In an
> attempt to stabilize the cluster, we eventually the Rook operator and set
> the osd norebalance, nobackfill, noout, and norecover flags, but at this
> point there were 46 OSDs down and pools were hitting BackFillFull.
>
> This is a Rook-ceph deployment on a bare-metal Kubernetes cluster of 12
> nodes.
> Each node has two 7TiB nvme disks dedicated to Ceph, and we have 5
> BlueStore OSDs per nvme disk (so around 1.4TiB per OSD, which ought to be
> fine with a 4gb memory target, right?). The crash we're seeing looks very
> much like the one in this bug report: https://tracker.ceph.com/issues/52220
>
> I don't know how to proceed from here, so any advice would be very much
> appreciated.
>
> Ceph version: 16.2.6
> Rook version: 1.7.6
> Kubernetes version: 1.21.5
> Kernel version: 5.4.156-1.el7.elrepo.x86_64
> Distro: CentOS 7.9
>
> I've also attached the full log output from one of the crashing OSDs, in
> case that is of any use.
>
> ----begin stack trace paste----
> debug -1> 2022-01-24T22:09:09.405+0000 7ff8b4315700 -1
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/osd/ECUtil.cc:
> In function 'void ECUtil::HashInfo::append(uint64_t, std::map<int,
> ceph::buffer::v15_2_0::list>&)' thread 7ff8b4315700 time
> 2022-01-24T22:09:09.398961+0000
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/osd/ECUtil.cc:
> 169: FAILED ceph_assert(to_append.size() == cumulative_shard_hashes.size())
>
> ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific
> (stable)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x158) [0x564f88db554c]
> 2: ceph-osd(+0x56a766) [0x564f88db5766]
> 3: (ECUtil::HashInfo::append(unsigned long, std::map<int,
> ceph::buffer::v15_2_0::list, std::less<int>, std::allocator<std::pair<int
> const, ceph::buffer::v15_2_0::list> > >&)+0x14b) [0x564f8910ca0b]
> 4: (encode_and_write(pg_t, hobject_t const&, ECUtil::stripe_info_t const&,
> std::shared_ptr<ceph::ErasureCodeInterface>&, std::set<int, std::less<int>,
> std::allocator<int> > const&, unsigned
> long, ceph::buffer::v15_2_0::list,
> unsigned int, std::shared_ptr<ECUtil::HashInfo>, interval_map<unsigned
> long, ceph::buffer::v15_2_0::list, bl_split_merge>&, std::map<shard_id_t,
> ceph::os::Transaction, std::less<shard_id_t>,
> std::allocator<std::pair<shard_id_t const, ceph::os::Transaction> > >*,
> DoutPrefixProvider*)+0x6ec) [0x564f8929fa7c]
> 5: ceph-osd(+0xa5a611) [0x564f892a5611]
> 6: (ECTransaction::generate_transactions(ECTransaction::WritePlan&,
> std::shared_ptr<ceph::ErasureCodeInterface>&, pg_t, ECUtil::stripe_info_t
> const&, std::map<hobject_t, interval_map<unsigned long,
> ceph::buffer::v15_2_0::list, bl_split_merge>, std::less<hobject_t>,
> std::allocator<std::pair<hobject_t const, interval_map<unsigned long,
> ceph::buffer::v15_2_0::list, bl_split_merge> > > > const&,
> std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&,
> std::map<hobject_t, interval_map<unsigned long,
> ceph::buffer::v15_2_0::list, bl_split_merge>, std::less<hobject_t>,
> std::allocator<std::pair<hobject_t const, interval_map<unsigned long,
> ceph::buffer::v15_2_0::list, bl_split_merge> > > >*, std::map<shard_id_t,
> ceph::os::Transaction, std::less<shard_id_t>,
> std::allocator<std::pair<shard_id_t const, ceph::os::Transaction> > >*,
> std::set<hobject_t, std::less<hobject_t>, std::allocator<hobject_t> >*,
> std::set<hobject_t, std::less<hobject_t>, std::allocator<hobject_t> >*,
> DoutPrefixProvider*, ceph_release_t)+0x7db) [0x564f892a6dcb]
> 7: (ECBackend::try_reads_to_commit()+0x468) [0x564f8927ec28]
> 8: (ECBackend::check_ops()+0x24) [0x564f89281cd4]
> 9: (CallClientContexts::finish(std::pair<RecoveryMessages*,
> ECBackend::read_result_t&>&)+0x1278) [0x564f8929d338]
> 10: (ECBackend::complete_read_op(ECBackend::ReadOp&,
> RecoveryMessages*)+0x8f) [0x564f8926dfaf]
> 11: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&,
> RecoveryMessages*, ZTracer::Trace const&)+0x1196) [0x564f89287106]
> 12:
> (ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x18f)
> [0x564f89287bdf]
> 13: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x52)
> [0x564f8908dd12]
> 14: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&,
> ThreadPool::TPHandle&)+0x5de) [0x564f89030d6e]
> 15: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
> boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x309)
> [0x564f88eba1b9]
> 16: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*,
> boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x68) [0x564f89117868]
> 17: (OSD::ShardedOpWQ::_process(unsigned int,
> ceph::heartbeat_handle_d*)+0xa58) [0x564f88eda1e8]
> 18: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4)
> [0x564f895456c4]
> 19: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x564f89548364]
> 20: /lib64/libpthread.so.0(+0x814a) [0x7ff8db40e14a]
> 21: clone()
>
> debug 0> 2022-01-24T22:09:09.411+0000 7ff8b4315700 -1 *** Caught
> signal (Aborted) **
> in thread 7ff8b4315700 thread_name:tp_osd_tp
> ----end paste----
>
> # ceph status
>   cluster:
>     id:     a262fadd-b995-4861-9cb0-06c1f1eddaf7
>     health: HEALTH_ERR
>             1 MDSs report slow metadata IOs
>             1277/19235302 objects unfound (0.007%)
>             noout,nobackfill,norebalance,norecover flag(s) set
>             1 backfillfull osd(s)
>             46 osds down
>             15 nearfull osd(s)
>             Reduced data availability: 1470 pgs inactive, 577 pgs down,
>             1615 pgs stale
>             Possible data damage: 22 pgs recovery_unfound
>             Degraded data redundancy: 18956079/115409079 objects degraded
>             (16.425%), 1942 pgs degraded, 1941 pgs undersized
>             13 pool(s) backfillfull
>             1309 daemons have recently crashed
>
>   services:
>     mon: 3 daemons, quorum b,c,d (age 2d)
>     mgr: a(active, since 2d)
>     mds: 1/1 daemons up, 1 hot standby
>     osd: 120 osds: 74 up (since 27s), 120 in (since 38m); 1817 remapped pgs
>          flags noout,nobackfill,norebalance,norecover
>
>   data:
>     volumes: 1/1 healthy
>     pools:   13 pools, 3234 pgs
>     objects: 19.24M objects, 72 TiB
>     usage:   80 TiB used,
>              41 TiB / 122 TiB avail
>     pgs:     45.455% pgs not active
>              18956079/115409079 objects degraded (16.425%)
>              2329606/115409079 objects misplaced (2.019%)
>              1277/19235302 objects unfound (0.007%)
>              326 undersized+degraded+remapped+backfill_wait+peered
>              325 active+clean
>              325 stale+active+undersized+degraded+remapped+backfill_wait
>              319 active+undersized+degraded+remapped+backfill_wait
>              311 stale+undersized+degraded+remapped+backfill_wait+peered
>              302 stale+active+clean
>              278 stale+down+remapped
>              217 down+remapped
>              149 active+recovery_wait+undersized+degraded+remapped
>              127 stale+active+recovery_wait+undersized+degraded+remapped
>              119 stale+recovery_wait+undersized+degraded+remapped+peered
>              107 recovery_wait+undersized+degraded+remapped+peered
>              57 active+undersized+degraded
>              50 stale+active+undersized+degraded
>              46 down
>              36 stale+down
>              31 active+remapped+backfill_wait
>              30 stale+active+remapped+backfill_wait
>              10 active+recovery_unfound+degraded
>              9 stale+active+undersized+remapped+backfill_wait
>              7 stale+undersized+degraded+remapped+backfilling+peered
>              6 active+undersized
>              6 undersized+degraded+remapped+backfilling+peered
>              5 active+undersized+remapped+backfill_wait
>              5 stale+active+recovery_unfound+degraded
>              4 stale+active+recovery_wait+degraded
>              4 active+recovery_wait+degraded+remapped
>              3 undersized+remapped+backfill_wait+peered
>              3 recovery_unfound+undersized+degraded+remapped+peered
>              3 stale+recovery_unfound+undersized+degraded+remapped+peered
>              3 stale+undersized+degraded+peered
>              3 stale+undersized+remapped+backfill_wait+peered
>              2 active+recovery_wait+degraded
>              2 stale+active+recovery_wait+degraded+remapped
>              1 recovery_wait+undersized+degraded+peered
>              1 active+recovery_unfound+undersized+degraded
>              1 active+clean+remapped
>              1 stale+recovery_wait+undersized+degraded+peered
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx