On Tue, Jan 25, 2022 at 4:07 PM Frank Schilder <frans@xxxxxx> wrote:
>
> Hi Dan,
>
> in several threads I have now seen statements like "Does your cluster have the pglog_hardlimit set?". In this context, I would be grateful if you could shed some light on the following:
>
> 1) How do I check that?

There is no equivalent "osd get pglog_hardlimit". I showed how to query for it:

# ceph osd dump | grep pglog
flags sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit

> 2) What is the recommendation?

Since pacific it should be on by default, but I haven't had any user confirm this fact. (On our clusters we have enabled it manually when it was added to nautilus.)

> In the ceph documentation, the only occurrences of the term pglog_hardlimit are in the release notes for luminous and mimic, stating (mimic)
>
> > A flag called pglog_hardlimit has been introduced, which is off by default. Enabling this flag will limit the
> > length of the pg log. In order to enable that, the flag must be set by running ceph osd set pglog_hardlimit
> > after completely upgrading to 13.2.2. Once the cluster has this flag set, the length of the pg log will be
> > capped by a hard limit. Once set, this flag must not be unset anymore. In luminous, this feature was
> > introduced in 12.2.11. Users who are running 12.2.11, and want to continue to use this feature, should
> > upgrade to 13.2.5 or later.
>
> How do I know if I want to use this feature? I would need a bit of information about pros and cons. Or should one have this enabled in any case? Would be great if you could provide some insight here.

Normally a pg log with even 10000 entries consumes just a couple of hundred MB of memory. (See the osd_pglog mempool.)

The pg log length can be queried like I showed earlier:

# ceph pg dump | grep + | awk '{print $10, $11, $12}' | sort -n | tail

(Those are the LOG columns in the pg output.)

In the past I've seen pg logs with millions of entries. Those are surely a root cause for huge memory usage, especially at OSD boot time. Such pglogs would need to be trimmed, e.g. with the ceph-objectstore-tool recipes that have been shared around on the list. The pglog_hardlimit is meant to limit the growth of the PG log.

On the other hand, it is clear that even with reasonably sized PG logs, the memory can balloon for some unknown reason. The devs have asked a couple of times for dumps of those huge-memory-causing pglogs.

In this case -- Benjamin's issue -- I'm trying to understand whether it is related to:

* a huge pg log -- which would need trimming -- perhaps the pglog_hardlimit isn't on by default as designed, or
* a normal-sized pg log with some entries that are consuming huge amounts of memory (due to a yet-unsolved bug).

Thanks, Dan

> Thanks and best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Dan van der Ster <dvanders@xxxxxxxxx>
> Sent: 25 January 2022 11:56:38
> To: Benjamin Staffin
> Cc: Ceph Users; Matthew Wilder; Tara Fly
> Subject: Re: Fwd: Lots of OSDs crashlooping (DRAFT - feedback?)
>
> Hi Benjamin,
>
> Apologies that I can't help with the bluestore issue.
>
> But that huge 100GB OSD consumption could be related to similar reports linked here: https://tracker.ceph.com/issues/53729
>
> Does your cluster have the pglog_hardlimit set?
>
> # ceph osd dump | grep pglog
> flags sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit
>
> Do you have PGs with really long pglogs?
>
> # ceph pg dump | grep + | awk '{print $10, $11, $12}' | sort -n | tail
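
To be explicit about the memory side (a rough sketch only -- osd.0, the ceph-0 data path and pg 2.1f below are placeholders I'm using for illustration, and the jq path assumes the usual dump_mempools JSON layout), the per-OSD pglog memory can be read from the osd_pglog mempool on the node, or inside the container, running that OSD:

# ceph daemon osd.0 dump_mempools | jq '.mempool.by_pool.osd_pglog'   # osd.0 is a placeholder id

And if a pg log really does have millions of entries, the trimming recipes that have circulated on this list are roughly of this form, run only while the OSD in question is stopped (I haven't re-verified the exact invocation here):

# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --op trim-pg-log --pgid 2.1f   # placeholder path and pgid; stop the OSD first
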
>
> -- Dan
>
> On Tue, Jan 25, 2022 at 12:44 AM Benjamin Staffin <bstaffin@xxxxxxxxxxxxxxx> wrote:
> >
> > I have a cluster where 46 out of 120 OSDs have begun crash looping with the same stack trace (see pasted output below). The cluster is in a very bad state with this many OSDs down, unsurprisingly.
> >
> > The day before this problem showed up, the k8s cluster was under extreme memory pressure and a lot of pods were OOM killed, including some of the Ceph OSDs, but after the memory pressure abated everything seemed to stabilize for about a day.
> >
> > Then we attempted to set a 4gb memory limit on the OSD pods, because they had been using upwards of 100gb of ram(!) per OSD after about a month of uptime, and this was a contributing factor in the cluster-wide OOM situation. Everything seemed fine for a few minutes after Rook rolled out the memory limit, but then OSDs gradually started to crash, a few at a time, up to about 30 of them. At this point I reverted the memory limit, but I don't think the OSDs were hitting their memory limits at all. In an attempt to stabilize the cluster, we eventually stopped the Rook operator and set the osd norebalance, nobackfill, noout, and norecover flags, but at this point there were 46 OSDs down and pools were hitting BackFillFull.
> >
> > This is a Rook-ceph deployment on a bare-metal kubernetes cluster of 12 nodes. Each node has two 7TiB nvme disks dedicated to Ceph, and we have 5 BlueStore OSDs per nvme disk (so around 1.4TiB per OSD, which ought to be fine with a 4gb memory target, right?). The crash we're seeing looks very much like the one in this bug report: https://tracker.ceph.com/issues/52220
> >
> > I don't know how to proceed from here, so any advice would be very much appreciated.
> >
> > Ceph version: 16.2.6
> > Rook version: 1.7.6
> > Kubernetes version: 1.21.5
> > Kernel version: 5.4.156-1.el7.elrepo.x86_64
> > Distro: CentOS 7.9
> >
> > I've also attached the full log output from one of the crashing OSDs, in case that is of any use.
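
A side note on the memory limit described above: the container limit and Ceph's own osd_memory_target are separate knobs, and as far as I understand the target is best-effort (it shrinks the caches, it cannot reclaim a ballooning pglog), which is how an OSD can end up far above it. A minimal sketch for checking and lowering the target cluster-wide -- the 3 GiB value is only an example and should sit comfortably below whatever pod limit gets applied:

# ceph config get osd osd_memory_target
# ceph config set osd osd_memory_target 3221225472   # 3 GiB; example value only
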
> >
> > ----begin stack trace paste----
> > debug -1> 2022-01-24T22:09:09.405+0000 7ff8b4315700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/osd/ECUtil.cc: In function 'void ECUtil::HashInfo::append(uint64_t, std::map<int, ceph::buffer::v15_2_0::list>&)' thread 7ff8b4315700 time 2022-01-24T22:09:09.398961+0000
> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/osd/ECUtil.cc: 169: FAILED ceph_assert(to_append.size() == cumulative_shard_hashes.size())
> >
> > ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)
> > 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158) [0x564f88db554c]
> > 2: ceph-osd(+0x56a766) [0x564f88db5766]
> > 3: (ECUtil::HashInfo::append(unsigned long, std::map<int, ceph::buffer::v15_2_0::list, std::less<int>, std::allocator<std::pair<int const, ceph::buffer::v15_2_0::list> > >&)+0x14b) [0x564f8910ca0b]
> > 4: (encode_and_write(pg_t, hobject_t const&, ECUtil::stripe_info_t const&, std::shared_ptr<ceph::ErasureCodeInterface>&, std::set<int, std::less<int>, std::allocator<int> > const&, unsigned long, ceph::buffer::v15_2_0::list, unsigned int, std::shared_ptr<ECUtil::HashInfo>, interval_map<unsigned long, ceph::buffer::v15_2_0::list, bl_split_merge>&, std::map<shard_id_t, ceph::os::Transaction, std::less<shard_id_t>, std::allocator<std::pair<shard_id_t const, ceph::os::Transaction> > >*, DoutPrefixProvider*)+0x6ec) [0x564f8929fa7c]
> > 5: ceph-osd(+0xa5a611) [0x564f892a5611]
> > 6: (ECTransaction::generate_transactions(ECTransaction::WritePlan&, std::shared_ptr<ceph::ErasureCodeInterface>&, pg_t, ECUtil::stripe_info_t const&, std::map<hobject_t, interval_map<unsigned long, ceph::buffer::v15_2_0::list, bl_split_merge>, std::less<hobject_t>, std::allocator<std::pair<hobject_t const, interval_map<unsigned long, ceph::buffer::v15_2_0::list, bl_split_merge> > > > const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&, std::map<hobject_t, interval_map<unsigned long, ceph::buffer::v15_2_0::list, bl_split_merge>, std::less<hobject_t>, std::allocator<std::pair<hobject_t const, interval_map<unsigned long, ceph::buffer::v15_2_0::list, bl_split_merge> > > >*, std::map<shard_id_t, ceph::os::Transaction, std::less<shard_id_t>, std::allocator<std::pair<shard_id_t const, ceph::os::Transaction> > >*, std::set<hobject_t, std::less<hobject_t>, std::allocator<hobject_t> >*, std::set<hobject_t, std::less<hobject_t>, std::allocator<hobject_t> >*, DoutPrefixProvider*, ceph_release_t)+0x7db) [0x564f892a6dcb]
> > 7: (ECBackend::try_reads_to_commit()+0x468) [0x564f8927ec28]
> > 8: (ECBackend::check_ops()+0x24) [0x564f89281cd4]
> > 9: (CallClientContexts::finish(std::pair<RecoveryMessages*, ECBackend::read_result_t&>&)+0x1278) [0x564f8929d338]
> > 10: (ECBackend::complete_read_op(ECBackend::ReadOp&, RecoveryMessages*)+0x8f) [0x564f8926dfaf]
> > 11: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&, RecoveryMessages*, ZTracer::Trace const&)+0x1196) [0x564f89287106]
> > 12: (ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x18f) [0x564f89287bdf]
> > 13: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x52) [0x564f8908dd12]
> > 14: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)+0x5de) [0x564f89030d6e]
> > 15: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x309) [0x564f88eba1b9]
> > 16: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x68) [0x564f89117868]
> > 17: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xa58) [0x564f88eda1e8]
> > 18: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4) [0x564f895456c4]
> > 19: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x564f89548364]
> > 20: /lib64/libpthread.so.0(+0x814a) [0x7ff8db40e14a]
> > 21: clone()
> >
> > debug 0> 2022-01-24T22:09:09.411+0000 7ff8b4315700 -1 *** Caught signal (Aborted) **
> > in thread 7ff8b4315700 thread_name:tp_osd_tp
> > ----end paste----
> >
> > # ceph status
> >   cluster:
> >     id:     a262fadd-b995-4861-9cb0-06c1f1eddaf7
> >     health: HEALTH_ERR
> >             1 MDSs report slow metadata IOs
> >             1277/19235302 objects unfound (0.007%)
> >             noout,nobackfill,norebalance,norecover flag(s) set
> >             1 backfillfull osd(s)
> >             46 osds down
> >             15 nearfull osd(s)
> >             Reduced data availability: 1470 pgs inactive, 577 pgs down, 1615 pgs stale
> >             Possible data damage: 22 pgs recovery_unfound
> >             Degraded data redundancy: 18956079/115409079 objects degraded (16.425%), 1942 pgs degraded, 1941 pgs undersized
> >             13 pool(s) backfillfull
> >             1309 daemons have recently crashed
> >
> >   services:
> >     mon: 3 daemons, quorum b,c,d (age 2d)
> >     mgr: a(active, since 2d)
> >     mds: 1/1 daemons up, 1 hot standby
> >     osd: 120 osds: 74 up (since 27s), 120 in (since 38m); 1817 remapped pgs
> >          flags noout,nobackfill,norebalance,norecover
> >
> >   data:
> >     volumes: 1/1 healthy
> >     pools:   13 pools, 3234 pgs
> >     objects: 19.24M objects, 72 TiB
> >     usage:   80 TiB used, 41 TiB / 122 TiB avail
> >     pgs:     45.455% pgs not active
> >              18956079/115409079 objects degraded (16.425%)
> >              2329606/115409079 objects misplaced (2.019%)
> >              1277/19235302 objects unfound (0.007%)
> >              326 undersized+degraded+remapped+backfill_wait+peered
> >              325 active+clean
> >              325 stale+active+undersized+degraded+remapped+backfill_wait
> >              319 active+undersized+degraded+remapped+backfill_wait
> >              311 stale+undersized+degraded+remapped+backfill_wait+peered
> >              302 stale+active+clean
> >              278 stale+down+remapped
> >              217 down+remapped
> >              149 active+recovery_wait+undersized+degraded+remapped
> >              127 stale+active+recovery_wait+undersized+degraded+remapped
> >              119 stale+recovery_wait+undersized+degraded+remapped+peered
> >              107 recovery_wait+undersized+degraded+remapped+peered
> >              57 active+undersized+degraded
> >              50 stale+active+undersized+degraded
> >              46 down
> >              36 stale+down
> >              31 active+remapped+backfill_wait
> >              30 stale+active+remapped+backfill_wait
> >              10 active+recovery_unfound+degraded
> >              9 stale+active+undersized+remapped+backfill_wait
> >              7 stale+undersized+degraded+remapped+backfilling+peered
> >              6 active+undersized
> >              6 undersized+degraded+remapped+backfilling+peered
> >              5 active+undersized+remapped+backfill_wait
> >              5 stale+active+recovery_unfound+degraded
> >              4 stale+active+recovery_wait+degraded
> >              4 active+recovery_wait+degraded+remapped
> >              3 undersized+remapped+backfill_wait+peered
> >              3 recovery_unfound+undersized+degraded+remapped+peered
> >              3 stale+recovery_unfound+undersized+degraded+remapped+peered
> >              3 stale+undersized+degraded+peered
> >              3 stale+undersized+remapped+backfill_wait+peered
> >              2 active+recovery_wait+degraded
> >              2 stale+active+recovery_wait+degraded+remapped
> >              1 recovery_wait+undersized+degraded+peered
> >              1 active+recovery_unfound+undersized+degraded
> >              1 active+clean+remapped
> >              1 stale+recovery_wait+undersized+degraded+peered
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx