Re: Fwd: Lots of OSDs crashlooping (DRAFT - feedback?)

On Tue, Jan 25, 2022 at 4:07 PM Frank Schilder <frans@xxxxxx> wrote:
>
> Hi Dan,
>
> in several threads I have now seen statements like "Does your cluster have the pglog_hardlimit set?". In this context, I would be grateful if you could shed some light on the following:
>
> 1) How do I check that?
>
> There is no equivalent "osd get pglog_hardlimit".

I showed how to query for it:

# ceph osd dump | grep pglog
flags sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit

>
> 2) What is the recommendation?

Since pacific it should be on by default, but I haven't had any user
confirm that yet.
(On our clusters we enabled it manually back when it was added to
nautilus.)
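
If the flag turns out not to be there, enabling it is the one-liner
from the release notes you quote below (all daemons must already run
a release that supports it, and once set it must not be unset):

# ceph osd set pglog_hardlimit

Afterwards it should show up in the "ceph osd dump | grep pglog"
output above.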

>
> In the ceph documentation, the only occurrence of the term pglog_hardlimit are release notes for luminous and mimic, stating (mimic)
>
> > A flag called pglog_hardlimit has been introduced, which is off by default. Enabling this flag will limit the
> > length of the pg log. In order to enable that, the flag must be set by running ceph osd set pglog_hardlimit
> > after completely upgrading to 13.2.2. Once the cluster has this flag set, the length of the pg log will be
> > capped by a hard limit. Once set, this flag must not be unset anymore. In luminous, this feature was
> > introduced in 12.2.11. Users who are running 12.2.11, and want to continue to use this feature, should
> > upgrade to 13.2.5 or later.
>
> How do I know if I want to use this feature? I would need a bit of information about pros and cons. Or should one have this enabled in any case? Would be great if you could provide some insight here.

Normally even a pg log with 10000 entries consumes just a couple of
hundred MB of memory. (See the osd_pglog mempool.)
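
To see what the pg logs actually cost on a given OSD you can dump its
mempools -- a sketch, with a placeholder OSD id; run it on the host
where the OSD lives (or via "ceph tell osd.<id>" on recent releases):

# ceph daemon osd.0 dump_mempools | grep -A2 osd_pglog

The "items" and "bytes" fields of the osd_pglog pool give a rough idea
of how much memory the logs are holding.
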
The pg log length can be queried like I showed earlier:

# ceph pg dump | grep + | awk '{print $10, $11, $12}' | sort -n | tail

(those are the LOG columns in the pg dump output).

In the past I've seen pg logs with millions of entries. Those are
surely a root cause for huge memory usage, especially at OSD boot
time.
Such pglogs would need to be trimmed, e.g. with the
ceph-objectstore-tool recipes that have been shared around on the
list.
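
For reference, those recipes are roughly of this shape -- a sketch
only, with placeholder OSD id, data path and pgid, so double-check the
original thread before running it, and the OSD has to be stopped while
you do it:

# systemctl stop ceph-osd@0
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --pgid 2.7 --op trim-pg-log
# systemctl start ceph-osd@0

(In a Rook cluster the stop/start would mean scaling the OSD
deployment down and back up rather than using systemctl.)
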
The pglog_hardlimit is meant to limit the growth of the PG log.

On the other hand, it is clear that even with reasonably sized PG
logs, memory can balloon for some unknown reason.
The devs have asked a couple of times for dumps of the pg logs that
cause this huge memory use when replayed.
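
If one of the suspect PGs can be caught with its OSD offline, dumping
its log for the tracker should look something like this (again a
sketch, with placeholder data path and pgid):

# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 --pgid 2.7 --op log > pg-2.7-log.json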

In this case -- Benjamin's issue -- I'm trying to understand whether
it is related to:
* a huge pg log -- which would need trimming; perhaps the
pglog_hardlimit isn't on by default as designed -- or
* a normal-sized pg log with some entries that consume huge amounts
of memory (due to a yet-unsolved bug).
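
To help rule the first case in or out, besides the osdmap flag it is
worth checking which log-length limits the OSDs are actually running
with (option names as they exist in pacific) and comparing them with
the LOG columns from the pg dump above:

# ceph config get osd osd_min_pg_log_entries
# ceph config get osd osd_max_pg_log_entries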

Thanks,
Dan



>
> Thanks and best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Dan van der Ster <dvanders@xxxxxxxxx>
> Sent: 25 January 2022 11:56:38
> To: Benjamin Staffin
> Cc: Ceph Users; Matthew Wilder; Tara Fly
> Subject:  Re: Fwd: Lots of OSDs crashlooping (DRAFT - feedback?)
>
> Hi Benjamin,
>
> Apologies that I can't help for the bluestore issue.
>
> But that huge 100GB OSD consumption could be related to similar
> reports linked here: https://tracker.ceph.com/issues/53729
>
> Does your cluster have the pglog_hardlimit set?
>
> # ceph osd dump | grep pglog
> flags sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit
>
> Do you have PGs with really long pglogs?
>
> # ceph pg dump | grep + | awk '{print $10, $11, $12}' | sort -n | tail
>
>
>
> -- Dan
>
> On Tue, Jan 25, 2022 at 12:44 AM Benjamin Staffin
> <bstaffin@xxxxxxxxxxxxxxx> wrote:
> >
> > I have a cluster where 46 out of 120 OSDs have begun crash looping with the
> > same stack trace (see pasted output below).  The cluster is in a very bad
> > state with this many OSDs down, unsurprisingly.
> >
> > The day before this problem showed up, the k8s cluster was under extreme
> > memory pressure and a lot of pods were OOM killed, including some of the
> > Ceph OSDs, but after the memory pressure abated everything seemed to
> > stabilize for about a day.
> >
> > Then we attempted to set a 4gb memory limit on the OSD pods, because they
> > had been using upwards of 100gb of ram(!) per OSD after about a month of
> > uptime, and this was a contributing factor in the cluster-wide OOM
> > situation.  Everything seemed fine for a few minutes after Rook rolled out
> > the memory limit, but then OSDs gradually started to crash, a few at a
> > time, up to about 30 of them.  At this point I reverted the memory limit,
> > but I don't think the OSDs were hitting their memory limits at all.  In an
> > attempt to stabilize the cluster, we eventually stopped the Rook operator and set
> > the osd norebalance, nobackfill, noout, and norecover flags, but at this
> > point there were 46 OSDs down and pools were hitting BackFillFull.
> >
> > This is a Rook-ceph deployment on bare-metal kubernetes cluster of 12
> > nodes.  Each node has two 7TiB nvme disks dedicated to Ceph, and we have 5
> > BlueStore OSDs per nvme disk (so around 1.4TiB per OSD, which ought to be
> > fine with a 4gb memory target, right?).  The crash we're seeing looks very
> > much like the one in this bug report: https://tracker.ceph.com/issues/52220
> >
> > I don't know how to proceed from here, so any advice would be very much
> > appreciated.
> >
> > Ceph version: 16.2.6
> > Rook version: 1.7.6
> > Kubernetes version: 1.21.5
> > Kernel version: 5.4.156-1.el7.elrepo.x86_64
> > Distro: CentOS 7.9
> >
> > I've also attached the full log output from one of the crashing OSDs, in
> > case that is of any use.
> >
> > ----begin stack trace paste----
> > debug     -1> 2022-01-24T22:09:09.405+0000 7ff8b4315700 -1
> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/osd/ECUtil.cc:
> > In function 'void ECUtil::HashInfo::append(uint64_t, std::map<int,
> > ceph::buffer::v15_2_0::list>&)' thread 7ff8b4315700 time
> > 2022-01-24T22:09:09.398961+0000
> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/osd/ECUtil.cc:
> > 169: FAILED ceph_assert(to_append.size() == cumulative_shard_hashes.size())
> >
> >  ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific
> > (stable)
> >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > const*)+0x158) [0x564f88db554c]
> >  2: ceph-osd(+0x56a766) [0x564f88db5766]
> >  3: (ECUtil::HashInfo::append(unsigned long, std::map<int,
> > ceph::buffer::v15_2_0::list, std::less<int>, std::allocator<std::pair<int
> > const, ceph::buffer::v15_2_0::list> > >&)+0x14b) [0x564f8910ca0b]
> >  4: (encode_and_write(pg_t, hobject_t const&, ECUtil::stripe_info_t const&,
> > std::shared_ptr<ceph::ErasureCodeInterface>&, std::set<int, std::less<int>,
> > std::allocator<int> > const&, unsigned long, ceph::buffer::v15_2_0::list,
> > unsigned int, std::shared_ptr<ECUtil::HashInfo>, interval_map<unsigned
> > long, ceph::buffer::v15_2_0::list, bl_split_merge>&, std::map<shard_id_t,
> > ceph::os::Transaction, std::less<shard_id_t>,
> > std::allocator<std::pair<shard_id_t const, ceph::os::Transaction> > >*,
> > DoutPrefixProvider*)+0x6ec) [0x564f8929fa7c]
> >  5: ceph-osd(+0xa5a611) [0x564f892a5611]
> >  6: (ECTransaction::generate_transactions(ECTransaction::WritePlan&,
> > std::shared_ptr<ceph::ErasureCodeInterface>&, pg_t, ECUtil::stripe_info_t
> > const&, std::map<hobject_t, interval_map<unsigned long,
> > ceph::buffer::v15_2_0::list, bl_split_merge>, std::less<hobject_t>,
> > std::allocator<std::pair<hobject_t const, interval_map<unsigned long,
> > ceph::buffer::v15_2_0::list, bl_split_merge> > > > const&,
> > std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&,
> > std::map<hobject_t, interval_map<unsigned long,
> > ceph::buffer::v15_2_0::list, bl_split_merge>, std::less<hobject_t>,
> > std::allocator<std::pair<hobject_t const, interval_map<unsigned long,
> > ceph::buffer::v15_2_0::list, bl_split_merge> > > >*, std::map<shard_id_t,
> > ceph::os::Transaction, std::less<shard_id_t>,
> > std::allocator<std::pair<shard_id_t const, ceph::os::Transaction> > >*,
> > std::set<hobject_t, std::less<hobject_t>, std::allocator<hobject_t> >*,
> > std::set<hobject_t, std::less<hobject_t>, std::allocator<hobject_t> >*,
> > DoutPrefixProvider*, ceph_release_t)+0x7db) [0x564f892a6dcb]
> >  7: (ECBackend::try_reads_to_commit()+0x468) [0x564f8927ec28]
> >  8: (ECBackend::check_ops()+0x24) [0x564f89281cd4]
> >  9: (CallClientContexts::finish(std::pair<RecoveryMessages*,
> > ECBackend::read_result_t&>&)+0x1278) [0x564f8929d338]
> >  10: (ECBackend::complete_read_op(ECBackend::ReadOp&,
> > RecoveryMessages*)+0x8f) [0x564f8926dfaf]
> >  11: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&,
> > RecoveryMessages*, ZTracer::Trace const&)+0x1196) [0x564f89287106]
> >  12: (ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x18f)
> > [0x564f89287bdf]
> >  13: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x52)
> > [0x564f8908dd12]
> >  14: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&,
> > ThreadPool::TPHandle&)+0x5de) [0x564f89030d6e]
> >  15: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
> > boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x309)
> > [0x564f88eba1b9]
> >  16: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*,
> > boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x68) [0x564f89117868]
> >  17: (OSD::ShardedOpWQ::_process(unsigned int,
> > ceph::heartbeat_handle_d*)+0xa58) [0x564f88eda1e8]
> >  18: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4)
> > [0x564f895456c4]
> >  19: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x564f89548364]
> >  20: /lib64/libpthread.so.0(+0x814a) [0x7ff8db40e14a]
> >  21: clone()
> >
> > debug      0> 2022-01-24T22:09:09.411+0000 7ff8b4315700 -1 *** Caught
> > signal (Aborted) **
> >  in thread 7ff8b4315700 thread_name:tp_osd_tp
> > ----end paste----
> >
> > # ceph status
> >   cluster:
> >     id:     a262fadd-b995-4861-9cb0-06c1f1eddaf7
> >     health: HEALTH_ERR
> >             1 MDSs report slow metadata IOs
> >             1277/19235302 objects unfound (0.007%)
> >             noout,nobackfill,norebalance,norecover flag(s) set
> >             1 backfillfull osd(s)
> >             46 osds down
> >             15 nearfull osd(s)
> >             Reduced data availability: 1470 pgs inactive, 577 pgs down,
> > 1615 pgs stale
> >             Possible data damage: 22 pgs recovery_unfound
> >             Degraded data redundancy: 18956079/115409079 objects degraded
> > (16.425%), 1942 pgs degraded, 1941 pgs undersized
> >             13 pool(s) backfillfull
> >             1309 daemons have recently crashed
> >
> >   services:
> >     mon: 3 daemons, quorum b,c,d (age 2d)
> >     mgr: a(active, since 2d)
> >     mds: 1/1 daemons up, 1 hot standby
> >     osd: 120 osds: 74 up (since 27s), 120 in (since 38m); 1817 remapped pgs
> >          flags noout,nobackfill,norebalance,norecover
> >
> >   data:
> >     volumes: 1/1 healthy
> >     pools:   13 pools, 3234 pgs
> >     objects: 19.24M objects, 72 TiB
> >     usage:   80 TiB used, 41 TiB / 122 TiB avail
> >     pgs:     45.455% pgs not active
> >              18956079/115409079 objects degraded (16.425%)
> >              2329606/115409079 objects misplaced (2.019%)
> >              1277/19235302 objects unfound (0.007%)
> >              326 undersized+degraded+remapped+backfill_wait+peered
> >              325 active+clean
> >              325 stale+active+undersized+degraded+remapped+backfill_wait
> >              319 active+undersized+degraded+remapped+backfill_wait
> >              311 stale+undersized+degraded+remapped+backfill_wait+peered
> >              302 stale+active+clean
> >              278 stale+down+remapped
> >              217 down+remapped
> >              149 active+recovery_wait+undersized+degraded+remapped
> >              127 stale+active+recovery_wait+undersized+degraded+remapped
> >              119 stale+recovery_wait+undersized+degraded+remapped+peered
> >              107 recovery_wait+undersized+degraded+remapped+peered
> >              57  active+undersized+degraded
> >              50  stale+active+undersized+degraded
> >              46  down
> >              36  stale+down
> >              31  active+remapped+backfill_wait
> >              30  stale+active+remapped+backfill_wait
> >              10  active+recovery_unfound+degraded
> >              9   stale+active+undersized+remapped+backfill_wait
> >              7   stale+undersized+degraded+remapped+backfilling+peered
> >              6   active+undersized
> >              6   undersized+degraded+remapped+backfilling+peered
> >              5   active+undersized+remapped+backfill_wait
> >              5   stale+active+recovery_unfound+degraded
> >              4   stale+active+recovery_wait+degraded
> >              4   active+recovery_wait+degraded+remapped
> >              3   undersized+remapped+backfill_wait+peered
> >              3   recovery_unfound+undersized+degraded+remapped+peered
> >              3   stale+recovery_unfound+undersized+degraded+remapped+peered
> >              3   stale+undersized+degraded+peered
> >              3   stale+undersized+remapped+backfill_wait+peered
> >              2   active+recovery_wait+degraded
> >              2   stale+active+recovery_wait+degraded+remapped
> >              1   recovery_wait+undersized+degraded+peered
> >              1   active+recovery_unfound+undersized+degraded
> >              1   active+clean+remapped
> >              1   stale+recovery_wait+undersized+degraded+peered
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



