Re: Fwd: Lots of OSDs crashlooping (DRAFT - feedback?)



Hi Dan,

in several threads I have now seen statements like "Does your cluster have the pglog_hardlimit set?". In this context, I would be grateful if you could shed some light on the following:

1) How do I check that?

There is no equivalent "osd get pglog_hardlimit".

2) What is the recommendation?

In the Ceph documentation, the only occurrences of the term pglog_hardlimit are in the release notes for luminous and mimic, which state (mimic):

> A flag called pglog_hardlimit has been introduced, which is off by default. Enabling this flag will limit the
> length of the pg log. In order to enable that, the flag must be set by running ceph osd set pglog_hardlimit
> after completely upgrading to 13.2.2. Once the cluster has this flag set, the length of the pg log will be
> capped by a hard limit. Once set, this flag must not be unset anymore. In luminous, this feature was
> introduced in 12.2.11. Users who are running 12.2.11, and want to continue to use this feature, should
> upgrade to 13.2.5 or later.

How do I know if I want to use this feature? I would need a bit of information about pros and cons. Or should one have this enabled in any case? Would be great if you could provide some insight here.

Thanks and best regards,
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Dan van der Ster <dvanders@xxxxxxxxx>
Sent: 25 January 2022 11:56:38
To: Benjamin Staffin
Cc: Ceph Users; Matthew Wilder; Tara Fly
Subject:  Re: Fwd: Lots of OSDs crashlooping (DRAFT - feedback?)

Hi Benjamin,

Apologies that I can't help for the bluestore issue.

But that huge 100GB OSD consumption could be related to similar
reports linked here:

Does your cluster have the pglog_hardlimit set?

# ceph osd dump | grep pglog
flags sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit
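
If you want to script that check, here is a minimal sketch that wraps the grep above in a shell function. The name has_osd_flag is made up for illustration. It assumes only the plain-text "flags ..." line shown above, and it matches the flag name exactly instead of as a substring.

```shell
# has_osd_flag FLAG: reads `ceph osd dump` plain-text output on stdin and
# exits 0 only if FLAG is one of the comma-separated entries on the "flags" line.
# (Hypothetical helper, not part of the ceph CLI.)
has_osd_flag() {
    awk -v want="$1" '
        $1 == "flags" {
            n = split($2, flags, ",")
            for (i = 1; i <= n; i++)
                if (flags[i] == want) found = 1
        }
        END { exit (found ? 0 : 1) }
    '
}

# Usage on a live cluster (needs a working admin keyring):
#   ceph osd dump | has_osd_flag pglog_hardlimit && echo "set" || echo "not set"
```

The exact match matters little for pglog_hardlimit itself, but it avoids false positives if one flag name happens to be a prefix of another.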

Do you have PGs with really long pglogs?

# ceph pg dump | grep + | awk '{print $10, $11, $12}' | sort -n | tail
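
The field numbers in that one-liner depend on the release, since the column layout of the pg dump table has changed between versions, and "grep +" skips PGs whose state has no "+" in it (for example plain "down"). Below is a rough sketch that looks up the LOG and STATE columns by header name instead. The function name top_pg_logs is hypothetical, and it assumes the header row starts with PG_STAT and contains columns named LOG and STATE; please check that against your own output first.

```shell
# top_pg_logs [N]: reads `ceph pg dump` plain-text output on stdin and prints
# "LOG PG_STAT STATE" for the N PGs (default 10) with the longest pg logs.
# Assumption: the PG table header starts with PG_STAT and names LOG and STATE.
# (Hypothetical helper, not part of the ceph CLI.)
top_pg_logs() {
    awk '
        $1 == "PG_STAT" {
            # Remember where the LOG and STATE columns are in this release.
            for (i = 1; i <= NF; i++) {
                if ($i == "LOG")   log_col = i
                if ($i == "STATE") state_col = i
            }
            next
        }
        # A blank line ends the PG table; ignore the summary sections after it.
        NF == 0 { log_col = 0; next }
        # Only rows whose first field looks like a PG id (pool.hexseed).
        log_col && state_col && $1 ~ /^[0-9]+\.[0-9a-f]+$/ {
            print $log_col, $1, $state_col
        }
    ' | sort -n | tail -n "${1:-10}"
}

# Usage on a live cluster:
#   ceph pg dump 2>/dev/null | top_pg_logs 20
```

Printing the PG id next to the log length makes it easier to follow up on the individual PGs afterwards.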

-- Dan

On Tue, Jan 25, 2022 at 12:44 AM Benjamin Staffin
<bstaffin@xxxxxxxxxxxxxxx> wrote:
> I have a cluster where 46 out of 120 OSDs have begun crash looping with the
> same stack trace (see pasted output below).  The cluster is in a very bad
> state with this many OSDs down, unsurprisingly.
> The day before this problem showed up, the k8s cluster was under extreme
> memory pressure and a lot of pods were OOM killed, including some of the
> Ceph OSDs, but after the memory pressure abated everything seemed to
> stabilize for about a day.
> Then we attempted to set a 4gb memory limit on the OSD pods, because they
> had been using upwards of 100gb of ram(!) per OSD after about a month of
> uptime, and this was a contributing factor in the cluster-wide OOM
> situation.  Everything seemed fine for a few minutes after Rook rolled out
> the memory limit, but then OSDs gradually started to crash, a few at a
> time, up to about 30 of them.  At this point I reverted the memory limit,
> but I don't think the OSDs were hitting their memory limits at all.  In an
> attempt to stabilize the cluster, we eventually the Rook operator and set
> the osd norebalance, nobackfill, noout, and norecover flags, but at this
> point there were 46 OSDs down and pools were hitting BackFillFull.
> This is a Rook-ceph deployment on bare-metal kubernetes cluster of 12
> nodes.  Each node has two 7TiB nvme disks dedicated to Ceph, and we have 5
> BlueStore OSDs per nvme disk (so around 1.4TiB per OSD, which ought to be
> fine with a 4gb memory target, right?).  The crash we're seeing looks very
> much like the one in this bug report:
> I don't know how to proceed from here, so any advice would be very much
> appreciated.
> Ceph version: 16.2.6
> Rook version: 1.7.6
> Kubernetes version: 1.21.5
> Kernel version: 5.4.156-1.el7.elrepo.x86_64
> Distro: CentOS 7.9
> I've also attached the full log output from one of the crashing OSDs, in
> case that is of any use.
> ----begin stack trace paste----
> debug     -1> 2022-01-24T22:09:09.405+0000 7ff8b4315700 -1
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/osd/
> In function 'void ECUtil::HashInfo::append(uint64_t, std::map<int,
> ceph::buffer::v15_2_0::list>&)' thread 7ff8b4315700 time
> 2022-01-24T22:09:09.398961+0000
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/osd/
> 169: FAILED ceph_assert(to_append.size() == cumulative_shard_hashes.size())
>  ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific
> (stable)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x158) [0x564f88db554c]
>  2: ceph-osd(+0x56a766) [0x564f88db5766]
>  3: (ECUtil::HashInfo::append(unsigned long, std::map<int,
> ceph::buffer::v15_2_0::list, std::less<int>, std::allocator<std::pair<int
> const, ceph::buffer::v15_2_0::list> > >&)+0x14b) [0x564f8910ca0b]
>  4: (encode_and_write(pg_t, hobject_t const&, ECUtil::stripe_info_t const&,
> std::shared_ptr<ceph::ErasureCodeInterface>&, std::set<int, std::less<int>,
> std::allocator<int> > const&, unsigned long, ceph::buffer::v15_2_0::list,
> unsigned int, std::shared_ptr<ECUtil::HashInfo>, interval_map<unsigned
> long, ceph::buffer::v15_2_0::list, bl_split_merge>&, std::map<shard_id_t,
> ceph::os::Transaction, std::less<shard_id_t>,
> std::allocator<std::pair<shard_id_t const, ceph::os::Transaction> > >*,
> DoutPrefixProvider*)+0x6ec) [0x564f8929fa7c]
>  5: ceph-osd(+0xa5a611) [0x564f892a5611]
>  6: (ECTransaction::generate_transactions(ECTransaction::WritePlan&,
> std::shared_ptr<ceph::ErasureCodeInterface>&, pg_t, ECUtil::stripe_info_t
> const&, std::map<hobject_t, interval_map<unsigned long,
> ceph::buffer::v15_2_0::list, bl_split_merge>, std::less<hobject_t>,
> std::allocator<std::pair<hobject_t const, interval_map<unsigned long,
> ceph::buffer::v15_2_0::list, bl_split_merge> > > > const&,
> std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&,
> std::map<hobject_t, interval_map<unsigned long,
> ceph::buffer::v15_2_0::list, bl_split_merge>, std::less<hobject_t>,
> std::allocator<std::pair<hobject_t const, interval_map<unsigned long,
> ceph::buffer::v15_2_0::list, bl_split_merge> > > >*, std::map<shard_id_t,
> ceph::os::Transaction, std::less<shard_id_t>,
> std::allocator<std::pair<shard_id_t const, ceph::os::Transaction> > >*,
> std::set<hobject_t, std::less<hobject_t>, std::allocator<hobject_t> >*,
> std::set<hobject_t, std::less<hobject_t>, std::allocator<hobject_t> >*,
> DoutPrefixProvider*, ceph_release_t)+0x7db) [0x564f892a6dcb]
>  7: (ECBackend::try_reads_to_commit()+0x468) [0x564f8927ec28]
>  8: (ECBackend::check_ops()+0x24) [0x564f89281cd4]
>  9: (CallClientContexts::finish(std::pair<RecoveryMessages*,
> ECBackend::read_result_t&>&)+0x1278) [0x564f8929d338]
>  10: (ECBackend::complete_read_op(ECBackend::ReadOp&,
> RecoveryMessages*)+0x8f) [0x564f8926dfaf]
>  11: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&,
> RecoveryMessages*, ZTracer::Trace const&)+0x1196) [0x564f89287106]
>  12: (ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x18f)
> [0x564f89287bdf]
>  13: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x52)
> [0x564f8908dd12]
>  14: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&,
> ThreadPool::TPHandle&)+0x5de) [0x564f89030d6e]
>  15: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
> boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x309)
> [0x564f88eba1b9]
>  16: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*,
> boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x68) [0x564f89117868]
>  17: (OSD::ShardedOpWQ::_process(unsigned int,
> ceph::heartbeat_handle_d*)+0xa58) [0x564f88eda1e8]
>  18: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4)
> [0x564f895456c4]
>  19: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x564f89548364]
>  20: /lib64/ [0x7ff8db40e14a]
>  21: clone()
> debug      0> 2022-01-24T22:09:09.411+0000 7ff8b4315700 -1 *** Caught
> signal (Aborted) **
>  in thread 7ff8b4315700 thread_name:tp_osd_tp
> ----end paste----
> # ceph status
>   cluster:
>     id:     a262fadd-b995-4861-9cb0-06c1f1eddaf7
>     health: HEALTH_ERR
>             1 MDSs report slow metadata IOs
>             1277/19235302 objects unfound (0.007%)
>             noout,nobackfill,norebalance,norecover flag(s) set
>             1 backfillfull osd(s)
>             46 osds down
>             15 nearfull osd(s)
>             Reduced data availability: 1470 pgs inactive, 577 pgs down,
> 1615 pgs stale
>             Possible data damage: 22 pgs recovery_unfound
>             Degraded data redundancy: 18956079/115409079 objects degraded
> (16.425%), 1942 pgs degraded, 1941 pgs undersized
>             13 pool(s) backfillfull
>             1309 daemons have recently crashed
>   services:
>     mon: 3 daemons, quorum b,c,d (age 2d)
>     mgr: a(active, since 2d)
>     mds: 1/1 daemons up, 1 hot standby
>     osd: 120 osds: 74 up (since 27s), 120 in (since 38m); 1817 remapped pgs
>          flags noout,nobackfill,norebalance,norecover
>   data:
>     volumes: 1/1 healthy
>     pools:   13 pools, 3234 pgs
>     objects: 19.24M objects, 72 TiB
>     usage:   80 TiB used, 41 TiB / 122 TiB avail
>     pgs:     45.455% pgs not active
>              18956079/115409079 objects degraded (16.425%)
>              2329606/115409079 objects misplaced (2.019%)
>              1277/19235302 objects unfound (0.007%)
>              326 undersized+degraded+remapped+backfill_wait+peered
>              325 active+clean
>              325 stale+active+undersized+degraded+remapped+backfill_wait
>              319 active+undersized+degraded+remapped+backfill_wait
>              311 stale+undersized+degraded+remapped+backfill_wait+peered
>              302 stale+active+clean
>              278 stale+down+remapped
>              217 down+remapped
>              149 active+recovery_wait+undersized+degraded+remapped
>              127 stale+active+recovery_wait+undersized+degraded+remapped
>              119 stale+recovery_wait+undersized+degraded+remapped+peered
>              107 recovery_wait+undersized+degraded+remapped+peered
>              57  active+undersized+degraded
>              50  stale+active+undersized+degraded
>              46  down
>              36  stale+down
>              31  active+remapped+backfill_wait
>              30  stale+active+remapped+backfill_wait
>              10  active+recovery_unfound+degraded
>              9   stale+active+undersized+remapped+backfill_wait
>              7   stale+undersized+degraded+remapped+backfilling+peered
>              6   active+undersized
>              6   undersized+degraded+remapped+backfilling+peered
>              5   active+undersized+remapped+backfill_wait
>              5   stale+active+recovery_unfound+degraded
>              4   stale+active+recovery_wait+degraded
>              4   active+recovery_wait+degraded+remapped
>              3   undersized+remapped+backfill_wait+peered
>              3   recovery_unfound+undersized+degraded+remapped+peered
>              3   stale+recovery_unfound+undersized+degraded+remapped+peered
>              3   stale+undersized+degraded+peered
>              3   stale+undersized+remapped+backfill_wait+peered
>              2   active+recovery_wait+degraded
>              2   stale+active+recovery_wait+degraded+remapped
>              1   recovery_wait+undersized+degraded+peered
>              1   active+recovery_unfound+undersized+degraded
>              1   active+clean+remapped
>              1   stale+recovery_wait+undersized+degraded+peered
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
