Fwd: Lots of OSDs crashlooping (DRAFT - feedback?)

I have a cluster where 46 out of 120 OSDs have begun crash looping with the
same stack trace (see pasted output below).  The cluster is in a very bad
state with this many OSDs down, unsurprisingly.

The day before this problem showed up, the k8s cluster was under extreme
memory pressure and a lot of pods were OOM killed, including some of the
Ceph OSDs, but after the memory pressure abated everything seemed to
stabilize for about a day.

Then we attempted to set a 4 GB memory limit on the OSD pods, because they
had been using upwards of 100 GB of RAM(!) per OSD after about a month of
uptime, and this was a contributing factor in the cluster-wide OOM
situation.  Everything seemed fine for a few minutes after Rook rolled out
the memory limit, but then OSDs gradually started to crash, a few at a
time, up to about 30 of them.  At that point I reverted the memory limit,
though I don't think the OSDs were actually hitting their memory limits.
In an attempt to stabilize the cluster, we eventually stopped the Rook
operator and set the osd norebalance, nobackfill, noout, and norecover
flags, but by then 46 OSDs were down and pools were hitting backfillfull.
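
For the record, this was roughly the sequence (a sketch; the rook-ceph
namespace and rook-ceph-operator deployment name assume a default Rook
install, and the ceph commands were run from the toolbox pod):

# kubectl -n rook-ceph scale deployment rook-ceph-operator --replicas=0
# ceph osd set norebalance
# ceph osd set nobackfill
# ceph osd set noout
# ceph osd set norecover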

This is a Rook-Ceph deployment on a bare-metal Kubernetes cluster of 12
nodes.  Each node has two 7 TiB NVMe disks dedicated to Ceph, and we run 5
BlueStore OSDs per NVMe disk (so around 1.4 TiB per OSD, which ought to be
fine with a 4 GB memory target, right?).  The crash we're seeing looks very
much like the one in this bug report: https://tracker.ceph.com/issues/52220
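
In case it matters, this is how we are checking what memory target the
OSDs actually picked up (a sketch run from the toolbox; osd.0 is just an
example ID, and my understanding is that Rook derives osd_memory_target
from the pod memory limit when one is set):

# ceph config get osd osd_memory_target
# ceph config show osd.0 osd_memory_target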

I don't know how to proceed from here, so any advice would be very much
appreciated.

Ceph version: 16.2.6
Rook version: 1.7.6
Kubernetes version: 1.21.5
Kernel version: 5.4.156-1.el7.elrepo.x86_64
Distro: CentOS 7.9

I've also attached the full log output from one of the crashing OSDs, in
case that is of any use.
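
For anyone who wants to pull the same logs, I grabbed them with something
like the following (osd.7 is just an example ID; --previous captures the
last crashed container):

# kubectl -n rook-ceph logs deploy/rook-ceph-osd-7 --previous > osd-7.log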

----begin stack trace paste----
debug     -1> 2022-01-24T22:09:09.405+0000 7ff8b4315700 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/osd/ECUtil.cc:
In function 'void ECUtil::HashInfo::append(uint64_t, std::map<int,
ceph::buffer::v15_2_0::list>&)' thread 7ff8b4315700 time
2022-01-24T22:09:09.398961+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.6/rpm/el8/BUILD/ceph-16.2.6/src/osd/ECUtil.cc:
169: FAILED ceph_assert(to_append.size() == cumulative_shard_hashes.size())

 ceph version 16.2.6 (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific
(stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x158) [0x564f88db554c]
 2: ceph-osd(+0x56a766) [0x564f88db5766]
 3: (ECUtil::HashInfo::append(unsigned long, std::map<int,
ceph::buffer::v15_2_0::list, std::less<int>, std::allocator<std::pair<int
const, ceph::buffer::v15_2_0::list> > >&)+0x14b) [0x564f8910ca0b]
 4: (encode_and_write(pg_t, hobject_t const&, ECUtil::stripe_info_t const&,
std::shared_ptr<ceph::ErasureCodeInterface>&, std::set<int, std::less<int>,
std::allocator<int> > const&, unsigned long, ceph::buffer::v15_2_0::list,
unsigned int, std::shared_ptr<ECUtil::HashInfo>, interval_map<unsigned
long, ceph::buffer::v15_2_0::list, bl_split_merge>&, std::map<shard_id_t,
ceph::os::Transaction, std::less<shard_id_t>,
std::allocator<std::pair<shard_id_t const, ceph::os::Transaction> > >*,
DoutPrefixProvider*)+0x6ec) [0x564f8929fa7c]
 5: ceph-osd(+0xa5a611) [0x564f892a5611]
 6: (ECTransaction::generate_transactions(ECTransaction::WritePlan&,
std::shared_ptr<ceph::ErasureCodeInterface>&, pg_t, ECUtil::stripe_info_t
const&, std::map<hobject_t, interval_map<unsigned long,
ceph::buffer::v15_2_0::list, bl_split_merge>, std::less<hobject_t>,
std::allocator<std::pair<hobject_t const, interval_map<unsigned long,
ceph::buffer::v15_2_0::list, bl_split_merge> > > > const&,
std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&,
std::map<hobject_t, interval_map<unsigned long,
ceph::buffer::v15_2_0::list, bl_split_merge>, std::less<hobject_t>,
std::allocator<std::pair<hobject_t const, interval_map<unsigned long,
ceph::buffer::v15_2_0::list, bl_split_merge> > > >*, std::map<shard_id_t,
ceph::os::Transaction, std::less<shard_id_t>,
std::allocator<std::pair<shard_id_t const, ceph::os::Transaction> > >*,
std::set<hobject_t, std::less<hobject_t>, std::allocator<hobject_t> >*,
std::set<hobject_t, std::less<hobject_t>, std::allocator<hobject_t> >*,
DoutPrefixProvider*, ceph_release_t)+0x7db) [0x564f892a6dcb]
 7: (ECBackend::try_reads_to_commit()+0x468) [0x564f8927ec28]
 8: (ECBackend::check_ops()+0x24) [0x564f89281cd4]
 9: (CallClientContexts::finish(std::pair<RecoveryMessages*,
ECBackend::read_result_t&>&)+0x1278) [0x564f8929d338]
 10: (ECBackend::complete_read_op(ECBackend::ReadOp&,
RecoveryMessages*)+0x8f) [0x564f8926dfaf]
 11: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&,
RecoveryMessages*, ZTracer::Trace const&)+0x1196) [0x564f89287106]
 12: (ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x18f)
[0x564f89287bdf]
 13: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x52)
[0x564f8908dd12]
 14: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&,
ThreadPool::TPHandle&)+0x5de) [0x564f89030d6e]
 15: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x309)
[0x564f88eba1b9]
 16: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*,
boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x68) [0x564f89117868]
 17: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0xa58) [0x564f88eda1e8]
 18: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4)
[0x564f895456c4]
 19: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x564f89548364]
 20: /lib64/libpthread.so.0(+0x814a) [0x7ff8db40e14a]
 21: clone()

debug      0> 2022-01-24T22:09:09.411+0000 7ff8b4315700 -1 *** Caught
signal (Aborted) **
 in thread 7ff8b4315700 thread_name:tp_osd_tp
----end paste----

# ceph status
  cluster:
    id:     a262fadd-b995-4861-9cb0-06c1f1eddaf7
    health: HEALTH_ERR
            1 MDSs report slow metadata IOs
            1277/19235302 objects unfound (0.007%)
            noout,nobackfill,norebalance,norecover flag(s) set
            1 backfillfull osd(s)
            46 osds down
            15 nearfull osd(s)
            Reduced data availability: 1470 pgs inactive, 577 pgs down,
1615 pgs stale
            Possible data damage: 22 pgs recovery_unfound
            Degraded data redundancy: 18956079/115409079 objects degraded
(16.425%), 1942 pgs degraded, 1941 pgs undersized
            13 pool(s) backfillfull
            1309 daemons have recently crashed

  services:
    mon: 3 daemons, quorum b,c,d (age 2d)
    mgr: a(active, since 2d)
    mds: 1/1 daemons up, 1 hot standby
    osd: 120 osds: 74 up (since 27s), 120 in (since 38m); 1817 remapped pgs
         flags noout,nobackfill,norebalance,norecover

  data:
    volumes: 1/1 healthy
    pools:   13 pools, 3234 pgs
    objects: 19.24M objects, 72 TiB
    usage:   80 TiB used, 41 TiB / 122 TiB avail
    pgs:     45.455% pgs not active
             18956079/115409079 objects degraded (16.425%)
             2329606/115409079 objects misplaced (2.019%)
             1277/19235302 objects unfound (0.007%)
             326 undersized+degraded+remapped+backfill_wait+peered
             325 active+clean
             325 stale+active+undersized+degraded+remapped+backfill_wait
             319 active+undersized+degraded+remapped+backfill_wait
             311 stale+undersized+degraded+remapped+backfill_wait+peered
             302 stale+active+clean
             278 stale+down+remapped
             217 down+remapped
             149 active+recovery_wait+undersized+degraded+remapped
             127 stale+active+recovery_wait+undersized+degraded+remapped
             119 stale+recovery_wait+undersized+degraded+remapped+peered
             107 recovery_wait+undersized+degraded+remapped+peered
             57  active+undersized+degraded
             50  stale+active+undersized+degraded
             46  down
             36  stale+down
             31  active+remapped+backfill_wait
             30  stale+active+remapped+backfill_wait
             10  active+recovery_unfound+degraded
             9   stale+active+undersized+remapped+backfill_wait
             7   stale+undersized+degraded+remapped+backfilling+peered
             6   active+undersized
             6   undersized+degraded+remapped+backfilling+peered
             5   active+undersized+remapped+backfill_wait
             5   stale+active+recovery_unfound+degraded
             4   stale+active+recovery_wait+degraded
             4   active+recovery_wait+degraded+remapped
             3   undersized+remapped+backfill_wait+peered
             3   recovery_unfound+undersized+degraded+remapped+peered
             3   stale+recovery_unfound+undersized+degraded+remapped+peered
             3   stale+undersized+degraded+peered
             3   stale+undersized+remapped+backfill_wait+peered
             2   active+recovery_wait+degraded
             2   stale+active+recovery_wait+degraded+remapped
             1   recovery_wait+undersized+degraded+peered
             1   active+recovery_unfound+undersized+degraded
             1   active+clean+remapped
             1   stale+recovery_wait+undersized+degraded+peered


