Dear Ceph users,

I'm fairly new to this list, but I've been using Ceph with satisfaction since 2020. Over the years I've solved the problems I ran into by consulting the list archive, but now we're stuck on one that doesn't seem to have an answer. After a power failure, we have a bunch of OSDs that go down during rebalance/backfill with this error:

/build/ceph-OM2K9O/ceph-13.2.9/src/osd/osd_types.cc: In function 'uint64_t SnapSet::get_clone_bytes(snapid_t) const' thread 7fdcb2523700 time 2024-05-02 17:18:40.680350
/build/ceph-OM2K9O/ceph-13.2.9/src/osd/osd_types.cc: 5084: FAILED assert(clone_size.count(clone))
 ceph version 13.2.9 (58a2a9b31fd08d8bb3089fce0e312331502ff945) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14e) [0x7fdcd38f63ee]
 2: (()+0x287577) [0x7fdcd38f6577]
 3: (SnapSet::get_clone_bytes(snapid_t) const+0x125) [0x555e697c2725]
 4: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x2c8) [0x555e696d8208]
 5: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0x1169) [0x555e6973f749]
 6: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0x1018) [0x555e69743b98]
 7: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x36a) [0x555e695b07da]
 8: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0x555e69813c99]
 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x52d) [0x555e695b220d]
 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x476) [0x7fdcd38fc516]
 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7fdcd38fd6d0]
 12: (()+0x76db) [0x7fdcd23e46db]
 13: (clone()+0x3f) [0x7fdcd13ad61f]

-6171> 2024-05-02 17:18:40.680 7fdcb2523700 -1 *** Caught signal (Aborted) ** in thread 7fdcb2523700 thread_name:tp_osd_tp

We're unable to understand what's happening. Yes, we're still on an old release (Mimic 13.2.9, as the trace shows); we had planned to upgrade to Pacific in June, but before upgrading I believe it's important to have a clean health check. The pools reporting errors are EC pools.

Any hints?
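For what it's worth, here is my (possibly naive) reading of the failed assertion, as a simplified C++ sketch based only on the trace above. The real SnapSet in Ceph's osd_types.cc/osd_types.h is more involved; the plain-integer snapid_t, the SnapSetSketch struct and the main() below are just my illustration, not Ceph code:

// Hedged sketch, not the actual Ceph source: illustrates what the assert
// in the trace checks. SnapSet tracks the snapshot clones of an object;
// clone_size maps a clone's snap id to its byte size. The assert fires when
// the size of a clone is requested but that clone has no clone_size entry,
// i.e. the object's snapset metadata looks internally inconsistent.

#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

using snapid_t = uint64_t;  // simplified; Ceph wraps this in its own type

struct SnapSetSketch {
  std::vector<snapid_t> clones;             // clones the snapset claims exist
  std::map<snapid_t, uint64_t> clone_size;  // per-clone size accounting

  uint64_t get_clone_bytes(snapid_t clone) const {
    // This is the check that aborts the OSD in the posted trace:
    // the clone must have a size entry, otherwise the metadata is broken.
    assert(clone_size.count(clone));
    return clone_size.at(clone);
  }
};

int main() {
  SnapSetSketch ss;
  ss.clones = {4};  // the snapset claims clone 4 exists...
  // ...but clone_size has no entry for it, so the next call aborts,
  // analogous to the OSD hitting such an object during backfill.
  ss.get_clone_bytes(4);
  return 0;
}

If that reading is right, backfill is tripping over an object whose snapset lists a clone with no size accounting, which I guess could be metadata damaged by the power failure; but I'd appreciate confirmation and advice on how to proceed.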