Dear Ceph users,

I'm fairly new to this list, but I've been using Ceph with satisfaction since 2020. Over the years I've solved the problems I ran into by consulting the list archive, but now we're stuck on one that doesn't seem to have an answer. After a power failure, we have a bunch of OSDs that go down during rebalance/backfill with this error:

/build/ceph-OM2K9O/ceph-13.2.9/src/osd/osd_types.cc: In function 'uint64_t SnapSet::get_clone_bytes(snapid_t) const' thread 7fdcb2523700 time 2024-05-02 17:18:40.680350
/build/ceph-OM2K9O/ceph-13.2.9/src/osd/osd_types.cc: 5084: FAILED assert(clone_size.count(clone))
 ceph version 13.2.9 (58a2a9b31fd08d8bb3089fce0e312331502ff945) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14e) [0x7fdcd38f63ee]
 2: (()+0x287577) [0x7fdcd38f6577]
 3: (SnapSet::get_clone_bytes(snapid_t) const+0x125) [0x555e697c2725]
 4: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x2c8) [0x555e696d8208]
 5: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0x1169) [0x555e6973f749]
 6: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0x1018) [0x555e69743b98]
 7: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x36a) [0x555e695b07da]
 8: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0x555e69813c99]
 9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x52d) [0x555e695b220d]
 10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x476) [0x7fdcd38fc516]
 11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7fdcd38fd6d0]
 12: (()+0x76db) [0x7fdcd23e46db]
 13: (clone()+0x3f) [0x7fdcd13ad61f]

-6171> 2024-05-02 17:18:40.680 7fdcb2523700 -1 *** Caught signal (Aborted) ** in thread 7fdcb2523700 thread_name:tp_osd_tp

We're unable to understand what's happening. Yes, we're still on an old release (Mimic 13.2.9, as the trace shows); we had planned to upgrade to Pacific in June, but before upgrading I believe it's important to have a clean health check. The pools reporting errors are EC pools.

Any hints?
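For what it's worth, here is my (possibly naive) reading of the failed assertion, as a simplified C++ sketch based only on the trace above. The real SnapSet in Ceph's osd_types.cc/osd_types.h is more involved; the plain-integer snapid_t, the SnapSetSketch struct and the main() below are just my illustration, not Ceph code:

// Hedged sketch, not the actual Ceph source: illustrates what the assert
// in the trace checks. SnapSet tracks the snapshot clones of an object;
// clone_size maps a clone's snap id to its byte size. The assert fires when
// the size of a clone is requested but that clone has no clone_size entry,
// i.e. the object's snapset metadata looks internally inconsistent.

#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

using snapid_t = uint64_t;  // simplified; Ceph wraps this in its own type

struct SnapSetSketch {
  std::vector<snapid_t> clones;             // clones the snapset claims exist
  std::map<snapid_t, uint64_t> clone_size;  // per-clone size accounting

  uint64_t get_clone_bytes(snapid_t clone) const {
    // This is the check that aborts the OSD in the posted trace:
    // the clone must have a size entry, otherwise the metadata is broken.
    assert(clone_size.count(clone));
    return clone_size.at(clone);
  }
};

int main() {
  SnapSetSketch ss;
  ss.clones = {4};  // the snapset claims clone 4 exists...
  // ...but clone_size has no entry for it, so the next call aborts,
  // analogous to the OSD hitting such an object during backfill.
  ss.get_clone_bytes(4);
  return 0;
}

If that reading is right, backfill is tripping over an object whose snapset lists a clone with no size accounting, which I guess could be metadata damaged by the power failure; but I'd appreciate confirmation and advice on how to proceed.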