I'm sorry, I made a small mistake: our release is Mimic, as is obvious
from the logged error, and all the Ceph components are aligned to Mimic.
On 06/05/2024 10:04, sergio.rabellino@xxxxxxxx wrote:
Dear Ceph users,
I'm pretty new to this list, but I've been using Ceph with satisfaction since 2020. Over the years I've resolved a few problems by consulting the list archive, but now we're stuck on a problem that seems to have no answer.
After a power failure, we have a bunch of OSDs that go down during rebalance/backfill with this error:
/build/ceph-OM2K9O/ceph-13.2.9/src/osd/osd_types.cc: In function 'uint64_t SnapSet::get_clone_bytes(snapid_t) const' thread 7fdcb2523700 time 2024-05-02 17:18:40.680350
/build/ceph-OM2K9O/ceph-13.2.9/src/osd/osd_types.cc: 5084: FAILED assert(clone_size.count(clone))
ceph version 13.2.9 (58a2a9b31fd08d8bb3089fce0e312331502ff945) mimic (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14e) [0x7fdcd38f63ee]
2: (()+0x287577) [0x7fdcd38f6577]
3: (SnapSet::get_clone_bytes(snapid_t) const+0x125) [0x555e697c2725]
4: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x2c8) [0x555e696d8208]
5: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0x1169) [0x555e6973f749]
6: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0x1018) [0x555e69743b98]
7: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x36a) [0x555e695b07da]
8: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0x555e69813c99]
9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x52d) [0x555e695b220d]
10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x476) [0x7fdcd38fc516]
11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7fdcd38fd6d0]
12: (()+0x76db) [0x7fdcd23e46db]
13: (clone()+0x3f) [0x7fdcd13ad61f]
-6171> 2024-05-02 17:18:40.680 7fdcb2523700 -1 *** Caught signal (Aborted) **
in thread 7fdcb2523700 thread_name:tp_osd_tp
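To make the assert easier to read, here is a minimal C++ sketch of the condition that fails (simplified stand-in types and names, not the actual Ceph osd_types.cc source): the SnapSet's clone_size map has no entry for a clone that backfill is trying to account for, so the stats update cannot proceed and the OSD aborts.

// Illustrative sketch only, not the real Ceph code.
#include <cassert>
#include <cstdint>
#include <map>

using snapid_t = uint64_t;  // simplified stand-in for Ceph's snapid_t

struct SnapSetSketch {
  // Per-clone size bookkeeping; this is the map named in the failed assert.
  std::map<snapid_t, uint64_t> clone_size;

  uint64_t get_clone_bytes(snapid_t clone) const {
    // This is the condition reported as FAILED in the log: the clone must
    // have a size entry, otherwise its bytes cannot be counted for pg stats.
    assert(clone_size.count(clone));
    return clone_size.find(clone)->second;
  }
};

int main() {
  SnapSetSketch ss;
  ss.clone_size[4] = 4194304;    // a healthy clone entry
  (void)ss.get_clone_bytes(4);   // fine
  // ss.get_clone_bytes(7);      // no entry: aborts, analogous to the OSD crash
  return 0;
}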
And we're unable to understand what's happening. Yes, we're actually still on Luminous, but we planned to upgrade to Pacific in June; before upgrading, though, I believe it's important to have a positive health check.
The affected pools are erasure-coded (EC) pools.
Any hints?
--
ing. Sergio Rabellino
Università degli Studi di Torino
Dipartimento di Informatica
Tecnico di Ricerca
Tel +39-0116706701 Fax +39-011751603
C.so Svizzera, 185 - 10149 - Torino
<http://www.di.unito.it>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx