I'm sorry, I made a small mistake: our release is Mimic, as is obvious
from the logged error, and all the Ceph components are aligned to Mimic.
On 06/05/2024 10:04, sergio.rabellino@xxxxxxxx wrote:
Dear Ceph users,
I'm pretty new to this list, but I've been using Ceph with satisfaction since 2020. Over the years I've resolved a few problems by consulting the list archive, but now we're stuck on a problem that seems to have no answer.
After a power failure, we have a bunch of OSDs that go down during rebalance/backfill with this error:
/build/ceph-OM2K9O/ceph-13.2.9/src/osd/osd_types.cc: In function 'uint64_t SnapSet::get_clone_bytes(snapid_t) const' thread 7fdcb2523700 time 2024-05-02 17:18:40.680350
/build/ceph-OM2K9O/ceph-13.2.9/src/osd/osd_types.cc: 5084: FAILED assert(clone_size.count(clone))
ceph version 13.2.9 (58a2a9b31fd08d8bb3089fce0e312331502ff945) mimic (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x14e) [0x7fdcd38f63ee]
2: (()+0x287577) [0x7fdcd38f6577]
3: (SnapSet::get_clone_bytes(snapid_t) const+0x125) [0x555e697c2725]
4: (PrimaryLogPG::add_object_context_to_pg_stat(std::shared_ptr<ObjectContext>, pg_stat_t*)+0x2c8) [0x555e696d8208]
5: (PrimaryLogPG::recover_backfill(unsigned long, ThreadPool::TPHandle&, bool*)+0x1169) [0x555e6973f749]
6: (PrimaryLogPG::start_recovery_ops(unsigned long, ThreadPool::TPHandle&, unsigned long*)+0x1018) [0x555e69743b98]
7: (OSD::do_recovery(PG*, unsigned int, unsigned long, ThreadPool::TPHandle&)+0x36a) [0x555e695b07da]
8: (PGRecovery::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x19) [0x555e69813c99]
9: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x52d) [0x555e695b220d]
10: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x476) [0x7fdcd38fc516]
11: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7fdcd38fd6d0]
12: (()+0x76db) [0x7fdcd23e46db]
13: (clone()+0x3f) [0x7fdcd13ad61f]
-6171> 2024-05-02 17:18:40.680 7fdcb2523700 -1 *** Caught signal (Aborted) **
in thread 7fdcb2523700 thread_name:tp_osd_tp
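To make the assert easier to read, here is a minimal C++ sketch of the condition that fails (simplified stand-in types and names, not the actual Ceph osd_types.cc source): the SnapSet's clone_size map has no entry for a clone that backfill is trying to account for, so the stats update cannot proceed and the OSD aborts.

// Illustrative sketch only, not the real Ceph code.
#include <cassert>
#include <cstdint>
#include <map>

using snapid_t = uint64_t;  // simplified stand-in for Ceph's snapid_t

struct SnapSetSketch {
  // Per-clone size bookkeeping; this is the map named in the failed assert.
  std::map<snapid_t, uint64_t> clone_size;

  uint64_t get_clone_bytes(snapid_t clone) const {
    // This is the condition reported as FAILED in the log: the clone must
    // have a size entry, otherwise its bytes cannot be counted for pg stats.
    assert(clone_size.count(clone));
    return clone_size.find(clone)->second;
  }
};

int main() {
  SnapSetSketch ss;
  ss.clone_size[4] = 4194304;    // a healthy clone entry
  (void)ss.get_clone_bytes(4);   // fine
  // ss.get_clone_bytes(7);      // no entry: aborts, analogous to the OSD crash
  return 0;
}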
And we're unable to understand what's happening. Yes, we're actually still on Luminous, but we planned to upgrade to Pacific in June; before upgrading, though, I believe it's important to have a positive health check.
The affected pools are erasure-coded (EC) pools.
Any hints?
--
ing. Sergio Rabellino
Università degli Studi di Torino
Dipartimento di Informatica
Tecnico di Ricerca
Tel +39-0116706701 Fax +39-011751603
C.so Svizzera, 185 - 10149 - Torino
<http://www.di.unito.it>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx