We are running Rook-Ceph deployed as an operator in Kubernetes, with Rook 1.10.8 and Ceph 17.2.5. The cluster works fine overall, but roughly every 3-4 days an OSD daemon crashes and then restarts without any further problem. We are also seeing flapping OSDs, i.e. OSDs repeatedly being marked up and down.

Recently the daemon crash happened for 2 OSDs at the same time, on different nodes, with the error below in the crash info:

   -305> 2023-12-17T14:50:14.413+0000 7f53b5f91700 -1 *** Caught signal (Aborted) **
   in thread 7f53b5f91700 thread_name:tp_osd_tp

   ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)
   1: /lib64/libpthread.so.0(+0x12cf0) [0x7f53d93ddcf0]
   2: gsignal()
   3: abort()
   4: /lib64/libc.so.6(+0x21d79) [0x7f53d8025d79]
   5: /lib64/libc.so.6(+0x47456) [0x7f53d804b456]
   6: (MOSDRepOp::encode_payload(unsigned long)+0x2d0) [0x55acc0f81730]
   7: (Message::encode(unsigned long, int, bool)+0x2e) [0x55acc140ec2e]
   8: (ProtocolV2::send_message(Message*)+0x25e) [0x55acc16a5aae]
   9: (AsyncConnection::send_message(Message*)+0x18e) [0x55acc167dc4e]
   10: (OSDService::send_message_osd_cluster(int, Message*, unsigned int)+0x2bd) [0x55acc0b4b11d]
   11: (ReplicatedBackend::issue_op(hobject_t const&, eversion_t const&, unsigned long, osd_reqid_t, eversion_t, eversion_t, hobject_t, hobject_t, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> > const&, std::optional<pg_hit_set_history_t>&, ReplicatedBackend::InProgressOp*, ceph::os::Transaction&)+0x6c8) [0x55acc0f69368]
   12: (ReplicatedBackend::submit_transaction(hobject_t const&, object_stat_sum_t const&, eversion_t const&, std::unique_ptr<PGTransaction, std::default_delete<PGTransaction> >&&, eversion_t const&, eversion_t const&, std::vector<pg_log_entry_t, std::allocator<pg_log_entry_t> >&&, std::optional<pg_hit_set_history_t>&, Context*, unsigned long, osd_reqid_t, boost::intrusive_ptr<OpRequest>)+0x5e7) [0x55acc0f6c907]
   13: (PrimaryLogPG::issue_repop(PrimaryLogPG::RepGather*, PrimaryLogPG::OpContext*)+0x50d) [0x55acc0c92ebd]
   14: (PrimaryLogPG::execute_ctx(PrimaryLogPG::OpContext*)+0xd25) [0x55acc0cf0295]
   15: (PrimaryLogPG::do_op(boost::intrusive_ptr<OpRequest>&)+0x288d) [0x55acc0cf78fd]
   16: (OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x1c0) [0x55acc0b56900]
   17: (ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x6d) [0x55acc0e552ad]
   18: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x115f) [0x55acc0b69dbf]
   19: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x435) [0x55acc12c78c5]
   20: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55acc12c9fe4]
   21: /lib64/libpthread.so.0(+0x81ca) [0x7f53d93d31ca]
   22: clone()

The OSD log also shows the following error before the crash:

   scrub-queue::*remove_from_osd_queue* removing pg[2.4f0] failed. State was: unregistering

Please help to troubleshoot and fix this issue. I have already posted it on the Ceph tracker, but there has been no reply there for 3-4 days.
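For reference, this is roughly how the crash reports and the flapping OSDs can be inspected from the Rook toolbox pod; the rook-ceph namespace, the rook-ceph-tools deployment, the rook-ceph-osd-<ID> deployment name and the "osd" container name are the Rook defaults, so they may need adjusting for a different install:

   # open a shell in the Rook toolbox pod (namespace/deployment names are the Rook defaults)
   kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- bash

   # list recorded daemon crashes and dump the full report (incl. backtrace) for one of them
   ceph crash ls
   ceph crash info <crash-id>

   # check overall health and which OSDs are being marked down (flapping)
   ceph status
   ceph health detail
   ceph osd tree

   # OSD log of the previous (crashed) container instance, straight from the OSD pod
   kubectl -n rook-ceph logs deploy/rook-ceph-osd-<ID> -c osd --previous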